What Is SMILES Notation?
SMILES – Simplified Molecular Input Line Entry System – is the most widely used text format for representing chemical structures. Developed by David Weininger in 1988, it encodes molecular graphs as compact ASCII strings. Every atom, bond, ring, branch, and stereocenter in a molecule can be described in a single line of text that is both human-readable and machine-parseable.
If you work in cheminformatics, drug discovery, or computational chemistry, SMILES is unavoidable. It is the default input format for molecular property calculators, docking tools, machine learning models, and chemical databases like PubChem, ChEMBL, and ZINC. When you call SciRouter's chemistry APIs, SMILES is how you specify your molecules.
Basic Syntax Rules
Atoms
Atoms are represented by their chemical symbols. Organic subset atoms (B, C, N, O, P, S, F, Cl, Br, I) can be written without brackets. All other atoms, or atoms with non-default properties (charge, isotope, explicit hydrogen count), must be enclosed in square brackets.
- C – carbon (sp3, implicit hydrogens to fill valence)
- N – nitrogen
- [Fe] – iron (not in organic subset, needs brackets)
- [NH4+] – ammonium ion (charged, explicit H count)
- [13C] – carbon-13 isotope
Bonds
Single bonds between organic subset atoms are implicit – just write the atoms next to each other. Other bond types use explicit symbols:
- CC – ethane (C-C single bond, implicit)
- C=C – ethylene (double bond)
- C#C – acetylene (triple bond)
- c1ccccc1 – benzene (aromatic bonds, lowercase atoms)
Aromatic atoms are written in lowercase. Benzene is c1ccccc1 rather than C1=CC=CC=C1, though both are valid. The lowercase notation is called the aromatic SMILES form and is generally preferred for readability.
Branches
Branches off the main chain are enclosed in parentheses. The branch starts from the atom immediately before the opening parenthesis:
- CC(C)C – isobutane (2-methylpropane)
- CC(=O)O – acetic acid (methyl carbon, then the carbonyl carbon with =O as a branch, then the hydroxyl oxygen)
- CC(C)(C)C – neopentane (three methyl branches on central carbon)
Rings
Rings are denoted by matching digits after the atoms where the ring opens and closes. The digit indicates that a bond connects these two atoms:
- C1CCC1 – cyclobutane (ring opens at first C, closes at last C)
- c1ccccc1 – benzene (aromatic six-membered ring)
- C1CC2CCCCC2CC1 – decalin (two fused rings using digits 1 and 2)
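A ring-closure digit can be reused once its ring has closed, so single digits cover most molecules. When more ring closures are open at the same time than single digits allow, use two-digit labels with the percent notation: %10, %11, etc. This is rare for typical drug molecules but appears in complex natural products.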
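The branch and ring rules above can be sketched as a small parser. This is a minimal illustration rather than a full SMILES reader: it assumes only organic-subset atoms, the bond symbols -, = and #, parentheses, and single-digit ring closures. Bracket atoms, %nn closures, and stereo marks are deliberately not handled, and the name parse_simple_smiles is hypothetical.

```python
import re

# Two-letter halogens are tried first so "Cl" is not read as C followed by l.
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def parse_simple_smiles(smiles):
    """Return (atoms, bonds) for a bracket-free SMILES string.

    atoms is a list of symbols (lowercase means aromatic); bonds is a list
    of (i, j, order) tuples, where order is None for an implicit bond.
    """
    atoms, bonds = [], []
    prev = None         # index of the atom the next atom will bond to
    pending = None      # explicit bond order waiting for its second atom
    branch_stack = []   # atoms to return to when a branch closes
    open_rings = {}     # ring digit -> (atom index, bond order at opening)
    i = 0
    while i < len(smiles):
        match = ATOM.match(smiles, i)
        if match:
            atoms.append(match.group())
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx, pending))
            prev, pending = idx, None
            i = match.end()
            continue
        ch = smiles[i]
        if ch in BOND_ORDER:
            pending = BOND_ORDER[ch]
        elif ch == "(":
            branch_stack.append(prev)      # remember the branch point
        elif ch == ")":
            if not branch_stack:
                raise ValueError(f"unmatched ')' at position {i}")
            prev = branch_stack.pop()      # resume from the branch point
        elif ch.isdigit():
            if prev is None:
                raise ValueError("ring digit before any atom")
            if ch in open_rings:           # second occurrence closes the ring
                start, order = open_rings.pop(ch)
                bonds.append((start, prev, pending or order))
            else:                          # first occurrence opens the ring
                open_rings[ch] = (prev, pending)
            pending = None
        else:
            raise ValueError(f"unsupported character {ch!r} at position {i}")
        i += 1
    if open_rings or branch_stack:
        raise ValueError("unclosed ring or branch")
    return atoms, bonds


atoms, bonds = parse_simple_smiles("CC(=O)O")  # acetic acid
print(atoms)  # ['C', 'C', 'O', 'O']
print(bonds)  # [(0, 1, None), (1, 2, 2), (1, 3, None)]
```

The stack is what makes branches work: the opening parenthesis saves the branch point, and the closing parenthesis restores it so the next atom attaches to the same carbon. Ring digits need only a dictionary of currently open labels.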
Stereochemistry
SMILES can encode both tetrahedral chirality and double bond geometry:
- F/C=C/Cl – trans-1-chloro-2-fluoroethylene (E configuration)
- F/C=C\Cl – cis isomer (Z configuration)
- [C@@H](F)(Cl)Br – (R)-bromochlorofluoromethane (tetrahedral chirality)
- [C@H](F)(Cl)Br – the (S)-enantiomer
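The @ and @@ symbols specify the arrangement of neighbors around a chiral center: looking from the first neighbor listed toward the center, the remaining neighbors appear counterclockwise for @ and clockwise for @@, in the order they are written. The / and \ symbols mark the direction of the single bonds next to a double bond, which fixes whether substituents sit on the same or opposite sides. If stereochemistry is omitted, the SMILES is considered unspecified at that center.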
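Because @ and @@ are mirror images of each other, swapping every occurrence in a string yields the enantiomer. The helper below is a small sketch of that idea; it assumes only the plain @ and @@ marks are present (extended classes such as @TH1 or @SP1 are not handled), and the name mirror_tetrahedral is hypothetical. The / and \ bond marks are left alone, since reflecting a molecule does not change E/Z geometry.

```python
import re


def mirror_tetrahedral(smiles):
    """Swap every @ with @@ (and vice versa) to get the mirror-image SMILES."""
    return re.sub(r"@@?", lambda m: "@" if m.group() == "@@" else "@@", smiles)


print(mirror_tetrahedral("[C@@H](F)(Cl)Br"))  # [C@H](F)(Cl)Br
```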
10 Common Molecules and Their SMILES
Here are ten well-known molecules with their SMILES strings. These make good test inputs for API calls and help build intuition for reading SMILES:
Molecule       SMILES                                                      MW (g/mol)
─────────────────────────────────────────────────────────────────────────────────────
Ethanol        CCO                                                              46.07
Aspirin        CC(=O)Oc1ccccc1C(=O)O                                           180.16
Caffeine       Cn1c(=O)c2c(ncn2C)n(C)c1=O                                      194.19
Glucose        OC[C@H]1OC(O)[C@H](O)[C@@H](O)[C@@H]1O                          180.16
Ibuprofen      CC(C)Cc1ccc(C(C)C(=O)O)cc1                                      206.28
Penicillin G   CC1(C)[C@@H](C(=O)O)N2C(=O)[C@@H](NC(=O)Cc3ccccc3)[C@H]2S1      334.39
Acetaminophen  CC(=O)Nc1ccc(O)cc1                                              151.16
Dopamine       NCCc1ccc(O)c(O)c1                                               153.18
Serotonin      NCCc1c[nH]c2ccc(O)cc12                                          176.22
Cholesterol    CC(C)CCCC(C)C1CCC2C3CC=C4CC(O)CCC4(C)C3CCC12C                   386.65

Reading SMILES: A Worked Example
Let's decode aspirin's SMILES step by step: CC(=O)Oc1ccccc1C(=O)O
- C – methyl carbon
- C(=O) – carbonyl carbon (double bond to oxygen, branch)
- O – ester oxygen connecting to the ring
- c1ccccc1 – benzene ring (aromatic carbons, ring opens and closes at digit 1)
- C(=O) – second carbonyl carbon (carboxylic acid)
- O – hydroxyl oxygen of the acid group
Reading left to right, you trace the molecular graph: a methyl group attached to an ester that bridges to a benzene ring, which carries a carboxylic acid group. This is exactly the structure of acetylsalicylic acid (aspirin).
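Tallying the atoms in the decoded structure, including the eight hydrogens that the SMILES leaves implicit, gives the molecular formula C9H8O4. As a quick sanity check, the molecular weight can be recomputed from that formula. The sketch below uses standard atomic weights rounded to three decimals; the function name molecular_weight is illustrative.

```python
# Standard atomic weights in g/mol, rounded to three decimals.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "N": 14.007, "O": 15.999}


def molecular_weight(formula_counts):
    """Sum atomic weights for a formula given as {element: count}."""
    return sum(ATOMIC_WEIGHT[el] * n for el, n in formula_counts.items())


aspirin = {"C": 9, "H": 8, "O": 4}  # C9H8O4
print(f"{molecular_weight(aspirin):.2f}")  # 180.16
```

The result matches the value listed for aspirin in the table of common molecules.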
Converting Between Molecular Formats via API
SMILES is not the only molecular format. InChI, MOL/SDF, and PDB are also widely used. SciRouter's format conversion endpoint converts between these formats with a single API call:
import requests

API_KEY = "sk-sci-your-api-key"
BASE = "https://api.scirouter.ai/v1"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Convert caffeine from SMILES to InChI and MOL format
response = requests.post(f"{BASE}/chemistry/convert", headers=headers, json={
    "smiles": "Cn1c(=O)c2c(ncn2C)n(C)c1=O",
    "output_formats": ["inchi", "inchi_key", "mol", "canonical_smiles"]
})
response.raise_for_status()  # stop on HTTP errors instead of parsing an error body
result = response.json()

print(f"Canonical SMILES: {result['canonical_smiles']}")
print(f"InChI: {result['inchi']}")
print(f"InChI Key: {result['inchi_key']}")
# A MOL block is a multi-line connection table (atoms, bonds, coordinates) used by many tools
print(f"MOL block lines: {len(result['mol'].splitlines())}")

This is useful when you have a SMILES string and need to feed it into a tool that expects InChI (like a chemical database), or when you receive an SDF file from a docking run and need to extract SMILES for further processing.
Calculating Molecular Properties from SMILES
One of the most common tasks in cheminformatics is computing physicochemical properties from a molecular structure. Traditionally this required installing RDKit locally – a C++ library with Python bindings that can be awkward to set up on some systems. SciRouter's API gives you the same calculations with zero installation:
# Calculate properties for ibuprofen
response = requests.post(f"{BASE}/chemistry/properties", headers=headers, json={
    "smiles": "CC(C)Cc1ccc(C(C)C(=O)O)cc1"
})
props = response.json()

print(f"Molecular weight: {props['molecular_weight']:.2f} g/mol")
print(f"LogP: {props['logp']:.2f}")
print(f"TPSA: {props['tpsa']:.1f} Ų")
print(f"H-bond donors: {props['hbd']}")
print(f"H-bond acceptors: {props['hba']}")
print(f"Rotatable bonds: {props['rotatable_bonds']}")
print(f"Lipinski violations: {props['lipinski_violations']}")

Batch Property Calculation
For screening a compound library, send multiple SMILES in a single request:
# Calculate properties for multiple molecules at once
compounds = {
    "aspirin": "CC(=O)Oc1ccccc1C(=O)O",
    "caffeine": "Cn1c(=O)c2c(ncn2C)n(C)c1=O",
    "ibuprofen": "CC(C)Cc1ccc(C(C)C(=O)O)cc1",
    "acetaminophen": "CC(=O)Nc1ccc(O)cc1",
    "dopamine": "NCCc1ccc(O)c(O)c1",
}

response = requests.post(f"{BASE}/chemistry/properties", headers=headers, json={
    "smiles_list": list(compounds.values()),
    "labels": list(compounds.keys())
})
results = response.json()["results"]

# Print one aligned row per compound
print(f"{'Name':<16} {'MW':>8} {'LogP':>6} {'TPSA':>6} {'Lipinski':>9}")
print("-" * 50)
for r in results:
    print(f"{r['label']:<16} {r['molecular_weight']:>8.1f} "
          f"{r['logp']:>6.2f} {r['tpsa']:>6.1f} "
          f"{r['lipinski_violations']:>9}")

SMILES in Drug Discovery Workflows
SMILES strings are the connective tissue that links different stages of a computational drug discovery pipeline. Here is how they flow through a typical workflow:
- Compound enumeration: Combinatorial chemistry tools generate SMILES for virtual libraries of candidate molecules.
- Property filtering: SMILES feed into property calculators to filter by drug-likeness, solubility, and other physicochemical criteria.
- Similarity searching: Chemical fingerprints derived from SMILES enable similarity searches against known active compounds.
- Molecular docking: Docking tools like DiffDock accept SMILES as ligand input and predict binding poses.
- Machine learning: SMILES are tokenized as input to molecular property prediction models, generative chemistry models, and ADMET predictors.
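The machine learning item above mentions tokenization. One common approach is a regular expression that keeps multi-character units (bracket atoms, Cl, Br, %nn ring closures) together as single tokens. The pattern below is an illustrative sketch and not the vocabulary of any particular model; TOKEN and tokenize are hypothetical names.

```python
import re

TOKEN = re.compile(
    r"\[[^\]]+\]"            # bracket atom, e.g. [nH], [C@@H], [NH4+]
    r"|Br|Cl"                # two-letter organic-subset atoms
    r"|%\d{2}"               # two-digit ring closure, e.g. %10
    r"|[BCNOPSFIbcnops]"     # one-letter atoms (lowercase means aromatic)
    r"|[-=#$:/\\().]"        # bonds, branch parentheses, disconnection dot
    r"|\d"                   # single-digit ring closure
)


def tokenize(smiles):
    """Split a SMILES string into tokens, rejecting unrecognized characters."""
    tokens = TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


print(tokenize("[C@@H](F)(Cl)Br"))
# ['[C@@H]', '(', 'F', ')', '(', 'Cl', ')', 'Br']
```

Checking that the joined tokens reproduce the input is a cheap way to catch characters the pattern silently skipped.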
For a hands-on example of calculating molecular properties without installing RDKit, see our molecular properties from SMILES tutorial.
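The property filtering step in the workflow above often comes down to Lipinski's rule of five: molecular weight at most 500 g/mol, LogP at most 5, at most 5 hydrogen-bond donors, and at most 10 hydrogen-bond acceptors, with no more than one violation tolerated. The sketch below applies that rule to dictionaries shaped like the property results shown earlier. The two records are made-up values chosen only to exercise the thresholds, not API output, and whether SciRouter counts violations in exactly this way is an assumption.

```python
def lipinski_violations(props):
    """Count rule-of-five violations for one property dictionary."""
    checks = [
        props["molecular_weight"] > 500,  # heavier than 500 g/mol
        props["logp"] > 5,                # too lipophilic
        props["hbd"] > 5,                 # too many H-bond donors
        props["hba"] > 10,                # too many H-bond acceptors
    ]
    return sum(checks)


# Hypothetical records, not real measurements or API responses.
records = [
    {"label": "hypothetical_small", "molecular_weight": 250.0, "logp": 2.0, "hbd": 1, "hba": 3},
    {"label": "hypothetical_large", "molecular_weight": 720.0, "logp": 6.5, "hbd": 2, "hba": 12},
]

passing = [r["label"] for r in records if lipinski_violations(r) <= 1]
print(passing)  # ['hypothetical_small']
```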
Common SMILES Pitfalls
Even experienced chemists make mistakes with SMILES. Here are the most frequent issues:
- Mismatched ring digits: Every ring-opening digit must have a matching ring-closing digit. C1CCC is invalid because digit 1 never closes.
- Unbalanced parentheses: Every opening parenthesis needs a closing one. CC(=O is invalid.
- Valence errors: Carbon has valence 4. C(C)(C)(C)(C)C puts five bonds on the first carbon and is invalid. Some tools silently accept valence violations, which leads to garbage results downstream.
- Aromatic vs. non-aromatic: Writing c1ccc1 (four-membered aromatic ring) is chemically invalid because cyclobutadiene is not aromatic. Use uppercase atoms with explicit bonds for non-aromatic rings: C1=CC=C1 for cyclobutadiene, C1CCC1 for cyclobutane.
- Missing stereochemistry: If your application requires distinguishing enantiomers, you must include @/@@ notation. Omitting it leaves the stereocenter unspecified, so the SMILES could stand for any stereoisomer or a mixture of them.
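The first two pitfalls are purely syntactic and can be caught before a string is sent anywhere. The following is a minimal sketch of such a pre-check; the name check_smiles_syntax is hypothetical, and the function looks only at parentheses and ring-closure labels. Valence and aromaticity errors need a real chemistry toolkit and are out of scope here.

```python
def check_smiles_syntax(smiles):
    """Return a list of problems with parentheses and ring-closure labels."""
    problems = []
    depth = 0            # current parenthesis nesting level
    open_rings = set()   # ring labels seen an odd number of times so far
    in_bracket = False

    def toggle(label):
        # First sighting opens the ring, second closes it.
        if label in open_rings:
            open_rings.remove(label)
        else:
            open_rings.add(label)

    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            pass  # digits inside brackets are isotopes, H counts, or charges
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append(f"unmatched ')' at position {i}")
                depth = 0
        elif ch == "%":
            toggle(smiles[i:i + 3])  # two-digit label such as %10
            i += 2
        elif ch.isdigit():
            toggle(ch)
        i += 1

    if depth > 0:
        problems.append(f"{depth} unclosed '('")
    for label in sorted(open_rings):
        problems.append(f"ring label {label} never closes")
    return problems


for s in ["C1CCC", "CC(=O", "CC(=O)Oc1ccccc1C(=O)O"]:
    print(s, check_smiles_syntax(s))
```

An empty list means the string passed these two checks, not that it is a chemically valid molecule.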
Canonicalization: Getting a Unique SMILES
Because the same molecule can be written as many different SMILES strings, databases and search tools use canonical SMILES – a single standardized representation. You can get canonical SMILES through SciRouter's format conversion endpoint:
# These are all valid SMILES for ethanol
ethanol_variants = ["CCO", "OCC", "C(O)C", "[CH3][CH2][OH]"]

for smiles in ethanol_variants:
    resp = requests.post(f"{BASE}/chemistry/convert", headers=headers, json={
        "smiles": smiles,
        "output_formats": ["canonical_smiles"]
    }).json()
    print(f"{smiles:20s} -> {resp['canonical_smiles']}")
# All four should produce the same canonical SMILES: CCO

Canonicalization is essential when comparing molecules, deduplicating compound libraries, or looking up molecules in a database. Always canonicalize before comparing SMILES strings: two different strings can describe the same molecule, so a mismatch between non-canonical SMILES does not mean the molecules differ. Canonical forms also depend on the toolkit that generated them, so compare only strings produced by the same canonicalizer.
Beyond SMILES: Other Molecular Representations
SMILES is the most popular but not the only molecular text format. Here is how it compares:
- InChI: Unique canonical identifier. Better for databases and registration. Harder to read. Use SciRouter's format conversion to translate between SMILES and InChI.
- SMARTS: A pattern language that extends SMILES syntax for substructure searching. Supports wildcards, atom lists, and recursive patterns. Used for reaction mapping and toxicophore screening.
- MOL/SDF: Connection table format with optional 3D coordinates. Used by docking tools and crystal structure databases. More verbose but encodes geometry.
- SELFIES: Self-referencing embedded strings. A newer format designed for machine learning that guarantees every string maps to a valid molecule. Increasingly used in generative chemistry.
Next Steps
Now that you understand SMILES notation, you can use it as input to any of SciRouter's chemistry tools:
- Calculate physicochemical properties with the molecular properties endpoint
- Convert between formats using format conversion
- Dock molecules against protein targets with DiffDock
- Learn how to calculate molecular properties without installing RDKit
All endpoints accept standard SMILES as input. For best results, use canonical SMILES and validate your structures before submitting to computationally intensive tools like docking. Sign up for a free API key to start exploring.