What Is SMILES and Why Should You Learn It?
SMILES – Simplified Molecular-Input Line-Entry System – is a way to write chemical structures as plain text. Instead of drawing hexagons and lines on paper, you type a string like c1ccccc1 for benzene or CC(=O)O for acetic acid. That string encodes every atom, bond, and ring in the molecule and, when you choose to write it, the stereochemistry as well.
SMILES was introduced by David Weininger in 1988, and more than three decades later it remains the lingua franca of computational chemistry. The major cheminformatics databases (ChEMBL, PubChem, DrugBank, ZINC) all provide SMILES, and most molecular property prediction tools, generative chemistry models, and docking pipelines accept it as input. If you want to work with molecules computationally – whether you are a researcher, a developer building biotech tools, or an AI engineer training models on chemical data – you need to read and write SMILES.
This guide will take you from zero to fluency. We will cover every element of the SMILES language, work through 10 practice molecules of increasing complexity, compare SMILES to other notations, address common mistakes, and show you how to validate and convert SMILES programmatically using the SciRouter API.
The Basics: Atoms and Bonds
Atoms
In SMILES, atoms are represented by their chemical symbols. Organic subset atoms (B, C, N, O, P, S, F, Cl, Br, I) can be written without brackets. All other atoms must be enclosed in square brackets. Hydrogen atoms are usually implicit – the system calculates them automatically based on valence rules.
- C – carbon (sp3, with implicit hydrogens: C alone is CH4, methane)
- N – nitrogen
- O – oxygen
- [Fe] – iron (needs brackets because it is not in the organic subset)
- [Cu+2] – copper(II) ion (charge specified inside brackets)
Lowercase letters indicate aromatic atoms: c is an aromatic carbon, n is an aromatic nitrogen. This distinction matters for ring systems, which we will cover shortly.
Bonds
Bonds between atoms are specified by symbols placed between the atom symbols. Single bonds are usually implicit (just write the atoms next to each other), but can be made explicit with a hyphen.
- - (or nothing at all) – single bond: CC is ethane (C-C)
- = – double bond: C=C is ethylene
- # – triple bond: C#C is acetylene
- : – aromatic bond (usually implied by lowercase atoms)
Let us build our first real molecule. Ethanol (drinking alcohol) has two carbons, one oxygen, and a single bond between each: CCO. That is it. The hydrogens are implicit: the first carbon has 3 hydrogens (CH3), the second has 2 (CH2), and the oxygen has 1 (OH). So CCO represents CH3-CH2-OH.
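The hydrogen-counting rule used for ethanol can be sketched in a few lines of Python. This is a minimal illustration, not a real SMILES parser: it assumes an unbranched, ring-free string made only of organic-subset atoms and the bond symbols -, = and #, and the function name and valence table are choices made for this example.

```python
import re

# Normal valences of the organic subset; an atom takes the lowest one that fits.
NORMAL_VALENCES = {
    "B": (3,), "C": (4,), "N": (3, 5), "O": (2,), "P": (3, 5),
    "S": (2, 4, 6), "F": (1,), "Cl": (1,), "Br": (1,), "I": (1,),
}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def implicit_hydrogens(smiles: str) -> list[tuple[str, int]]:
    """Return (symbol, implicit H count) for each atom of a linear SMILES."""
    tokens = re.findall(r"Cl|Br|[BCNOPSFI]|[-=#]", smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"outside the scope of this sketch: {smiles!r}")
    atoms: list[str] = []
    used: list[int] = []   # total bond order already used by each atom
    pending = 1            # an unwritten bond is a single bond
    for tok in tokens:
        if tok in BOND_ORDER:
            pending = BOND_ORDER[tok]
            continue
        atoms.append(tok)
        used.append(0)
        if len(atoms) > 1:           # bond the new atom to the previous one
            used[-1] += pending
            used[-2] += pending
        pending = 1
    result = []
    for symbol, bonds in zip(atoms, used):
        valence = next((v for v in NORMAL_VALENCES[symbol] if v >= bonds), None)
        if valence is None:
            raise ValueError(f"{symbol} has too many bonds ({bonds})")
        result.append((symbol, valence - bonds))
    return result


print(implicit_hydrogens("CCO"))   # [('C', 3), ('C', 2), ('O', 1)]
print(implicit_hydrogens("C=C"))   # [('C', 2), ('C', 2)]
```

The first line of output matches the CH3-CH2-OH reading of CCO given above.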
Branches: Parentheses for Side Chains
Linear molecules are easy, but most molecules have branches. In SMILES, a branch is enclosed in parentheses. The branch starts from the atom immediately before the opening parenthesis.
Consider isobutane (2-methylpropane). The main chain is three carbons: CCC. The methyl branch comes off the middle carbon: CC(C)C. The parentheses around C mean "there is a branch (a methyl group) attached to this carbon."
You can nest branches and have multiple branches on the same atom:
- CC(C)C – isobutane (one branch off the second carbon)
- CC(C)(C)C – neopentane (two branches off the second carbon)
- CC(=O)O – acetic acid (the second carbon has a =O branch and continues to -OH)
- CC(=O)Oc1ccccc1C(=O)O – aspirin (branches and rings combined)
The key rule: after closing a parenthesis, you return to the atom where the branch started. So in CC(=O)O, the chain goes C-C, then branches to =O, returns to the second C, and continues to -O (then implicit -H on that oxygen).
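The "return to the atom where the branch started" rule maps naturally onto a stack. The sketch below is illustrative only: it assumes a ring-free SMILES built from organic-subset atoms, and the function name and return shape are invented for this example.

```python
import re


def parse_bonds(smiles: str) -> tuple[list[str], list[tuple[int, int, int]]]:
    """Parse a ring-free SMILES into atoms and (atom_i, atom_j, order) bonds."""
    order_of = {"-": 1, "=": 2, "#": 3}
    tokens = re.findall(r"Cl|Br|[BCNOPSFI]|[-=#()]", smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"outside the scope of this sketch: {smiles!r}")
    atoms: list[str] = []
    bonds: list[tuple[int, int, int]] = []
    stack: list[int] = []   # branch points to return to at each ")"
    prev = None             # atom the next atom will be bonded to
    pending = 1             # an unwritten bond is a single bond
    for tok in tokens:
        if tok == "(":
            stack.append(prev)       # remember where the branch starts
        elif tok == ")":
            prev = stack.pop()       # go back to the branch point
        elif tok in order_of:
            pending = order_of[tok]
        else:
            atoms.append(tok)
            index = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, index, pending))
            prev = index
            pending = 1
    return atoms, bonds


atoms, bonds = parse_bonds("CC(=O)O")
print(atoms)   # ['C', 'C', 'O', 'O']
print(bonds)   # [(0, 1, 1), (1, 2, 2), (1, 3, 1)]
```

Both oxygens end up bonded to atom 1, the second carbon, which is the behaviour described for acetic acid.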
Rings: Numbered Closure Digits
Rings are represented by ring closure digits. You assign a number to an atom where the ring opens, then use the same number at the atom where the ring closes.
Cyclohexane is six carbons in a ring: C1CCCCC1. The first carbon is labeled 1, and when we reach the last carbon, we close the ring by writing 1 again. The system draws a bond between the two atoms that share the same digit.
Benzene uses aromatic notation: c1ccccc1. The lowercase c atoms indicate aromaticity, and the ring closure 1 completes the six-membered aromatic ring. You can also write benzene in Kekulé form (with alternating single and double bonds): C1=CC=CC=C1.
For molecules with multiple rings, use different digits for each ring:
- c1ccc2ccccc2c1 – naphthalene (two fused aromatic rings, closures 1 and 2)
- C1CC1 – cyclopropane (three-membered ring)
- C1CCOC1 – tetrahydrofuran (five-membered ring with oxygen)
- c1ccncc1 – pyridine (six-membered aromatic ring with nitrogen)
If you run out of single digits (rare, but it happens in complex natural products), use %10, %11, etc. for ring closure numbers above 9.
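Ring closure digits are easy to pair up mechanically: the first time a label appears it opens a ring, the second time it closes it. The following sketch (function name and tokenizing regex are assumptions for illustration, and bond symbols on closures are ignored) returns the pairs of atom indices joined by each closure, including two-digit %nn labels.

```python
import re

TOKEN = re.compile(r"\[[^\]]+\]|Cl|Br|[BCNOPSFI]|[bcnops]|%\d{2}|\d|.")


def ring_closure_bonds(smiles: str) -> list[tuple[int, int]]:
    """Return (opening_atom, closing_atom) index pairs, in closing order."""
    open_rings: dict[int, int] = {}   # ring label -> atom that opened it
    pairs: list[tuple[int, int]] = []
    atom_index = -1
    for tok in TOKEN.findall(smiles):
        if tok.isdigit() or (len(tok) == 3 and tok[0] == "%"):
            label = int(tok.lstrip("%"))
            if label in open_rings:
                pairs.append((open_rings.pop(label), atom_index))
            else:
                open_rings[label] = atom_index
        elif tok[0] == "[" or tok[0].isalpha():
            atom_index += 1           # bracket atom or bare atom symbol
    if open_rings:
        raise ValueError(f"unclosed ring labels: {sorted(open_rings)}")
    return pairs


print(ring_closure_bonds("C1CCCCC1"))         # [(0, 5)]
print(ring_closure_bonds("c1ccc2ccccc2c1"))   # [(3, 8), (0, 9)]
```

For naphthalene, the pair (3, 8) is the bond shared by the two fused rings and (0, 9) closes the outer perimeter.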
Stereochemistry: Specifying 3D Arrangement
Cis/Trans (E/Z) Isomerism
Double bonds can have cis or trans geometry. SMILES uses / and \ to specify the arrangement of atoms around a double bond:
- F/C=C/F – trans-1,2-difluoroethylene (F atoms on opposite sides)
- F/C=C\F – cis-1,2-difluoroethylene (F atoms on the same side)
The slash and backslash describe the direction of bonds relative to the double bond. Two slashes in the same direction (/C=C/) mean trans. A slash and a backslash (/C=C\) mean cis. This notation is intuitive once you visualize the slashes as upward and downward bonds.
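The same-direction versus opposite-direction rule can be written as a tiny check. This is a deliberately narrow sketch: it only recognizes the simple linear pattern X/C=C/Y used in the examples above (no branches or ring atoms on the double bond), and the function name is made up for this illustration.

```python
import re


def double_bond_geometry(smiles: str) -> str:
    """Classify a simple X/C=C/Y pattern as 'trans', 'cis', or 'unspecified'."""
    match = re.search(r"([/\\])C=C([/\\])", smiles)
    if match is None:
        return "unspecified"
    left, right = match.groups()
    # Same direction on both sides means opposite sides of the double bond.
    return "trans" if left == right else "cis"


print(double_bond_geometry(r"F/C=C/F"))   # trans
print(double_bond_geometry(r"F/C=C\F"))   # cis
print(double_bond_geometry("FC=CF"))      # unspecified
```

Raw strings (the r prefix) keep Python from treating the backslash as an escape character, which is a common source of silently corrupted SMILES in source code.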
Chirality (R/S Configuration)
Tetrahedral stereocenters are specified with @ and @@ inside square brackets. The @ symbol means the remaining substituents are arranged counterclockwise when viewed from the first neighbor; @@ means clockwise.
- [C@@H](O)(F)Cl – specifies a particular enantiomer of chlorofluoromethanol
- [C@H](O)(F)Cl – the opposite enantiomer
- C[C@@H](O)CC – (R)-2-butanol
Chirality in SMILES is critical for drug molecules, where two enantiomers can have very different biological activities. Thalidomide is the notorious example: the (R)-enantiomer is a sedative, while the (S)-enantiomer is teratogenic (and because the two interconvert in the body, dosing a single enantiomer does not remove the risk). The SMILES for (S)-thalidomide is O=C1CC[C@H](N2C(=O)c3ccccc3C2=O)C(=O)N1.
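Because @ and @@ are opposites, swapping every one of them in a string gives the SMILES of the mirror-image molecule. A minimal sketch, assuming the string uses only the @ and @@ tags (not the rarer extended forms such as @TH1 or @SP1); the function name is an invention for this example.

```python
def mirror_smiles(smiles: str) -> str:
    """Swap every @ and @@ tag, giving the SMILES of the mirror image."""
    placeholder = "\0"   # a character that never occurs in SMILES
    return (
        smiles.replace("@@", placeholder)
        .replace("@", "@@")
        .replace(placeholder, "@")
    )


s_thalidomide = "O=C1CC[C@H](N2C(=O)c3ccccc3C2=O)C(=O)N1"
print(mirror_smiles(s_thalidomide))
# O=C1CC[C@@H](N2C(=O)c3ccccc3C2=O)C(=O)N1
print(mirror_smiles("C[C@@H](O)CC"))   # C[C@H](O)CC
```

Thalidomide has a single stereocenter, so the first output is the (R)-enantiomer. The / and \ marks are left alone on purpose, since double-bond geometry does not change under reflection.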
Practice: 10 Molecules from Simple to Complex
The best way to learn SMILES is to practice. Here are 10 real molecules with increasing complexity. For each one, try to understand the SMILES before reading the explanation.
1. Methane – C
The simplest possible SMILES. One carbon with four implicit hydrogens.
2. Water – O
One oxygen with two implicit hydrogens. Yes, water has a valid SMILES.
3. Ethanol – CCO
Two carbons bonded together, then an oxygen. Implicit hydrogens give CH3-CH2-OH.
4. Acetic acid – CC(=O)O
A carbon (CH3) bonded to another carbon, which has a double-bond branch to oxygen (=O, the carbonyl) and continues to a hydroxyl oxygen (OH). This is vinegar.
5. Benzene – c1ccccc1
Six aromatic carbons in a ring. The ring closure digit 1 connects the first and last carbon. Each aromatic carbon has one implicit hydrogen.
6. Aspirin – CC(=O)Oc1ccccc1C(=O)O
An acetyl group (CC(=O)) joined through an ester oxygen (O) to a benzene ring (c1ccccc1), which also bears a carboxylic acid group (C(=O)O) on the adjacent ring carbon.
7. Caffeine – Cn1c(=O)c2c(ncn2C)n(C)c1=O
A purine-derived xanthine skeleton with three N-methyl groups. This molecule has two fused rings (a six-membered pyrimidinedione and a five-membered imidazole), two carbonyl oxygens, and three methyl groups on nitrogens. Ring closures 1 and 2 handle the fused ring system.
8. Ibuprofen – CC(C)Cc1ccc(cc1)C(C)C(=O)O
An isobutyl group (CC(C)C) attached to a para-substituted benzene ring, with a propionic acid group (C(C)C(=O)O) on the other side. The 1 ring closures define the benzene ring, and the substitution pattern gives the 1,4- (para) arrangement.
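9. Sotorasib – C=CC(=O)N1CCN(c2nc(=O)n(-c3c(C)ccnc3C(C)C)c4nc(-c5c(O)cccc5F)c(F)cc24)C(C)C1
The first FDA-approved KRAS G12C inhibitor (Lumakras). This complex SMILES encodes an acrylamide warhead (C=CC(=O)) on a methyl-substituted piperazine ring (N1CCN(...)C(C)C1), a fused pyridopyrimidinone bicyclic core (ring closures 2 and 4), an isopropyl-methylpyridine substituent (ring closure 3), and a fluorophenol (ring closure 5). The explicit - before c3 and c5 marks a single bond between two aromatic rings, and the piperazine stereocenter is omitted here for clarity. Four ring systems in one molecule.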
10. Taxol (Paclitaxel) – CC1=C2C(C(=O)C3(C(CC4C(C3C(C(C2(C)C)(CC1OC(=O)C(C(c5ccccc5)NC(=O)c6ccccc6)O)O)OC(=O)c7ccccc7)(CO4)OC(=O)C)O)C)OC(=O)C
The famous anticancer natural product from the Pacific yew tree. This monster SMILES contains multiple fused rings (including a four-membered oxetane, closure 4), stereocenters (omitted here for clarity), ester groups, an amide bond, and three phenyl rings. At well over 100 characters even without stereochemistry, it pushes the limits of what a human can reasonably parse – which is why we have computational tools.
SMILES vs InChI vs IUPAC: Which Notation When?
SMILES is not the only way to represent molecules as text. Two other systems are widely used, and each has distinct strengths.
InChI (International Chemical Identifier)
InChI was developed by IUPAC and NIST as a standardized, canonical chemical identifier. Unlike SMILES, InChI always produces a single unique representation for each molecule (canonical by definition). An InChI string looks like this: InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3 (ethanol).
InChI is better than SMILES for database deduplication and exact-match searching because canonicalization is built-in. Its companion, InChIKey, is a fixed-length 27-character hash that enables fast database lookups. However, InChI is nearly impossible for humans to read and is not commonly used as input to computational tools.
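Although InChI is hard to read at a glance, its layered structure is simple to take apart: layers are separated by slashes, and every layer after the formula starts with a one-letter prefix. The sketch below splits the ethanol example into its layers. The function name is hypothetical, and the sketch assumes a formula layer is present and that each prefix occurs once, which holds for simple molecules like this one but not for every InChI.

```python
def split_inchi(inchi: str) -> dict[str, str]:
    """Split an InChI string into version, formula, and prefixed layers."""
    if not inchi.startswith("InChI="):
        raise ValueError("not an InChI string")
    version, formula, *layers = inchi[len("InChI="):].split("/")
    parsed = {"version": version, "formula": formula}
    for layer in layers:
        parsed[layer[0]] = layer[1:]   # first character is the layer prefix
    return parsed


ethanol = split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol)
# {'version': '1S', 'formula': 'C2H6O', 'c': '1-2-3', 'h': '3H,2H2,1H3'}
```

Here c is the connectivity layer (atom 1 bonded to 2, 2 bonded to 3) and h is the hydrogen layer (one H on atom 3, two on atom 2, three on atom 1).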
IUPAC Nomenclature
IUPAC names are the systematic names used by chemists: "2-acetoxybenzoic acid" for aspirin, "(RS)-2-(4-(2-methylpropyl)phenyl)propanoic acid" for ibuprofen. IUPAC names are human-readable and unambiguous, but they are verbose, hard to parse computationally, and not standardized across all edge cases.
When to Use Each
- SMILES – API inputs, cheminformatics tools, generative models, database storage, molecular editors. The default choice for almost all computational chemistry workflows.
- InChI / InChIKey – Database deduplication, exact-match queries, cross-database linking (PubChem, ChEMBL, UniChem). Use InChIKey when you need a fixed-length unique identifier.
- IUPAC – Human communication, publications, regulatory filings, patent claims. Never use IUPAC as computational input.
SciRouter's format conversion endpoint converts between all three notations, so you can accept any format and standardize to canonical SMILES for downstream processing.
Common SMILES Mistakes and How to Fix Them
Mistake 1: Mismatched Ring Closure Digits
Every ring opening must have a corresponding closure. Writing C1CCCCC (without closing 1) is invalid. Fix: always close rings with the matching digit: C1CCCCC1.
Mistake 2: Unbalanced Parentheses
Branches must have balanced parentheses. CC(=O is invalid because the parenthesis is never closed. Fix: CC(=O)O. Use a text editor with bracket matching to catch this.
Mistake 3: Wrong Valence
Carbon normally has 4 bonds, nitrogen 3, oxygen 2. Writing C(C)(C)(C)(C)C gives carbon 5 bonds, which is invalid for standard organic chemistry. Fix: check that the total bond order plus implicit hydrogens equals the atom's normal valence.
Mistake 4: Confusing Aromatic and Aliphatic Atoms
Lowercase c means aromatic carbon, uppercase C means aliphatic carbon. Writing c1ccccc1 is benzene (aromatic, correct). Writing C1CCCCC1 is cyclohexane (aliphatic, correct). Mixing them in one ring (c1CCCCc1) usually fails to parse, because the toolkit cannot assign alternating single and double bonds (Kekulize) to a ring that is only partly aromatic.
Mistake 5: Forgetting Charges
Charged atoms must be in brackets with the charge specified. The ammonium ion is [NH4+], not N+; a protonated amine such as methylammonium is C[NH3+]. A carboxylate oxygen is [O-], not O-, so acetate is CC(=O)[O-]. If your SMILES string behaves unexpectedly, check whether ionic species have proper bracket notation.
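Mistakes 1 and 2 are purely syntactic, so they can be caught before a string ever reaches a chemistry toolkit. The following is a rough pre-check, not a validator: it only looks at parentheses, square brackets, and ring closure labels, and says nothing about valence or aromaticity. The function name and messages are invented for this sketch.

```python
def smiles_syntax_problems(smiles: str) -> list[str]:
    """Return purely syntactic problems; an empty list means none were found."""
    problems = []
    depth = 0            # current parenthesis nesting
    in_bracket = False   # True while inside a [...] atom
    open_rings = set()   # ring labels seen an odd number of times
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if in_bracket:
            in_bracket = ch != "]"   # digits in brackets are not ring labels
        elif ch == "[":
            in_bracket = True
        elif ch == "]":
            problems.append("']' without matching '['")
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append("')' without matching '('")
                depth = 0
        elif ch == "%":              # two-digit ring label such as %10
            open_rings ^= {smiles[i + 1:i + 3]}
            i += 2
        elif ch.isdigit():
            open_rings ^= {ch}       # toggle: opens on first use, closes on second
        i += 1
    if in_bracket:
        problems.append("unclosed '['")
    if depth:
        problems.append("unclosed '('")
    if open_rings:
        problems.append(f"unclosed ring closure(s): {sorted(open_rings)}")
    return problems


for smi in ["CC(=O)O", "C1CCCCC", "CC(=O", "CC(C)(C)(C)C"]:
    print(smi, smiles_syntax_problems(smi))
```

The last string, the pentavalent carbon from Mistake 3, passes this check with no problems reported, which shows why a syntactic pre-check complements rather than replaces a real parser.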
Validating and Converting SMILES with Code
Before sending a SMILES string to any prediction tool, validate it. Invalid SMILES will cause errors or, worse, silently produce incorrect results. Here is how to validate and convert SMILES using the SciRouter API and Python.
import scirouter
client = scirouter.SciRouter(api_key="sk-sci-YOUR_KEY")
# Validate and canonicalize a SMILES string
smiles = "C(C)(=O)Oc1ccccc1C(=O)O"  # Aspirin (non-canonical form)
result = client.chemistry.convert(
    smiles=smiles,
    output_format="canonical_smiles"
)
print(f"Input: {smiles}")
print(f"Canonical: {result.canonical_smiles}")
# Output: CC(=O)Oc1ccccc1C(=O)O
# Convert to InChI
inchi_result = client.chemistry.convert(
    smiles=smiles,
    output_format="inchi"
)
print(f"InChI: {inchi_result.inchi}")
print(f"InChIKey: {inchi_result.inchi_key}")
# Batch validate a list of SMILES
test_smiles = [
    "CCO",                    # Valid: ethanol
    "CC(=O)Oc1ccccc1C(=O)O",  # Valid: aspirin
    "C1CCCCC",                # Invalid: unclosed ring
    "CC(C)(C)(C)C",           # Invalid: pentavalent carbon
    "c1ccccc1",               # Valid: benzene
]
for smi in test_smiles:
    try:
        result = client.chemistry.convert(smiles=smi, output_format="canonical_smiles")
        print(f"VALID: {smi} -> {result.canonical_smiles}")
    except Exception as e:
        print(f"INVALID: {smi} -> {e}")
For local validation without API calls, RDKit is the standard library:
from rdkit import Chem
def validate_smiles(smiles: str) -> bool:
    """Return True if SMILES is valid, False otherwise."""
    mol = Chem.MolFromSmiles(smiles)
    return mol is not None
def canonicalize(smiles: str) -> str:
    """Return canonical SMILES or raise ValueError."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    return Chem.MolToSmiles(mol)
# Examples
print(validate_smiles("CCO"))      # True
print(validate_smiles("C1CCCCC"))  # False
print(canonicalize("C(C)(=O)Oc1ccccc1C(=O)O"))  # CC(=O)Oc1ccccc1C(=O)O
Programmatic SMILES with the SciRouter API
Once you can read and write SMILES, the real power comes from using them programmatically. Every SciRouter chemistry endpoint accepts SMILES as input. Here are the most common operations.
Calculate Molecular Properties from SMILES
import scirouter
client = scirouter.SciRouter(api_key="sk-sci-YOUR_KEY")
# Calculate properties for caffeine
result = client.chemistry.properties(smiles="Cn1c(=O)c2c(ncn2C)n(C)c1=O")
print(f"Molecular weight: {result.molecular_weight:.2f} Da")
print(f"LogP: {result.logp:.2f}")
print(f"H-bond donors: {result.h_bond_donors}")
print(f"H-bond acceptors: {result.h_bond_acceptors}")
print(f"Rotatable bonds: {result.rotatable_bonds}")
print(f"TPSA: {result.tpsa:.1f} A^2")
print(f"Lipinski violations: {result.lipinski_violations}")
Compare Molecules by Similarity
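The similarity call in this section reports a Tanimoto coefficient. As a rough illustration of what that number measures, here is the formula applied to two small feature sets. The sets are hand-picked substructure labels chosen for this example, not real fingerprint bits, so the value printed is not what any fingerprint-based tool would return for these two drugs.

```python
def tanimoto(features_a: set, features_b: set) -> float:
    """Tanimoto (Jaccard) coefficient: shared features / all features."""
    union = features_a | features_b
    if not union:
        return 1.0   # convention chosen here for two empty sets
    return len(features_a & features_b) / len(union)


# Toy feature sets (assumed for illustration only).
ibuprofen_like = {"benzene", "carboxylic_acid", "alpha_methyl", "isobutyl"}
naproxen_like = {"benzene", "carboxylic_acid", "alpha_methyl",
                 "naphthalene", "methoxy"}

print(tanimoto(ibuprofen_like, naproxen_like))   # 0.5 (3 shared / 6 total)
```

Real tools compute the same ratio over the on-bits of molecular fingerprints, so the result depends on which fingerprint type and settings are used.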
# Compare ibuprofen to naproxen (both are NSAIDs)
similarity = client.chemistry.similarity(
    smiles_a="CC(C)Cc1ccc(cc1)C(C)C(=O)O",    # Ibuprofen
    smiles_b="COc1ccc2cc(ccc2c1)C(C)C(=O)O",  # Naproxen
)
print(f"Tanimoto similarity: {similarity.tanimoto:.3f}")
# Expect a moderate score, not one near 1.0: same pharmacological class, different scaffolds (the exact value depends on the fingerprint used)
Convert Between Formats
# SMILES to InChI to SMILES round-trip
aspirin_smiles = "CC(=O)Oc1ccccc1C(=O)O"
to_inchi = client.chemistry.convert(smiles=aspirin_smiles, output_format="inchi")
print(f"InChI: {to_inchi.inchi}")
# InChI back to SMILES
to_smiles = client.chemistry.convert(inchi=to_inchi.inchi, output_format="canonical_smiles")
print(f"Back to SMILES: {to_smiles.canonical_smiles}")
SMILES in Machine Learning and Generative Chemistry
SMILES strings are a primary input representation for molecular machine learning models. Transformer-based models like ChemBERTa tokenize SMILES into subword units and learn chemical representations from millions of molecules. Generative models like REINVENT4 generate new molecules by sampling SMILES strings token by token, relying on the grammar of the notation as learned from training data.
This has a practical implication: if you are training or fine-tuning molecular ML models, your SMILES must be clean and canonical. Non-canonical SMILES introduce unnecessary variation that makes training harder. Invalid SMILES waste training samples. Always canonicalize your SMILES dataset before training, and validate generated SMILES before evaluating them.
SMILES augmentation – generating multiple valid but non-canonical SMILES for the same molecule – is sometimes used as a data augmentation technique in training. Each randomized SMILES traverses the molecular graph differently, giving the model multiple views of the same molecule. SciRouter's format conversion endpoint returns canonical SMILES, which is what you want for production inference. For training data augmentation, use RDKit's Chem.MolToSmiles(mol, doRandom=True).
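Before any model sees a SMILES string, the string has to be split into tokens. A common baseline is atom-level tokenization with a regular expression, sketched below. This is a simpler scheme than the learned subword vocabularies mentioned above, and the pattern and function name here are assumptions for illustration that cover only the notation introduced in this guide.

```python
import re

SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"      # bracket atom kept whole: [C@@H], [NH4+], [13C]
    r"|Br|Cl"          # two-letter organic-subset atoms
    r"|[BCNOPSFI]"     # one-letter organic-subset atoms
    r"|[bcnops]"       # aromatic atoms
    r"|%\d{2}|\d"      # ring closure labels
    r"|[-=#:/\\().]"   # bonds, branches, fragment separator
)


def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:   # some character was not recognized
        raise ValueError(f"cannot tokenize: {smiles!r}")
    return tokens


print(tokenize("[C@@H](O)(F)Cl"))
# ['[C@@H]', '(', 'O', ')', '(', 'F', ')', 'Cl']
print(len(tokenize("CC(=O)Oc1ccccc1C(=O)O")))   # 21
```

Keeping Cl, Br, and bracket atoms as single tokens matters: a character-by-character split would turn Cl into a carbon followed by an unknown symbol.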
Quick Reference: SMILES Symbols
Here is a compact reference for all SMILES notation elements covered in this guide:
- Atoms: B, C, N, O, P, S, F, Cl, Br, I (organic subset); [Fe], [Cu+2], [NH4+] (bracketed)
- Aromatic atoms: c, n, o, s (lowercase)
- Bonds: (implicit) single, = double, # triple, : aromatic
- Branches: ( ) parentheses
- Rings: digits 1-9 or %10-%99 for ring closures
- Stereochemistry: / and \ for cis/trans; @ and @@ for chirality
- Charges: [NH4+], [O-], [Fe+3] (inside brackets)
- Disconnected fragments: . (period) between components
- Isotopes: [2H], [13C], [14N] (mass number before atom symbol)
- Explicit hydrogens: [CH4], [NH3], [OH2] (inside brackets)
Try It Now: Convert and Validate SMILES
Practice reading and writing SMILES using the SMILES Converter free tool – no account or API key required. Paste a SMILES string and instantly see the canonical form, InChI, InChIKey, molecular formula, and a 2D structure rendering.
For programmatic access, create a free SciRouter account at scirouter.ai/signup and use the format conversion and molecular properties endpoints to validate, convert, and analyze molecules in your own workflows. The similarity endpoint lets you compare molecules to find analogs, cluster libraries, and assess diversity.
SMILES notation is the foundation of everything in computational chemistry. Once you can read CC(=O)Oc1ccccc1C(=O)O and see aspirin in your mind, every cheminformatics tool, every database, and every generative chemistry model becomes accessible. The notation is simple, the vocabulary is small, and the payoff is enormous.