
SMILES Notation: Complete Beginner's Guide to Molecular Line Notation

Learn SMILES notation from scratch. Reading and writing molecular structures, practice molecules, SMILES vs InChI, and programmatic tools.

Ryan Bethencourt
April 8, 2026
10 min read

What Is SMILES and Why Should You Learn It?

SMILES – Simplified Molecular-Input Line-Entry System – is a way to write chemical structures as plain text. Instead of drawing hexagons and lines on paper, you type a string like c1ccccc1 for benzene or CC(=O)O for acetic acid. That string encodes every atom, every bond, every ring, and every stereocenter in the molecule.

SMILES was introduced by David Weininger in 1988, and more than three decades later it remains the lingua franca of computational chemistry. The major cheminformatics databases (ChEMBL, PubChem, DrugBank, ZINC) and most molecular property prediction tools, generative chemistry models, and docking platforms accept SMILES as an input format. If you want to work with molecules computationally – whether you are a researcher, a developer building biotech tools, or an AI engineer training models on chemical data – you need to read and write SMILES.

This guide will take you from zero to fluency. We will cover every element of the SMILES language, work through 10 practice molecules of increasing complexity, compare SMILES to other notations, address common mistakes, and show you how to validate and convert SMILES programmatically using the SciRouter API.

The Basics: Atoms and Bonds

Atoms

In SMILES, atoms are represented by their chemical symbols. Organic subset atoms (B, C, N, O, P, S, F, Cl, Br, I) can be written without brackets. All other atoms must be enclosed in square brackets. Hydrogen atoms are usually implicit – the system calculates them automatically based on valence rules.

  • C – carbon (sp3, with implicit hydrogens: CH4 = methane)
  • N – nitrogen
  • O – oxygen
  • [Fe] – iron (needs brackets because it is not in the organic subset)
  • [Cu+2] – copper(II) ion (charge specified inside brackets)

Lowercase letters indicate aromatic atoms: c is an aromatic carbon, n is an aromatic nitrogen. This distinction matters for ring systems, which we will cover shortly.

Bonds

Bonds between atoms are specified by symbols placed between the atom symbols. Single bonds are usually implicit (just write the atoms next to each other), but can be made explicit with a hyphen.

  • (implicit or -) – single bond: CC is ethane (C-C)
  • = – double bond: C=C is ethylene
  • # – triple bond: C#C is acetylene
  • : – aromatic bond (usually implied by lowercase atoms)

Let us build our first real molecule. Ethanol (drinking alcohol) has two carbons, one oxygen, and a single bond between each: CCO. That is it. The hydrogens are implicit: the first carbon has 3 hydrogens (CH3), the second has 2 (CH2), and the oxygen has 1 (OH). So CCO represents CH3-CH2-OH.
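The hydrogen-counting rule above can be sketched in a few lines of Python. This is a minimal illustration, not how any toolkit is implemented: it assumes an unbranched, ring-free chain built only from C, N, and O with default valences of 4, 3, and 2, and the function name implicit_hydrogens is hypothetical.

```python
# Default valences for a few organic-subset atoms (illustrative subset only).
DEFAULT_VALENCE = {"C": 4, "N": 3, "O": 2}
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def implicit_hydrogens(smiles: str) -> list[tuple[str, int]]:
    """Return (atom, implicit H count) pairs for a simple linear SMILES."""
    atoms: list[str] = []
    used: list[int] = []      # bond order already used by each atom
    pending = 1               # order of the bond to the next atom (single by default)
    for ch in smiles:
        if ch in BOND_ORDER:
            pending = BOND_ORDER[ch]
        elif ch in DEFAULT_VALENCE:
            atoms.append(ch)
            used.append(0)
            if len(atoms) > 1:            # bond to the previous atom in the chain
                used[-1] += pending
                used[-2] += pending
            pending = 1
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return [(a, DEFAULT_VALENCE[a] - u) for a, u in zip(atoms, used)]


print(implicit_hydrogens("CCO"))   # [('C', 3), ('C', 2), ('O', 1)]
print(implicit_hydrogens("C=C"))   # [('C', 2), ('C', 2)]
print(implicit_hydrogens("C#C"))   # [('C', 1), ('C', 1)]
```

Each atom's hydrogen count is its default valence minus the bond orders it already uses, which is why the three atoms of CCO come out as CH3, CH2, and OH.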

Branches: Parentheses for Side Chains

Linear molecules are easy, but most molecules have branches. In SMILES, a branch is enclosed in parentheses. The branch starts from the atom immediately before the opening parenthesis.

Consider isobutane (2-methylpropane). The main chain is three carbons: CCC. The methyl branch comes off the middle carbon: CC(C)C. The parentheses around C mean "there is a branch (a methyl group) attached to this carbon."

You can nest branches and have multiple branches on the same atom:

  • CC(C)C – isobutane (one branch off the second carbon)
  • CC(C)(C)C – neopentane (two branches off the second carbon)
  • CC(=O)O – acetic acid (the second carbon has a =O branch and continues to -OH)
  • CC(=O)Oc1ccccc1C(=O)O – aspirin (branches and rings combined)

The key rule: after closing a parenthesis, you return to the atom where the branch started. So in CC(=O)O, the chain goes C-C, then branches to =O, returns to the second C, and continues to -O (then implicit -H on that oxygen).
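The "return to the branch point" rule maps naturally onto a stack. The sketch below is a simplified illustration that assumes single-letter atoms, the bond symbols -, =, and #, and parentheses only (no rings, brackets, or two-letter atoms); parse_branches is a hypothetical helper name.

```python
def parse_branches(smiles: str) -> tuple[list[str], list[tuple[int, int, str]]]:
    """Parse single-letter atoms, bond symbols, and parentheses into atoms and bonds."""
    atoms: list[str] = []
    bonds: list[tuple[int, int, str]] = []
    stack: list[int] = []     # branch points to return to when a branch closes
    prev = None               # index of the atom the next atom will bond to
    bond = "-"                # pending bond symbol (single by default)
    for ch in smiles:
        if ch == "(":
            stack.append(prev)            # remember the branch point
        elif ch == ")":
            prev = stack.pop()            # return to the branch point
        elif ch in "-=#":
            bond = ch
        elif ch.isalpha():
            atoms.append(ch)
            idx = len(atoms) - 1
            if prev is not None:
                bonds.append((prev, idx, bond))
            prev, bond = idx, "-"
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    if stack:
        raise ValueError("unbalanced parentheses")
    return atoms, bonds


atoms, bonds = parse_branches("CC(=O)O")
print(bonds)  # [(0, 1, '-'), (1, 2, '='), (1, 3, '-')]
```

In the acetic acid output, both oxygens (atoms 2 and 3) are bonded to atom 1, the second carbon, which is exactly the behavior the closing parenthesis is meant to produce.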

Rings: Numbered Closure Digits

Rings are represented by ring closure digits. You assign a number to an atom where the ring opens, then use the same number at the atom where the ring closes.

Cyclohexane is six carbons in a ring: C1CCCCC1. The first carbon is labeled 1, and when we reach the last carbon, we close the ring by writing 1 again. The system draws a bond between the two atoms that share the same digit.

Benzene uses aromatic notation: c1ccccc1. The lowercase c atoms indicate aromaticity, and the ring closure 1 completes the six-membered aromatic ring. You can also write benzene in Kekulé form (with alternating single and double bonds): C1=CC=CC=C1.

For molecules with multiple rings, use different digits for each ring:

  • c1ccc2ccccc2c1 – naphthalene (two fused aromatic rings, closures 1 and 2)
  • C1CC1 – cyclopropane (three-membered ring)
  • C1CCOC1 – tetrahydrofuran (five-membered ring with oxygen)
  • c1ccncc1 – pyridine (six-membered aromatic ring with nitrogen)

If you run out of single digits (rare, but it happens in complex natural products), use %10, %11, etc. for ring closure numbers above 9.
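To see how ring closure labels pair up, here is a small sketch that reports which atoms each closure connects. It assumes single-letter atoms only (so Cl and Br are out of scope) and ignores bond symbols and parentheses; ring_bonds is a hypothetical name, and this is not how a full parser works.

```python
import re

# Tokens: a two-digit closure (%10), a single-digit closure, or one atom letter.
TOKEN = re.compile(r"%\d{2}|\d|[A-Za-z]")


def ring_bonds(smiles: str) -> list[tuple[int, int]]:
    """Pair up ring-closure labels and return the atom indices they connect."""
    open_rings: dict[str, int] = {}   # label -> index of the atom that opened it
    pairs: list[tuple[int, int]] = []
    atom_index = -1
    for tok in TOKEN.findall(smiles):
        if tok.isalpha():
            atom_index += 1
        else:
            label = tok.lstrip("%")
            if label in open_rings:       # second occurrence closes the ring
                pairs.append((open_rings.pop(label), atom_index))
            else:                         # first occurrence opens it
                open_rings[label] = atom_index
    if open_rings:
        raise ValueError(f"unclosed ring labels: {sorted(open_rings)}")
    return pairs


print(ring_bonds("C1CCCCC1"))        # [(0, 5)]
print(ring_bonds("c1ccc2ccccc2c1"))  # [(3, 8), (0, 9)]
```

Because a label is removed from the open set as soon as it closes, the same digit can be reused later in the string for a different ring.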

Stereochemistry: Specifying 3D Arrangement

Cis/Trans (E/Z) Isomerism

Double bonds can have cis or trans geometry. SMILES uses / and \ to specify the arrangement of atoms around a double bond:

  • F/C=C/F – trans-1,2-difluoroethylene (F atoms on opposite sides)
  • F/C=C\F – cis-1,2-difluoroethylene (F atoms on the same side)

The slash and backslash describe the direction of bonds relative to the double bond. Two slashes in the same direction (/C=C/) mean trans. A slash and a backslash (/C=C\) mean cis. This notation is intuitive once you visualize the slashes as upward and downward bonds.
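The same-direction versus opposite-direction rule can be written as a tiny check. This sketch only covers the simple linear form used above (one double bond, written as X/C=C/Y or X/C=C\Y); when a substituent is written in a branch after its carbon, as in C(/F)=C/F, the reading of the slash changes and this shortcut no longer applies. The function name is hypothetical.

```python
def double_bond_geometry(smiles: str) -> str:
    """Classify a simple X/C=C/Y style SMILES as 'trans', 'cis', or 'unspecified'."""
    marks = [ch for ch in smiles if ch in "/\\"]
    if len(marks) != 2:
        return "unspecified"          # no (or more complex) directional bonds
    return "trans" if marks[0] == marks[1] else "cis"


print(double_bond_geometry("F/C=C/F"))    # trans
print(double_bond_geometry("F/C=C\\F"))   # cis
print(double_bond_geometry("FC=CF"))      # unspecified
```

Note that the backslash has to be escaped (or placed in a raw string) in Python source, which is a frequent source of silently corrupted SMILES.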

Chirality (R/S Configuration)

Tetrahedral stereocenters are specified with @ and @@ inside square brackets. The @ symbol means the remaining substituents are arranged counterclockwise when viewed from the first neighbor; @@ means clockwise.

  • [C@@H](O)(F)Cl – specifies a particular enantiomer of fluorochloromethanol
  • [C@H](O)(F)Cl – the opposite enantiomer
  • C[C@@H](O)CC – (R)-2-butanol

Chirality in SMILES is critical for drug molecules, where the two enantiomers can have completely different biological activities. Thalidomide is the notorious example: the (R)-enantiomer is associated with the sedative effect and the (S)-enantiomer with birth defects, although the two interconvert in the body. The SMILES for (S)-thalidomide is O=C1CC[C@H](N2C(=O)c3ccccc3C2=O)C(=O)N1.

Note
If you omit stereochemistry from a SMILES string, the configuration is simply unspecified: the string could stand for either enantiomer or for a mixture, and many tools effectively treat it as a racemate. For drug discovery applications, always include stereochemistry when it is known, as it affects binding, metabolism, and toxicity.
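Because @ and @@ are defined relative to the order in which neighbors are written, rewriting a SMILES can require flipping the tag to keep the same enantiomer. The sketch below illustrates the underlying rule (each swap of two neighbors flips the tag) using permutation parity. The function name and the neighbor labels are hypothetical, and the implicit hydrogen of a bracket atom such as [C@@H] is counted as the neighbor listed right after the preceding atom.

```python
def chirality_after_reorder(tag: str, old_order: list[str], new_order: list[str]) -> str:
    """Return the @/@@ tag that keeps the same enantiomer when neighbors are relisted."""
    if sorted(old_order) != sorted(new_order) or len(set(old_order)) != len(old_order):
        raise ValueError("orders must be permutations of the same distinct neighbors")
    positions = [old_order.index(n) for n in new_order]
    # Count inversions: an odd count means an odd permutation (an odd number of swaps).
    inversions = sum(
        1
        for i in range(len(positions))
        for j in range(i + 1, len(positions))
        if positions[i] > positions[j]
    )
    if inversions % 2 == 0:
        return tag
    return "@" if tag == "@@" else "@@"


old = ["CH3", "H", "OH", "C2H5"]   # neighbor order in C[C@@H](O)CC
new = ["CH3", "H", "C2H5", "OH"]   # neighbor order if written as C[C...H](CC)O
print(chirality_after_reorder("@@", old, new))  # @
```

So C[C@H](CC)O describes the same (R)-2-butanol as C[C@@H](O)CC: the two branches were swapped once, and the tag flipped once to compensate.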

Practice: 10 Molecules from Simple to Complex

The best way to learn SMILES is to practice. Here are 10 real molecules with increasing complexity. For each one, try to understand the SMILES before reading the explanation.

1. Methane – C

The simplest possible SMILES. One carbon with four implicit hydrogens.

2. Water – O

One oxygen with two implicit hydrogens. Yes, water has a valid SMILES.

3. Ethanol – CCO

Two carbons bonded together, then an oxygen. Implicit hydrogens give CH3-CH2-OH.

4. Acetic acid – CC(=O)O

A carbon (CH3) bonded to another carbon, which has a double-bond branch to oxygen (=O, the carbonyl) and continues to a hydroxyl oxygen (OH). This is vinegar.

5. Benzene – c1ccccc1

Six aromatic carbons in a ring. The ring closure digit 1 connects the first and last carbon. Each aromatic carbon has one implicit hydrogen.

6. Aspirin – CC(=O)Oc1ccccc1C(=O)O

An acetoxy group (CC(=O)O) connected to a benzene ring (c1ccccc1) which also bears a carboxylic acid group (C(=O)O). The ester oxygen bridges the acetyl group to the ring.

7. Caffeine – Cn1c(=O)c2c(ncn2C)n(C)c1=O

A purine-derived xanthine skeleton with three N-methyl groups. This molecule has two fused rings (a six-membered pyrimidinedione and a five-membered imidazole), two carbonyl oxygens, and three methyl groups on nitrogens. Ring closures 1 and 2 handle the fused ring system.

8. Ibuprofen – CC(C)Cc1ccc(cc1)C(C)C(=O)O

An isobutyl group (CC(C)C) attached to a para-substituted benzene ring, with a propionic acid group (C(C)C(=O)O) on the other side. The 1 ring closures define the benzene ring, and the substitution pattern gives the 1,4- (para) arrangement.

9. Sotorasib – C=CC(=O)N1CCN(C(C)C1)c1nc(=O)n(-c2c(C)ccnc2C(C)C)c2nc(-c3c(O)cccc3F)c(F)cc12

The first approved KRAS G12C inhibitor (Lumakras). This complex SMILES encodes an acrylamide warhead (C=CC(=O)N) on a methyl-substituted piperazine ring (N1CCN...C1), a bicyclic pyridopyrimidinone core (closed by the final ring closures 1 and 2), an isopropyl-methylpyridine ring, and a fluorinated hydroxyphenyl ring. Note that ring closure digits are reused once the earlier ring has been closed. Stereochemistry is omitted here for clarity. Four ring systems in one molecule.

10. Taxol (Paclitaxel) – CC1=C2C(C(=O)C3(C(CC4C(C3C(C(C2(C)C)(CC1OC(=O)C(C(c5ccccc5)NC(=O)c6ccccc6)O)O)OC(=O)c7ccccc7)(CO4)OC(=O)C)O)C)OC(=O)C

The famous anticancer natural product from the Pacific yew tree. This monster SMILES contains multiple fused rings (including a four-membered oxetane, closure 4), stereocenters (omitted here for clarity), ester groups, an amide bond, and three phenyl rings. At well over 100 characters even without stereochemistry, it pushes the limits of what a human can reasonably parse – which is why we have computational tools.

SMILES vs InChI vs IUPAC: Which Notation When?

SMILES is not the only way to represent molecules as text. Two other systems are widely used, and each has distinct strengths.

InChI (International Chemical Identifier)

InChI was developed by IUPAC and NIST as a standardized, canonical chemical identifier. Unlike SMILES, InChI always produces a single unique representation for each molecule (canonical by definition). An InChI string looks like this: InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3 (ethanol).

InChI is better than SMILES for database deduplication and exact-match searching because canonicalization is built-in. Its companion, InChIKey, is a fixed-length 27-character hash that enables fast database lookups. However, InChI is nearly impossible for humans to read and is not commonly used as input to computational tools.
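The fixed-length layout of an InChIKey makes a quick format check easy. The sketch below assumes the standard layout of 14 letters, a hyphen, 10 letters, a hyphen, and 1 letter; it only checks the shape of the string and cannot tell whether the key corresponds to a real structure. looks_like_inchikey is a hypothetical helper name.

```python
import re

# 14 letters, hyphen, 10 letters, hyphen, 1 letter = 27 characters in total.
INCHIKEY_PATTERN = re.compile(r"[A-Z]{14}-[A-Z]{10}-[A-Z]")


def looks_like_inchikey(text: str) -> bool:
    """Return True if the text has the shape of a standard InChIKey."""
    return INCHIKEY_PATTERN.fullmatch(text) is not None


ethanol_key = "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"   # InChIKey of ethanol
print(len(ethanol_key))                        # 27
print(looks_like_inchikey(ethanol_key))        # True
print(looks_like_inchikey("CCO"))              # False
```

A check like this is handy for deciding whether an incoming identifier column holds InChIKeys or SMILES before routing it to the right lookup.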

IUPAC Nomenclature

IUPAC names are the systematic names used by chemists: "2-acetoxybenzoic acid" for aspirin, "(RS)-2-(4-(2-methylpropyl)phenyl)propanoic acid" for ibuprofen. IUPAC names are human-readable and unambiguous, but they are verbose, hard to parse computationally, and not standardized across all edge cases.

When to Use Each

  • SMILES – API inputs, cheminformatics tools, generative models, database storage, molecular editors. The default choice for almost all computational chemistry workflows.
  • InChI / InChIKey – Database deduplication, exact-match queries, cross-database linking (PubChem, ChEMBL, UniChem). Use InChIKey when you need a fixed-length unique identifier.
  • IUPAC – Human communication, publications, regulatory filings, patent claims. Never use IUPAC as computational input.

SciRouter's format conversion endpoint converts between all three notations, so you can accept any format and standardize to canonical SMILES for downstream processing.

Common SMILES Mistakes and How to Fix Them

Mistake 1: Mismatched Ring Closure Digits

Every ring opening must have a corresponding closure. Writing C1CCCCC (without closing 1) is invalid. Fix: always close rings with the matching digit: C1CCCCC1.

Mistake 2: Unbalanced Parentheses

Branches must have balanced parentheses. CC(=O is invalid because the parenthesis is never closed. Fix: CC(=O)O. Use a text editor with bracket matching to catch this.

Mistake 3: Wrong Valence

Carbon normally has 4 bonds, nitrogen 3, oxygen 2. Writing C(C)(C)(C)(C)C gives carbon 5 bonds, which is invalid for standard organic chemistry. Fix: check that the total bond order plus implicit hydrogens equals the atom's normal valence.

Mistake 4: Confusing Aromatic and Aliphatic Atoms

Lowercase c means aromatic carbon, uppercase C means aliphatic carbon. Writing c1ccccc1 is benzene (aromatic, correct). Writing C1CCCCC1 is cyclohexane (aliphatic, correct). Mixing them (c1CCCCc1) is often invalid because partial aromaticity is not well-defined in the Kekulization algorithm.

Mistake 5: Forgetting Charges

Charged atoms must be in brackets with the charge specified. An ammonium ion is [NH4+], not N+, and a protonated primary amine is written with the charged nitrogen in brackets, as in C[NH3+]. A carboxylate oxygen is [O-], not O-. If your SMILES string behaves unexpectedly, check whether ionic species have proper bracket notation.
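Several of these mistakes can be caught with plain text checks before a string ever reaches a toolkit. The following is a rough pre-flight sketch, assuming nothing beyond the Python standard library; lint_smiles is a hypothetical helper, it does not check valence or aromaticity, and it is no substitute for a real parser.

```python
def lint_smiles(smiles: str) -> list[str]:
    """Return a list of syntax problems found by simple text checks (empty if none)."""
    problems: list[str] = []
    depth = 0                       # current parenthesis nesting depth
    open_rings: set[str] = set()    # ring labels opened but not yet closed
    in_bracket = False
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif in_bracket:
            pass                    # digits and charges inside [...] are not checked
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append("closing parenthesis without an opening one")
                depth = 0
        elif ch == "+":
            problems.append("charge outside square brackets")
        elif ch == "%":
            open_rings ^= {smiles[i + 1 : i + 3]}   # toggle a two-digit label
            i += 2
        elif ch.isdigit():
            open_rings ^= {ch}                      # toggle a single-digit label
        i += 1
    if depth > 0:
        problems.append("unclosed parenthesis")
    if in_bracket:
        problems.append("unclosed square bracket")
    if open_rings:
        problems.append(f"unclosed ring closure: {', '.join(sorted(open_rings))}")
    return problems


print(lint_smiles("C1CCCCC"))   # ['unclosed ring closure: 1']
print(lint_smiles("CC(=O"))     # ['unclosed parenthesis']
print(lint_smiles("CC(=O)O"))   # []
```

A bare minus sign cannot be flagged this way because - is also the explicit single bond, so only a stray + is reported as a misplaced charge.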

Validating and Converting SMILES with Code

Before sending a SMILES string to any prediction tool, validate it. Invalid SMILES will cause errors or, worse, silently produce incorrect results. Here is how to validate and convert SMILES using the SciRouter API and Python.

python
import scirouter

client = scirouter.SciRouter(api_key="sk-sci-YOUR_KEY")

# Validate and canonicalize a SMILES string
smiles = "C(C)(=O)Oc1ccccc1C(=O)O"  # Aspirin (non-canonical form)

result = client.chemistry.convert(
    smiles=smiles,
    output_format="canonical_smiles"
)
print(f"Input:     {smiles}")
print(f"Canonical: {result.canonical_smiles}")
# Output: CC(=O)Oc1ccccc1C(=O)O

# Convert to InChI
inchi_result = client.chemistry.convert(
    smiles=smiles,
    output_format="inchi"
)
print(f"InChI:     {inchi_result.inchi}")
print(f"InChIKey:  {inchi_result.inchi_key}")

# Batch validate a list of SMILES
test_smiles = [
    "CCO",                    # Valid: ethanol
    "CC(=O)Oc1ccccc1C(=O)O", # Valid: aspirin
    "C1CCCCC",                # Invalid: unclosed ring
    "CC(C)(C)(C)C",           # Invalid: pentavalent carbon
    "c1ccccc1",               # Valid: benzene
]

for smi in test_smiles:
    try:
        result = client.chemistry.convert(smiles=smi, output_format="canonical_smiles")
        print(f"VALID:   {smi} -> {result.canonical_smiles}")
    except Exception as e:
        print(f"INVALID: {smi} -> {e}")

For local validation without API calls, RDKit is the standard library:

python
from rdkit import Chem

def validate_smiles(smiles: str) -> bool:
    """Return True if SMILES is valid, False otherwise."""
    mol = Chem.MolFromSmiles(smiles)
    return mol is not None

def canonicalize(smiles: str) -> str:
    """Return canonical SMILES or raise ValueError."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"Invalid SMILES: {smiles}")
    return Chem.MolToSmiles(mol)

# Examples
print(validate_smiles("CCO"))        # True
print(validate_smiles("C1CCCCC"))    # False
print(canonicalize("C(C)(=O)Oc1ccccc1C(=O)O"))  # CC(=O)Oc1ccccc1C(=O)O

Programmatic SMILES with the SciRouter API

Once you can read and write SMILES, the real power comes from using them programmatically. Every SciRouter chemistry endpoint accepts SMILES as input. Here are the most common operations.

Calculate Molecular Properties from SMILES

python
import scirouter

client = scirouter.SciRouter(api_key="sk-sci-YOUR_KEY")

# Calculate properties for caffeine
result = client.chemistry.properties(smiles="Cn1c(=O)c2c(ncn2C)n(C)c1=O")

print(f"Molecular weight: {result.molecular_weight:.2f} Da")
print(f"LogP: {result.logp:.2f}")
print(f"H-bond donors: {result.h_bond_donors}")
print(f"H-bond acceptors: {result.h_bond_acceptors}")
print(f"Rotatable bonds: {result.rotatable_bonds}")
print(f"TPSA: {result.tpsa:.1f} A^2")
print(f"Lipinski violations: {result.lipinski_violations}")

Compare Molecules by Similarity

python
# Compare ibuprofen to naproxen (both are NSAIDs)
similarity = client.chemistry.similarity(
    smiles_a="CC(C)Cc1ccc(cc1)C(C)C(=O)O",  # Ibuprofen
    smiles_b="COc1ccc2cc(ccc2c1)C(C)C(=O)O",  # Naproxen
)
print(f"Tanimoto similarity: {similarity.tanimoto:.3f}")
# The exact value depends on the fingerprint type; expect a moderate score (same pharmacological class, different scaffolds)
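For intuition about what a Tanimoto score measures, here is a minimal sketch of the coefficient itself: the number of shared features divided by the number of distinct features in either molecule. The feature sets below are hand-written and purely illustrative; real tools compute the score on fingerprint bit vectors, and the fingerprint used by the endpoint above is not specified here.

```python
def tanimoto(features_a: set[str], features_b: set[str]) -> float:
    """Tanimoto (Jaccard) coefficient: shared features / all distinct features."""
    union = features_a | features_b
    if not union:
        return 0.0
    return len(features_a & features_b) / len(union)


# Hypothetical, hand-written feature sets purely for illustration.
mol_a = {"aromatic_ring", "carboxylic_acid", "isobutyl", "alpha_methyl"}
mol_b = {"aromatic_ring", "carboxylic_acid", "methoxy", "alpha_methyl", "fused_ring"}
print(f"{tanimoto(mol_a, mol_b):.3f}")  # 0.500
```

Three shared features out of six distinct ones gives 0.5; identical sets give 1.0 and sets with nothing in common give 0.0.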

Convert Between Formats

python
# SMILES to InChI to SMILES round-trip
aspirin_smiles = "CC(=O)Oc1ccccc1C(=O)O"

to_inchi = client.chemistry.convert(smiles=aspirin_smiles, output_format="inchi")
print(f"InChI: {to_inchi.inchi}")

# InChI back to SMILES
to_smiles = client.chemistry.convert(inchi=to_inchi.inchi, output_format="canonical_smiles")
print(f"Back to SMILES: {to_smiles.canonical_smiles}")

Note
The SciRouter Molecular Properties and Format Conversion endpoints accept SMILES, InChI, and MOL block inputs. You can start from any format and convert to any other.

SMILES in Machine Learning and Generative Chemistry

SMILES strings are the primary input representation for molecular machine learning models. Transformer-based models like ChemBERTa tokenize SMILES into subword units and learn chemical representations from millions of molecules. Generative models like REINVENT4 generate new molecules by sampling SMILES strings character by character, constrained by the grammar rules of the notation.

This has a practical implication: if you are training or fine-tuning molecular ML models, your SMILES must be clean and canonical. Non-canonical SMILES introduce unnecessary variation that makes training harder. Invalid SMILES waste training samples. Always canonicalize your SMILES dataset before training, and validate generated SMILES before evaluating them.

SMILES augmentation – generating multiple valid but non-canonical SMILES for the same molecule – is sometimes used as a data augmentation technique in training. Each randomized SMILES traverses the molecular graph differently, giving the model multiple views of the same molecule. SciRouter's format conversion endpoint returns canonical SMILES, which is what you want for production inference. For training data augmentation, use RDKit's Chem.MolToSmiles(mol, doRandom=True).
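Before any of these models see a molecule, the SMILES string has to be split into tokens and mapped to integers. The sketch below shows one simple way to do that with a regular expression. It is an illustrative tokenizer covering only the notation in this guide, not the tokenizer used by ChemBERTa, REINVENT4, or any other named model, and the helper names are hypothetical.

```python
import re

SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"          # bracket atom, e.g. [C@@H] or [NH4+]
    r"|Br|Cl"              # two-letter organic-subset atoms
    r"|%\d{2}"             # two-digit ring closure
    r"|[BCNOPSFIbcnops]"   # one-letter atoms (aliphatic and aromatic)
    r"|[=#:/\\\-.()]"      # bonds, branches, fragment separator
    r"|\d"                 # single-digit ring closure
)


def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into tokens, rejecting characters that do not match."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable characters in {smiles!r}")
    return tokens


def build_vocab(smiles_list: list[str]) -> dict[str, int]:
    """Assign each distinct token an integer id in order of first appearance."""
    vocab: dict[str, int] = {}
    for smi in smiles_list:
        for tok in tokenize(smi):
            vocab.setdefault(tok, len(vocab))
    return vocab


print(tokenize("C[C@@H](O)CC"))   # ['C', '[C@@H]', '(', 'O', ')', 'C', 'C']
print(tokenize("CC(=O)Cl"))       # ['C', 'C', '(', '=', 'O', ')', 'Cl']
vocab = build_vocab(["CCO", "CC(=O)O"])
print([vocab[t] for t in tokenize("CC(=O)O")])  # [0, 0, 2, 3, 1, 4, 1]
```

Treating Cl and bracket atoms as single tokens matters: a character-level split would turn Cl into a carbon followed by an unknown symbol.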

Quick Reference: SMILES Symbols

Here is a compact reference for all SMILES notation elements covered in this guide:

  • Atoms: B, C, N, O, P, S, F, Cl, Br, I (organic subset); [Fe], [Cu+2], [NH4+] (bracketed)
  • Aromatic atoms: c, n, o, s (lowercase)
  • Bonds: (implicit) single, = double, # triple, : aromatic
  • Branches: ( ) parentheses
  • Rings: digits 1-9 or %10-%99 for ring closures
  • Stereochemistry: / and \ for cis/trans; @ and @@ for chirality
  • Charges: [NH4+], [O-], [Fe+3] (inside brackets)
  • Disconnected fragments: . (period) between components
  • Isotopes: [2H], [13C], [14N] (mass number before atom symbol)
  • Explicit hydrogens: [CH4], [NH3], [OH2] (inside brackets)

Try It Now: Convert and Validate SMILES

Practice reading and writing SMILES using the SMILES Converter free tool – no account or API key required. Paste a SMILES string and instantly see the canonical form, InChI, InChIKey, molecular formula, and a 2D structure rendering.

For programmatic access, create a free SciRouter account at scirouter.ai/signup and use the format conversion and molecular properties endpoints to validate, convert, and analyze molecules in your own workflows. The similarity endpoint lets you compare molecules to find analogs, cluster libraries, and assess diversity.

SMILES notation is the foundation of everything in computational chemistry. Once you can read CC(=O)Oc1ccccc1C(=O)O and see aspirin in your mind, every cheminformatics tool, every database, and every generative chemistry model becomes accessible. The notation is simple, the vocabulary is small, and the payoff is enormous.

Frequently Asked Questions

What does SMILES stand for?

SMILES stands for Simplified Molecular-Input Line-Entry System. It was developed by David Weininger in the late 1980s as a way to represent chemical structures as compact text strings. The original paper was published in 1988 in the Journal of Chemical Information and Computer Sciences. Despite its age, SMILES remains the most widely used molecular line notation in cheminformatics, drug discovery, and computational chemistry.

Is SMILES notation unique for each molecule?

No. A single molecule can have many valid SMILES representations depending on which atom you start from and which path you take through the molecular graph. For example, ethanol can be written as CCO, OCC, or C(O)C. However, there is a standardized form called canonical SMILES that produces a single, unique string for each molecule using a deterministic algorithm. RDKit, OpenBabel, and SciRouter all generate canonical SMILES. If you need to compare molecules or use SMILES as database keys, always use canonical SMILES.

Can SMILES represent proteins or DNA?

Technically yes, but practically no. SMILES can represent any covalent structure, including amino acids and nucleotides. However, a protein with 300 residues would produce a SMILES string thousands of characters long that is impossible to read or work with. For biopolymers, specialized notations exist: FASTA for protein sequences, HELM for complex biopolymers, and FASTA/GenBank for nucleic acids. SMILES is best suited for small molecules (MW below roughly 1,000 Da).

What is the difference between SMILES and SMARTS?

SMILES describes a specific molecule. SMARTS (SMiles ARbitrary Target Specification) describes a pattern that can match multiple molecules. SMARTS uses the same basic syntax as SMILES but adds wildcards and logical operators. For example, the SMARTS pattern [#6]=[#6] matches any carbon-carbon double bond, while [OH] matches any hydroxyl group. SMARTS is used for substructure searching, reaction definitions, and chemical pattern matching. If SMILES is a photograph of a molecule, SMARTS is a police sketch that matches multiple suspects.

How do I handle salts and multi-component systems in SMILES?

Multi-component SMILES uses a period (.) to separate disconnected fragments. For example, sodium chloride is [Na+].[Cl-], and the hydrochloride salt of an amine is written with the protonated nitrogen in brackets, such as C[NH3+].[Cl-] for methylamine hydrochloride. In drug discovery, you often encounter salts (metformin hydrochloride), co-crystals, and solvates. When submitting SMILES to property prediction APIs, use the parent molecule (free base or free acid) without the salt, as the salt form affects physical properties but not the core molecular structure that models predict on.
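A crude way to drop a counterion is to split on the period and keep the largest piece. The sketch below uses string length as a stand-in for fragment size, which is a rough heuristic and an assumption of this example; it does not neutralize charges, and a dedicated salt-stripping routine in a cheminformatics toolkit is the safer choice for real data. largest_fragment is a hypothetical helper name.

```python
def largest_fragment(smiles: str) -> str:
    """Return the longest dot-separated component of a multi-component SMILES."""
    fragments = [f for f in smiles.split(".") if f]
    if not fragments:
        raise ValueError("empty SMILES")
    return max(fragments, key=len)   # on ties, the first fragment wins


print(largest_fragment("CN(C)C(=N)NC(=N)N.Cl"))   # CN(C)C(=N)NC(=N)N
print(largest_fragment("C[NH3+].[Cl-]"))          # C[NH3+]
```

The second example shows the limitation: the chloride is removed, but the nitrogen is still protonated, so the result is not yet the free base.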

Why does my SMILES string fail validation in some tools?

The most common causes of SMILES validation failure are: mismatched ring closure digits (opening a ring with 1 but not closing it), unbalanced parentheses in branch notation, invalid valence (carbon with 5 bonds), invalid atom symbols (using lowercase where uppercase is needed for non-aromatic atoms), and improper stereochemistry notation. Use a SMILES validator or the SciRouter format conversion endpoint to check your SMILES. RDKit's Chem.MolFromSmiles() returns None for invalid SMILES, which is the standard validation check in Python.
