What does SMILES stand for?

SMILES stands for Simplified Molecular Input Line Entry System. It was developed by David Weininger in the late 1980s as a way to represent chemical structures as simple text strings that are both human-readable and machine-parseable.

Are SMILES strings unique for a given molecule?

No. A single molecule can have many valid SMILES representations. For example, ethanol can be written as CCO, OCC, or C(O)C. Canonical SMILES algorithms (like those in RDKit or OpenBabel) produce a single standardized string for each molecule, which is useful for database lookups and deduplication.

What is the difference between SMILES and InChI?

SMILES is designed for human readability and compact representation. InChI (International Chemical Identifier) is designed for unique identification: each molecule has exactly one standard InChI. InChI strings are longer and harder to read, but they are better for database deduplication and chemical registration systems.

Can SMILES represent proteins or polymers?

Standard SMILES is designed for small molecules. While you can technically write a SMILES string for a short peptide, it becomes impractical for anything beyond a few amino acids. For proteins, use FASTA sequences. For polymers, extensions like BigSMILES or HELM are more appropriate.

How do I validate a SMILES string?

Use SciRouter’s /v1/chemistry/properties endpoint with any SMILES string. If the SMILES is valid, you will get molecular properties back. If it is invalid, the API returns an error. Programmatically, RDKit’s Chem.MolFromSmiles() returns None for invalid SMILES.

Do SMILES encode 3D geometry?

Standard SMILES encode molecular topology (atoms, bonds, connectivity) and can represent stereochemistry (E/Z double bonds, R/S chirality), but they do not encode 3D coordinates. To get 3D geometry from SMILES, you need a conformer generation step, which tools like RDKit or SciRouter’s API can perform.

SMILES Notation Explained: A Complete Guide

What Is SMILES Notation?

SMILES – Simplified Molecular Input Line Entry System – is the most widely used text format for representing chemical structures. Developed by David Weininger in 1988, it encodes molecular graphs as compact ASCII strings. Every atom, bond, ring, branch, and stereocenter in a molecule can be described in a single line of text that is both human-readable and machine-parseable.

If you work in cheminformatics, drug discovery, or computational chemistry, SMILES is unavoidable. It is the default input format for molecular property calculators, docking tools, machine learning models, and chemical databases like PubChem, ChEMBL, and ZINC. When you call SciRouter's chemistry APIs, SMILES is how you specify your molecules.

Basic Syntax Rules

Atoms

Atoms are represented by their chemical symbols. Organic subset atoms (B, C, N, O, P, S, F, Cl, Br, I) can be written without brackets. All other atoms, or atoms with non-default properties (charge, isotope, explicit hydrogen count), must be enclosed in square brackets.

C – carbon (aliphatic; implicit hydrogens fill the valence, so C on its own is methane)
N – nitrogen
[Fe] – iron (not in organic subset, needs brackets)
[NH4+] – ammonium ion (charged, explicit H count)
[13C] – carbon-13 isotope
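
To make the bracket rules concrete, here is a minimal sketch of how a bracket atom could be split into its fields with a regular expression. The name parse_bracket_atom and the pattern are illustrative assumptions, not part of any toolkit, and the sketch deliberately ignores chirality tags (@, @@) and atom classes.

```python
import re

# Illustrative pattern: [ isotope? symbol H-count? charge? ]
# Chirality tags and atom classes are not handled in this sketch.
BRACKET_ATOM = re.compile(
    r"\[(?P<isotope>\d+)?"
    r"(?P<symbol>[A-Z][a-z]?|[a-z])"
    r"(?P<hcount>H\d*)?"
    r"(?P<charge>[+-]\d*)?\]"
)

def parse_bracket_atom(token):
    """Split a bracket atom such as '[NH4+]' into its fields."""
    match = BRACKET_ATOM.fullmatch(token)
    if match is None:
        raise ValueError(f"not a supported bracket atom: {token!r}")
    hcount, charge = match["hcount"], match["charge"]
    sign = 1 if charge and charge[0] == "+" else -1
    return {
        "isotope": int(match["isotope"]) if match["isotope"] else None,
        "symbol": match["symbol"],
        # A bare 'H' means one hydrogen; 'H4' means four.
        "hcount": 0 if hcount is None else int(hcount[1:] or 1),
        # A bare '+' or '-' means a charge of magnitude one.
        "charge": 0 if charge is None else sign * int(charge[1:] or 1),
    }

for token in ["[Fe]", "[NH4+]", "[13C]"]:
    print(token, parse_bracket_atom(token))
```

Outside brackets none of these fields can be written, which is why charged atoms, isotopes, and elements outside the organic subset always need the bracket form.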

Bonds

Single bonds are implicit: just write the atoms next to each other (between two lowercase aromatic atoms, the implicit bond is aromatic instead). Other bond types use explicit symbols:

CC – ethane (C-C single bond, implicit)
C=C – ethylene (double bond)
C#C – acetylene (triple bond)
c1ccccc1 – benzene (aromatic bonds, lowercase atoms)

Aromatic atoms are written in lowercase. Benzene is c1ccccc1 rather than C1=CC=CC=C1, though both are valid. The lowercase notation is called the aromatic SMILES form and is generally preferred for readability.

Branches

Branches off the main chain are enclosed in parentheses. The branch starts from the atom immediately before the opening parenthesis:

CC(C)C – isobutane (2-methylpropane)
CC(=O)O – acetic acid (methyl carbon, then a carbon with a =O branch, then the hydroxyl oxygen continuing the main chain)
CC(C)(C)C – neopentane (three methyl branches on central carbon)

Rings

Rings are denoted by matching digits after the atoms where the ring opens and closes. The digit indicates that a bond connects these two atoms:

C1CCC1 – cyclobutane (ring opens at first C, closes at last C)
c1ccccc1 – benzene (aromatic six-membered ring)
C1CC2CCCCC2CC1 – decalin (two fused rings using digits 1 and 2)

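Ring-closure digits can be reused once a ring has been closed, so single digits are usually enough. When more ring bonds are open at the same time than single digits allow, use two-digit labels with percent notation: %10, %11, etc. This is rare for typical drug molecules but appears in complex natural products.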
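
As a rough illustration of how ring labels pair up, the sketch below scans a SMILES string and reports which atoms each ring bond connects. The function name ring_closures is hypothetical, atoms are numbered from zero in the order they are written, and the scan only looks at atoms and ring labels; it is not a full parser.

```python
import re

def ring_closures(smiles):
    """Return (label, opening_atom_index, closing_atom_index) per ring bond."""
    # Mask bracket atoms so digits inside them (isotopes, H counts, charges)
    # are not mistaken for ring labels.
    masked = re.sub(r"\[[^\]]*\]", "*", smiles)
    tokens = re.findall(r"%\d{2}|\d|Cl|Br|[A-Za-z*]", masked)

    open_labels = {}   # ring label -> index of the atom that opened it
    closures = []
    atom_index = -1
    for tok in tokens:
        if tok[0] == "%" or tok.isdigit():
            label = int(tok.lstrip("%"))
            if label in open_labels:
                closures.append((label, open_labels.pop(label), atom_index))
            else:
                open_labels[label] = atom_index
        else:
            atom_index += 1
    if open_labels:
        raise ValueError(f"unclosed ring label(s): {sorted(open_labels)}")
    return closures

print(ring_closures("C1CCC1"))           # cyclobutane
print(ring_closures("C1CC2CCCCC2CC1"))   # decalin
```

Because a label is removed from open_labels as soon as it closes, the same digit can be opened again later in the string.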

Stereochemistry

SMILES can encode both tetrahedral chirality and double bond geometry:

F/C=C/Cl – trans-1-chloro-2-fluoroethylene (E configuration, substituents on opposite sides)
F/C=C\Cl – cis isomer (Z configuration, substituents on the same side)
[C@@H](F)(Cl)Br – R-fluorochlorobromomethane (tetrahedral chirality)
[C@H](F)(Cl)Br – S-enantiomer

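The @ and @@ symbols specify the arrangement of neighbors around a chiral center: looking from the first neighbor toward the center, the remaining neighbors appear counterclockwise for @ and clockwise for @@, in the order they are written. A hydrogen written inside the brackets counts as a neighbor too; when the chiral atom starts the string, as in the examples above, that hydrogen is the first neighbor. If stereochemistry is omitted, the SMILES is considered unspecified at that center.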

10 Common Molecules and Their SMILES

Here are ten well-known molecules with their SMILES strings. These make good test inputs for API calls and help build intuition for reading SMILES:

Common molecules and their SMILES

Molecule           MW (g/mol)   SMILES
─────────────────────────────────────────────────────────────────────────────
Ethanol            46.07        CCO
Aspirin            180.16       CC(=O)Oc1ccccc1C(=O)O
Caffeine           194.19       Cn1c(=O)c2c(ncn2C)n(C)c1=O
Glucose            180.16       OC[C@H]1OC(O)[C@H](O)[C@@H](O)[C@@H]1O
Ibuprofen          206.28       CC(C)Cc1ccc(C(C)C(=O)O)cc1
Penicillin G       334.39       CC1(C)[C@@H](C(=O)O)N2C(=O)[C@@H](NC(=O)Cc3ccccc3)[C@H]2S1
Acetaminophen      151.16       CC(=O)Nc1ccc(O)cc1
Dopamine           153.18       NCCc1ccc(O)c(O)c1
Serotonin          176.22       NCCc1c[nH]c2ccc(O)cc12
Cholesterol        386.65       CC(C)CCCC(C)C1CCC2C3CC=C4CC(O)CCC4(C)C3CCC12C
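
The molecular weights in the table can be cross-checked by hand from each molecule's formula. The sketch below does that arithmetic for three entries; the formulas (C2H6O, C9H8O4, C8H10N4O2) are the standard ones for these molecules but are not listed in the table, and the helper name molecular_weight is an assumption for illustration. It only handles flat formulas without parentheses.

```python
import re

# Standard atomic weights in g/mol, rounded to three decimals.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}

def molecular_weight(formula):
    """Sum atomic weights for a flat formula such as 'C9H8O4'."""
    total = 0.0
    for symbol, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHT[symbol] * (int(count) if count else 1)
    return total

for name, formula in [("ethanol", "C2H6O"),
                      ("aspirin", "C9H8O4"),
                      ("caffeine", "C8H10N4O2")]:
    print(f"{name:<10} {formula:<10} {molecular_weight(formula):.2f} g/mol")
```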

Tip

Copy any of these SMILES strings and paste them into SciRouter's molecular properties tool to instantly calculate their physicochemical properties. No software installation required.

Reading SMILES: A Worked Example

Let's decode aspirin's SMILES step by step: CC(=O)Oc1ccccc1C(=O)O

C – methyl carbon
C(=O) – carbonyl carbon (double bond to oxygen, branch)
O – ester oxygen connecting to the ring
c1ccccc1 – benzene ring (aromatic carbons, ring opens and closes at digit 1)
C(=O) – second carbonyl carbon (carboxylic acid)
O – hydroxyl oxygen of the acid group

Reading left to right, you trace the molecular graph: a methyl group attached to an ester that bridges to a benzene ring, which carries a carboxylic acid group. This is exactly the structure of acetylsalicylic acid (aspirin).
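
The same left-to-right reading can be mimicked in code by splitting the string into tokens. The sketch below uses a regular expression of the kind often used to tokenize SMILES for machine learning models; the exact pattern and the name tokenize are assumptions for illustration, and the pattern covers the syntax described in this guide rather than every corner of the language.

```python
import re

# Bracket atoms stay whole; two-letter halogens are tried before single letters.
SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|%\d{2}|\d|[()=#/\\\-+.:~*$@]"
)

def tokenize(smiles):
    """Split a SMILES string into atom, bond, branch, and ring-label tokens."""
    tokens = SMILES_TOKEN.findall(smiles)
    # If anything was skipped, the pieces no longer rebuild the input.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens

print(tokenize("CC(=O)Oc1ccccc1C(=O)O"))
```

Every token in aspirin happens to be a single character; strings with Cl, Br, or bracket atoms produce longer tokens.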

Converting Between Molecular Formats via API

SMILES is not the only molecular format. InChI, MOL/SDF, and PDB are also widely used. SciRouter's format conversion endpoint converts between these formats with a single API call:

Convert SMILES to InChI and MOL

import requests

API_KEY = "sk-sci-your-api-key"
BASE = "https://api.scirouter.ai/v1"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Convert caffeine from SMILES to InChI and MOL format
response = requests.post(f"{BASE}/chemistry/convert", headers=headers, json={
    "smiles": "Cn1c(=O)c2c(ncn2C)n(C)c1=O",
    "output_formats": ["inchi", "inchi_key", "mol", "canonical_smiles"]
})

response.raise_for_status()  # stop here on an HTTP error status
result = response.json()
print(f"Canonical SMILES: {result['canonical_smiles']}")
print(f"InChI:            {result['inchi']}")
print(f"InChI Key:        {result['inchi_key']}")
# A MOL block is a multi-line connection table (atoms, bonds, and 2D or 3D coordinates)
print(f"MOL block lines:  {len(result['mol'].splitlines())}")

This is useful when you have a SMILES string and need to feed it into a tool that expects InChI (like a chemical database), or when you receive an SDF file from a docking run and need to extract SMILES for further processing.

Calculating Molecular Properties from SMILES

One of the most common tasks in cheminformatics is computing physicochemical properties from a molecular structure. Traditionally this meant installing a toolkit such as RDKit locally, a C++ library with Python bindings that can be awkward to set up on some systems. SciRouter's API runs the same kind of calculation with no local installation:

Calculate properties for a single molecule

# Calculate properties for ibuprofen
# (reuses requests, BASE, and headers from the conversion example above)
response = requests.post(f"{BASE}/chemistry/properties", headers=headers, json={
    "smiles": "CC(C)Cc1ccc(C(C)C(=O)O)cc1"
})

props = response.json()
print(f"Molecular weight:  {props['molecular_weight']:.2f} g/mol")
print(f"LogP:              {props['logp']:.2f}")
print(f"TPSA:              {props['tpsa']:.1f} Å²")
print(f"H-bond donors:     {props['hbd']}")
print(f"H-bond acceptors:  {props['hba']}")
print(f"Rotatable bonds:   {props['rotatable_bonds']}")
print(f"Lipinski violations: {props['lipinski_violations']}")

Batch Property Calculation

For screening a compound library, send multiple SMILES in a single request:

Batch molecular properties

# Calculate properties for multiple molecules at once
# (reuses requests, BASE, and headers from the conversion example above)
compounds = {
    "aspirin":       "CC(=O)Oc1ccccc1C(=O)O",
    "caffeine":      "Cn1c(=O)c2c(ncn2C)n(C)c1=O",
    "ibuprofen":     "CC(C)Cc1ccc(C(C)C(=O)O)cc1",
    "acetaminophen": "CC(=O)Nc1ccc(O)cc1",
    "dopamine":      "NCCc1ccc(O)c(O)c1",
}

response = requests.post(f"{BASE}/chemistry/properties", headers=headers, json={
    "smiles_list": list(compounds.values()),
    "labels": list(compounds.keys())
})

results = response.json()["results"]
print(f"{'Name':<16} {'MW':>8} {'LogP':>6} {'TPSA':>6} {'Lipinski':>9}")
print("-" * 50)
for r in results:
    print(f"{r['label']:<16} {r['molecular_weight']:>8.1f} "
          f"{r['logp']:>6.2f} {r['tpsa']:>6.1f} "
          f"{r['lipinski_violations']:>9}")

Note

Lipinski's Rule of Five is a quick drug-likeness filter aimed at oral drugs: molecular weight no more than 500, LogP no more than 5, no more than 5 hydrogen bond donors, and no more than 10 acceptors. All five molecules above pass, as expected for small, well-known drug molecules.
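
For intuition about what a violation count means, the rule can be written as four comparisons. This is a minimal sketch of the textbook rule using the thresholds in the note above, not a description of how the API computes its lipinski_violations field; the function name and the input values below are hypothetical.

```python
def lipinski_violations(molecular_weight, logp, hbd, hba):
    """Count how many of the four Rule of Five thresholds are exceeded."""
    checks = [
        molecular_weight > 500,  # molecular weight in g/mol
        logp > 5,                # octanol-water partition coefficient
        hbd > 5,                 # hydrogen bond donors
        hba > 10,                # hydrogen bond acceptors
    ]
    return sum(checks)

# Hypothetical property values, for illustration only.
print(lipinski_violations(molecular_weight=320.4, logp=2.1, hbd=2, hba=5))
print(lipinski_violations(molecular_weight=650.0, logp=6.1, hbd=6, hba=12))
```

Values exactly at a threshold are treated as passing here, which matches the usual statement of the rule.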

SMILES in Drug Discovery Workflows

SMILES strings are the connective tissue that links different stages of a computational drug discovery pipeline. Here is how they flow through a typical workflow:

Compound enumeration: Combinatorial chemistry tools generate SMILES for virtual libraries of candidate molecules.
Property filtering: SMILES feed into property calculators to filter by drug-likeness, solubility, and other physicochemical criteria.
Similarity searching: Chemical fingerprints derived from SMILES enable similarity searches against known active compounds.
Molecular docking: Docking tools like DiffDock accept SMILES as ligand input and predict binding poses.
Machine learning: SMILES are tokenized as input to molecular property prediction models, generative chemistry models, and ADMET predictors.
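
The similarity-searching step usually comes down to comparing two fingerprints with the Tanimoto coefficient. The sketch below assumes the fingerprints have already been computed elsewhere and are available as sets of on-bit indices; the bit values shown are made up for illustration and do not correspond to real molecules.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient: shared on-bits divided by total distinct on-bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # two empty fingerprints: define similarity as zero
    return len(a & b) / len(union)

# Hypothetical on-bit indices, for illustration only.
query = {3, 17, 42, 101, 256}
candidate = {3, 17, 42, 99, 256, 300}
print(f"Tanimoto similarity: {tanimoto(query, candidate):.2f}")
```

A score of 1.0 means identical fingerprints and 0.0 means no bits in common; the cutoff that counts as "similar" depends on the fingerprint type and the application.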

For a hands-on example of calculating molecular properties without installing RDKit, see our molecular properties from SMILES tutorial.

Common SMILES Pitfalls

Even experienced chemists make mistakes with SMILES. Here are the most frequent issues:

Mismatched ring digits: Every ring-opening digit must have a matching ring-closing digit. C1CCC is invalid because digit 1 never closes.
Unbalanced parentheses: Every opening parenthesis needs a closing one. CC(=O is invalid.
Valence errors: Carbon has valence 4. C(C)(C)(C)(C)C puts five bonds on the first carbon and is invalid. Some tools silently accept valence violations, which leads to garbage results downstream.
Aromatic vs. non-aromatic: Writing c1ccc1 (a four-membered aromatic ring) is chemically wrong because cyclobutadiene is not aromatic, and toolkits may reject the string or reinterpret it. Use uppercase atoms with explicit bonds for non-aromatic rings: C1=CC=C1 for cyclobutadiene, C1CCC1 for cyclobutane.
Missing stereochemistry: If your application requires distinguishing enantiomers, you must include @/@@ notation. Without it the stereocenter is unspecified, so the SMILES cannot distinguish one stereoisomer from another or from a mixture.
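
Some of these problems can be caught with a cheap text scan before a string is sent to a real parser. The sketch below checks only two of the pitfalls above, unbalanced parentheses and absent stereo marks; the function names are hypothetical, and passing these checks does not mean the SMILES is chemically valid.

```python
def check_parentheses(smiles):
    """Return None if branch parentheses balance, else a short error message."""
    depth = 0
    for pos, ch in enumerate(smiles):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return f"unexpected ')' at position {pos}"
    if depth > 0:
        return f"{depth} unclosed '('"
    return None

def has_stereo(smiles):
    """True if the string carries any tetrahedral or double-bond stereo marks."""
    return any(mark in smiles for mark in ("@", "/", "\\"))

for smiles in ["CC(=O)O", "CC(=O", "[C@@H](F)(Cl)Br"]:
    print(f"{smiles:<18} parentheses: {check_parentheses(smiles)}  "
          f"stereo: {has_stereo(smiles)}")
```

Valence and aromaticity problems need an actual molecular graph, so those are better left to a toolkit or to the validation endpoint.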

Canonicalization: Getting a Unique SMILES

Because the same molecule can be written as many different SMILES strings, databases and search tools use canonical SMILES – a single standardized representation. You can get canonical SMILES through SciRouter's format conversion endpoint:

Canonicalize SMILES strings

# These are all valid SMILES for ethanol
# (reuses requests, BASE, and headers from the conversion example above)
ethanol_variants = ["CCO", "OCC", "C(O)C", "[CH3][CH2][OH]"]

for smiles in ethanol_variants:
    resp = requests.post(f"{BASE}/chemistry/convert", headers=headers, json={
        "smiles": smiles,
        "output_formats": ["canonical_smiles"]
    }).json()
    print(f"{smiles:20s} -> {resp['canonical_smiles']}")

# All four should come back as the same canonical SMILES (typically CCO)

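Canonicalization is essential when comparing molecules, deduplicating compound libraries, or looking up molecules in a database. Always canonicalize before comparing SMILES strings: two different non-canonical strings can describe the same molecule, so a plain string comparison will miss matches. Canonical forms are also toolkit-specific, so compare only strings produced by the same canonicalizer.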
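
Deduplicating a library then reduces to grouping by canonical form. The sketch below takes the canonicalizer as a parameter, so it could wrap the conversion endpoint or a local toolkit; here a small hard-coded lookup table stands in for a real canonicalizer so the example runs offline. The function name deduplicate and the table are assumptions for illustration.

```python
def deduplicate(smiles_list, canonicalize):
    """Map each canonical SMILES to the first input string that produced it."""
    unique = {}
    for smiles in smiles_list:
        unique.setdefault(canonicalize(smiles), smiles)
    return unique

# Stand-in for a real canonicalizer: a hard-coded table for this demo only.
DEMO_CANONICAL = {
    "CCO": "CCO",
    "OCC": "CCO",
    "C(O)C": "CCO",
    "CC(=O)O": "CC(=O)O",
    "OC(C)=O": "CC(=O)O",
}

library = ["CCO", "OCC", "CC(=O)O", "C(O)C", "OC(C)=O"]
unique = deduplicate(library, DEMO_CANONICAL.get)
print(f"{len(library)} inputs -> {len(unique)} unique molecules")
```

Swapping in a real canonicalizer only changes the second argument; the grouping logic stays the same.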

Beyond SMILES: Other Molecular Representations

SMILES is the most popular but not the only molecular text format. Here is how it compares:

InChI: Unique canonical identifier. Better for databases and registration. Harder to read. Use SciRouter's format conversion to translate between SMILES and InChI.
SMARTS: A superset of SMILES designed for substructure searching. Supports wildcards, atom lists, and recursive patterns. Used for reaction mapping and toxicophore screening.
MOL/SDF: Connection table format with optional 3D coordinates. Used by docking tools and crystal structure databases. More verbose but encodes geometry.
SELFIES: Self-referencing embedded strings. A newer format designed for machine learning that guarantees every string maps to a valid molecule. Increasingly used in generative chemistry.

Next Steps

Now that you understand SMILES notation, you can use it as input to any of SciRouter's chemistry tools:

Calculate physicochemical properties with the molecular properties endpoint
Convert between formats using format conversion
Dock molecules against protein targets with DiffDock
Learn how to calculate molecular properties without installing RDKit

All endpoints accept standard SMILES as input. For best results, use canonical SMILES and validate your structures before submitting to computationally intensive tools like docking. Sign up for a free API key to start exploring.
