Drug Discovery Tools

Latent Space Biology: How AI Imagines New Molecules and Proteins

What is latent space in biology? Learn how VAEs, diffusion models, and RL navigate molecular and protein space to design new drugs and proteins. The definitive guide for biologists.

Ryan Bethencourt
April 8, 2026
10 min read

What Is a Latent Space? A Map of Everything Possible

Imagine you had a map of every molecule that could ever exist. Not just the molecules that have been synthesized in labs or found in nature, but every possible arrangement of atoms that forms a stable, drug-like compound. This map would be impossibly large – chemists estimate there are roughly 10^60 possible drug-like molecules, vastly more than the number of stars in the observable universe. No one can draw this map by hand.

But AI can learn an approximation of it. When a neural network is trained on millions of known molecules, it builds an internal representation – a compressed coordinate system – where every molecule gets a position. Similar molecules end up near each other. Molecules that share a scaffold cluster together. Molecules with similar biological activity occupy neighboring regions, even if their chemical structures look different on paper. This internal coordinate system is called a latent space.

The word "latent" means hidden. The latent space is the hidden representation that the model learned, not something a human designed. It emerges from the patterns in the training data. And it turns out to be extraordinarily useful, because once you have a map of molecular space, you can navigate it. You can walk from one molecule to a neighboring one. You can search for regions with specific properties. You can even do arithmetic – combining the coordinates of two molecules to create a third with blended characteristics.

This article explains latent spaces for biologists and chemists who use these tools but want to understand what is happening under the hood. No equations, no code prerequisites – just the conceptual framework you need to think clearly about generative AI in drug discovery and protein design.

The Chemical Universe Map: A Visual Metaphor

Think of the latent space as a vast landscape. Each point on this landscape is a molecule. The terrain has structure: there are mountain ranges of potent kinase inhibitors, valleys of antibiotics, plains of anti-inflammatory compounds. Molecules that are chemically similar are geographically close. Molecules that do completely different things are on different continents.

Traditional drug discovery is like exploring this landscape on foot. You start at a known active compound (say, aspirin) and take small steps in every direction – adding a methyl group here, swapping a nitrogen for a carbon there. You explore the immediate neighborhood thoroughly, but you never leave the aspirin continent. Medicinal chemists call this "SAR exploration" (structure-activity relationships), and it has produced most of the drugs on the market today.

Generative AI models give you a helicopter. They learn the entire landscape from training data and can fly you to any point on the map. Want a molecule that combines the potency of one compound with the metabolic stability of another? The model can navigate to a region of the map that satisfies both criteria, even if that region is on a completely different continent from where you started. This is why generative chemistry is such a profound shift – it changes drug design from local optimization to global exploration.

The landscape metaphor also explains why latent spaces are trained to be smooth. A well-trained model creates a landscape where you can walk between two molecules, passing through chemically plausible intermediate structures most of the way. The terrain is not perfectly free of cliffs – some points decode to invalid structures – but it is far more continuous than raw chemical space. This smoothness is what makes interpolation and arithmetic possible, and it is the key property that separates a latent space from a random collection of molecular fingerprints.
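The interpolation idea is easy to sketch with plain numpy. The vectors below are made-up stand-ins for encoder outputs – in a real workflow each intermediate point would be handed to a trained decoder to recover a structure:

```python
import numpy as np

def interpolate_latent(z_a, z_b, steps=5):
    """Linearly interpolate between two latent vectors.

    In a real pipeline z_a and z_b come from a trained encoder and each
    intermediate point is passed to the decoder; here the vectors are
    just illustrative numpy arrays.
    """
    alphas = np.linspace(0.0, 1.0, steps)
    return [(1 - a) * z_a + a * z_b for a in alphas]

# Two hypothetical 4-dimensional latent coordinates (toy values)
z_aspirin = np.array([0.1, -0.4, 0.7, 0.2])
z_ibuprofen = np.array([0.5, 0.1, -0.2, 0.6])

path = interpolate_latent(z_aspirin, z_ibuprofen, steps=5)
for i, z in enumerate(path):
    print(f"step {i}: {np.round(z, 2)}")
```

The midpoint of the path is exactly the average of the two endpoint coordinates – that averaged point, decoded, is what "blending two molecules" means in practice.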

How AI Models Learn Molecular Latent Spaces

Three main architectures are used to learn latent spaces for molecules. Each creates a different kind of map with different properties. Understanding the differences helps you choose the right tool for your application.

Variational Autoencoders (VAEs)

A variational autoencoder has two halves: an encoder and a decoder. The encoder takes a molecule (as a SMILES string or molecular graph) and compresses it into a point in latent space – typically a vector of 128 to 512 numbers. The decoder takes a point in latent space and reconstructs a molecule from it. The model is trained so that encoding and then decoding a molecule recovers the original structure.

The key innovation of VAEs is that the latent space is forced to be smooth and continuous. The model is penalized for creating gaps or isolated clusters. This means you can sample a random point from the latent space and decode it into a valid molecule, even if that exact point was never seen during training. You can also interpolate between two molecules by moving along a straight line between their latent coordinates – most points along the line decode to valid, chemically reasonable intermediate structures (a few will not, which is why generation pipelines typically sample several points and filter).

VAEs were the first architecture widely used for molecular generation. The landmark paper "Automatic Chemical Design Using a Data-Driven Continuous Representation of Molecules" (Gómez-Bombarelli et al., 2018) showed that you could optimize molecules for desired properties by searching through the latent space with Bayesian optimization. Since then, VAEs have been used for fragment-based drug design, reaction prediction, and molecular property optimization.

Diffusion Models

Diffusion models take a different approach. Instead of compressing molecules into a latent space and reconstructing them, diffusion models learn to start with pure noise and iteratively refine it into a valid structure. Each refinement step removes a small amount of noise, and after enough steps, a clean molecule emerges.
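The sampling loop has a simple shape that a toy sketch can show. The "denoiser" below cheats by stepping toward a known answer – a real model predicts the noise to remove from learned weights – but the iterative refinement structure is the same:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "clean" 3D coordinates for a 5-atom fragment (toy data)
clean = rng.normal(size=(5, 3))

# Start from pure noise, exactly as a diffusion sampler does
x = rng.normal(size=(5, 3))

# Toy denoiser: we cheat and step a fraction of the way toward the known
# answer each iteration. A trained model would instead predict this
# direction from the noisy input alone.
for step in range(50):
    predicted_direction = clean - x      # stand-in for the learned denoiser
    x = x + 0.1 * predicted_direction    # remove a little "noise" per step

print(f"final error after 50 steps: {np.linalg.norm(x - clean):.4f}")
```

After enough small steps, the noise has been almost entirely removed and a clean structure remains – the same many-small-corrections logic that lets real diffusion models build up valid 3D molecules from random starting points.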

The latent space of a diffusion model is implicit – it is encoded in the learned denoising function rather than in an explicit coordinate system. This makes diffusion models less interpretable than VAEs, but they tend to generate higher-quality and more diverse outputs. DiffDock, the molecular docking tool, is a diffusion model that operates in the space of protein-ligand binding poses. Other diffusion models generate 3D molecular conformations or entirely new molecular structures.

For drug discovery, diffusion models are particularly powerful when you need to generate 3D structures rather than 2D graphs or SMILES strings. They naturally produce molecules in three dimensions, respecting bond angles, ring strain, and steric constraints that flat representations miss.

Reinforcement Learning (RL)

Reinforcement learning does not create a latent space in the traditional sense. Instead, it trains a generative model (usually an RNN or transformer) to produce molecules that score well on a reward function. The model starts by generating random valid SMILES strings, evaluates them with a scoring function (which might include predicted binding affinity, drug-likeness, and synthetic accessibility), and updates its generation policy to produce more high-scoring molecules.
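The reward function is just a weighted combination of property scores. Here is a minimal sketch with hypothetical, pre-scaled scores – real systems like REINVENT4 compute these from predictive models and support richer aggregation schemes:

```python
# Hypothetical per-molecule property scores, each already scaled to [0, 1]
candidates = {
    "mol_A": {"affinity": 0.9, "drug_likeness": 0.4, "sa": 0.8},
    "mol_B": {"affinity": 0.6, "drug_likeness": 0.9, "sa": 0.9},
}
weights = {"affinity": 1.0, "drug_likeness": 0.8, "sa": 0.5}

def composite_score(scores, weights):
    """Weighted average of property scores -- the 'reward' an RL
    generator is trained to maximize."""
    total_weight = sum(weights.values())
    return sum(weights[k] * scores[k] for k in weights) / total_weight

# Rank candidates by composite reward, best first
ranked = sorted(
    candidates,
    key=lambda m: composite_score(candidates[m], weights),
    reverse=True,
)
for name in ranked:
    print(f"{name}: {composite_score(candidates[name], weights):.3f}")
```

Note how the weights encode design priorities: mol_A wins on affinity alone, but mol_B's balanced profile scores higher under this weighting – precisely the trade-off reasoning the RL policy learns to internalize.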

REINVENT4, developed by AstraZeneca, is the most widely used RL-based molecular generator. It does not give you a navigable latent space map, but it gives you something arguably more practical: a generator that is specifically tuned to produce molecules matching your design criteria. Think of it less as a map and more as a guide who knows the landscape and can take you directly to the region you need.

In practice, RL-based generators are often combined with VAE-learned latent spaces. The RL policy explores the VAE latent space, combining the global coverage of the map with the goal-directed efficiency of reinforcement learning.

Latent Space Arithmetic: Aspirin + More Potent - More Toxic = ?

One of the most striking properties of well-trained latent spaces is that you can do meaningful arithmetic with molecular representations. This is analogous to the famous word embedding result where "king - man + woman = queen" – except with molecules instead of words.

Here is how it works in principle. Suppose you have a molecule A that is potent but toxic, and a molecule B that is non-toxic but weak. You encode both into the latent space. Then you compute a "toxicity direction" by finding the vector that points from non-toxic molecules toward toxic ones (learned from a dataset of molecules with known toxicity labels). You subtract this toxicity direction from molecule A's latent vector, effectively moving it away from the toxic region of the map. The resulting latent vector, when decoded, gives you a new molecule that retains some of A's potency while reducing toxicity.
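The recipe can be sketched in a few lines of numpy. The latent vectors below are synthetic clusters, not real encoder outputs, but the arithmetic is exactly the operation described above:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical latent vectors for labeled training molecules (8-dim toy space)
toxic = rng.normal(loc=1.0, size=(20, 8))      # cluster of toxic molecules
nontoxic = rng.normal(loc=-1.0, size=(20, 8))  # cluster of non-toxic ones

# The "toxicity direction": the vector pointing from the non-toxic
# centroid toward the toxic centroid
tox_direction = toxic.mean(axis=0) - nontoxic.mean(axis=0)

# Molecule A: potent but toxic (a point near the toxic cluster)
z_A = rng.normal(loc=1.0, size=8)

# Subtract the toxicity direction to move A away from the toxic region.
# In a real workflow, z_edited would be decoded back into a structure.
z_edited = z_A - tox_direction

def dist(a, b):
    return float(np.linalg.norm(a - b))

centroid_nontoxic = nontoxic.mean(axis=0)
print(f"distance to non-toxic centroid before edit: {dist(z_A, centroid_nontoxic):.2f}")
print(f"distance to non-toxic centroid after edit:  {dist(z_edited, centroid_nontoxic):.2f}")
```

The edited vector lands much closer to the non-toxic cluster – the vector subtraction has encoded "make it less toxic" as a geometric move.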

In practice, latent space arithmetic works but is approximate. The decoded molecule will not be exactly what you hope for – it will be a suggestion, a starting point that needs evaluation and refinement. But it is a remarkably powerful way to encode design intent into a mathematical operation. Instead of telling a chemist "make this molecule less toxic but keep the potency," you can express that as a vector operation and get a concrete molecular structure back.

SciRouter's molecule generation tools use these principles internally. When you specify optimization objectives like "maximize predicted binding affinity while keeping synthetic accessibility below 4.0," the model is navigating a latent space toward the region that satisfies your constraints. You do not need to manipulate latent vectors directly – you express your intent as objectives, and the model navigates the space for you.

Multi-objective molecule generation with latent space navigation
import os, requests, time

API_KEY = os.environ["SCIROUTER_API_KEY"]
BASE = "https://api.scirouter.ai/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Generate molecules that balance potency, safety, and synthesizability
# Internally, the model navigates its latent space toward the region
# satisfying all three objectives simultaneously
job = requests.post(f"{BASE}/chemistry/generate", headers=HEADERS, json={
    "model": "reinvent4",
    "num_molecules": 50,
    "objectives": {
        "similarity": {
            "weight": 0.6,
            "reference_smiles": "CC(=O)Oc1ccccc1C(=O)O",  # aspirin
            "min_similarity": 0.2,
            "max_similarity": 0.6,
        },
        "drug_likeness": {"weight": 1.0, "method": "lipinski"},
        "synthetic_accessibility": {"weight": 0.8, "max_sa_score": 3.5},
        "molecular_weight": {"weight": 0.3, "min": 200, "max": 450},
    },
}).json()

print(f"Job submitted: {job['job_id']}")

# Poll for results
while True:
    result = requests.get(
        f"{BASE}/chemistry/generate/{job['job_id']}", headers=HEADERS
    ).json()
    if result["status"] == "completed":
        break
    if result["status"] == "failed":
        raise RuntimeError(result.get("error", "Generation failed"))
    time.sleep(5)

print(f"Generated {len(result['molecules'])} molecules\n")
for i, mol in enumerate(result["molecules"][:5]):
    print(f"Molecule {i+1}: {mol['smiles']}")
    print(f"  Similarity to aspirin: {mol['scores']['similarity']:.2f}")
    print(f"  Drug-likeness: {mol['scores']['drug_likeness']:.2f}")
    print(f"  SA score: {mol['scores']['synthetic_accessibility']:.1f}")
    print()
Note
The similarity bounds (0.2 to 0.6 relative to aspirin) ensure generated molecules are inspired by aspirin's scaffold but are genuinely novel – different enough to represent new chemical matter, close enough to share useful properties. This is latent space navigation expressed as an API parameter.
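Similarity bounds like these are typically computed as Tanimoto similarity over molecular fingerprints. A minimal sketch using Python sets as stand-ins for fingerprint on-bits (real pipelines compute actual fingerprints, e.g. RDKit Morgan fingerprints):

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints represented as
    sets of on-bit indices: |intersection| / |union|."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

# Hypothetical on-bit sets standing in for Morgan fingerprints
fp_aspirin = {3, 17, 42, 58, 91, 120}
fp_candidate = {3, 17, 42, 77, 130, 205, 311}

sim = tanimoto(fp_aspirin, fp_candidate)
in_window = 0.2 <= sim <= 0.6  # the "novel but related" band from above
print(f"Tanimoto similarity: {sim:.2f}, within window: {in_window}")
```

A value near 0.3, as here, is exactly the regime the API parameters target: enough shared substructure to inherit useful properties, enough difference to count as new chemical matter.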

Protein Latent Spaces: The ESM Revolution

Everything described above for small molecules also applies to proteins, but the latent spaces are learned differently and encode different kinds of information. Protein language models like ESM-2 (Evolutionary Scale Modeling) learn latent representations by training on hundreds of millions of protein sequences, using the same masked-language-modeling approach as BERT. The model sees protein sequences with some amino acids masked out and learns to predict the missing residues from context.

In the process, ESM-2 learns an extraordinarily rich latent space. Each protein sequence is represented as a high-dimensional vector (the embedding) that captures evolutionary relationships, structural features, and functional properties. Proteins with similar structures have similar embeddings, even if their sequences have diverged beyond detectable homology. Proteins with similar functions cluster together, even if they achieve that function through completely different folds.

ESMFold, the protein structure prediction model available on SciRouter, uses ESM-2's latent representations as its foundation. When you submit a protein sequence to ESMFold, the model first computes the sequence embedding in ESM-2's latent space, then uses that embedding to predict the 3D structure. The accuracy of the structure prediction comes directly from the quality of the latent representation – the model has learned enough about protein biology from sequence data alone to predict how proteins fold.

Protein embeddings from ESM-2 are also useful on their own, without structure prediction. You can use them for protein classification, function prediction, variant effect prediction, and protein engineering. Two proteins with similar embeddings are likely to have similar structures and functions, making embeddings a powerful feature for machine learning models in biology.

Explore protein latent space with ESM-2 embeddings
import os, requests
import numpy as np

API_KEY = os.environ["SCIROUTER_API_KEY"]
BASE = "https://api.scirouter.ai/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Get embeddings for three related proteins
proteins = {
    "human_hemoglobin_alpha":  "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH",
    "human_hemoglobin_beta":   "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST",
    "human_myoglobin":         "MGLSDGEWQLVLNVWGKVEADIPGHGQEVLIRLFKGHPETLEKFDKFKHLK",
}

embeddings = {}
for name, seq in proteins.items():
    result = requests.post(f"{BASE}/proteins/embeddings",
        headers=HEADERS, json={"sequence": seq}).json()
    embeddings[name] = np.array(result["embedding"])
    print(f"{name}: embedding shape = {embeddings[name].shape}")

# Compute pairwise cosine similarities
def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print("\nPairwise similarities in ESM-2 latent space:")
names = list(embeddings.keys())
for i in range(len(names)):
    for j in range(i+1, len(names)):
        sim = cosine_sim(embeddings[names[i]], embeddings[names[j]])
        print(f"  {names[i]} vs {names[j]}: {sim:.3f}")

# Hemoglobin alpha and beta should be more similar to each other
# than either is to myoglobin, reflecting their evolutionary relationship

In this example, the ESM-2 embeddings capture the evolutionary and structural relationships between globin proteins. Hemoglobin alpha and beta, which form a heterodimer and share a common ancestor, will have more similar embeddings than either has to myoglobin, even though all three share the globin fold. The latent space has learned protein biology from sequence data alone.

Drug Design in Latent Space: REINVENT4 and Beyond

REINVENT4 is the most mature RL-based molecular generation platform and the primary generator available through SciRouter. While it does not use an explicit VAE latent space, it operates on a closely related principle: a generative network (a recurrent neural network or transformer) trained on SMILES strings has learned an implicit representation of chemical space, and reinforcement learning steers generation toward desired regions of that space.

The power of REINVENT4 lies in its multi-objective optimization. You define a scoring function with weighted components – predicted binding affinity to a target, Lipinski drug-likeness, synthetic accessibility, ADMET properties, novelty relative to known compounds – and the model generates molecules that maximize the composite score. Each generation round produces a batch of candidates, the best are selected, and the model updates to produce even better candidates in the next round.

This is latent space navigation in action, even if the space is not explicitly defined as a coordinate system. The model has learned the landscape of drug-like molecules and is walking through it toward the region that satisfies your multi-objective criteria. The pharmaceutical industry has validated this approach: multiple AI-designed molecules from similar generative platforms have entered Phase I and Phase II clinical trials, including compounds from Insilico Medicine and Exscientia.

Through the SciRouter API, you can access REINVENT4 without installing any software or managing GPU infrastructure. Define your objectives, submit a generation job, and receive a ranked list of novel molecules with computed properties. The entire workflow is accessible from a Python script or the Molecular Design Lab dashboard.

Antibody Design: Navigating Immune Latent Spaces

Antibodies are a special case where latent space thinking is particularly powerful. The antibody repertoire is generated by the immune system through V(D)J recombination and somatic hypermutation – a natural generative process that explores the space of possible antigen-binding sequences. AI models for antibody design learn latent representations of this space and navigate it to find sequences with desired binding properties.

AntiFold, available on SciRouter, designs antibody CDR (complementarity-determining region) sequences by learning the relationship between antibody structure and sequence. Given an antibody framework structure, AntiFold generates CDR sequences that are predicted to fold correctly and bind the target antigen. The model has learned a latent space of CDR sequences conditioned on structural context, allowing it to generate diverse but structurally valid binding loops.

ImmuneBuilder, also on SciRouter, works in the other direction – it predicts antibody 3D structure from sequence. Together, these tools let you iterate between sequence design and structure prediction, exploring the antibody latent space from both directions. Design a CDR sequence with AntiFold, predict its structure with ImmuneBuilder, evaluate the structure, and refine the design.

This iterative loop is how modern computational antibody engineering works. The latent space provides the terrain, and each design-predict-evaluate cycle is a step through that terrain toward a therapeutic antibody candidate.
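The design-predict-evaluate cycle reduces to a greedy loop. The sketch below uses toy stub functions in place of AntiFold and ImmuneBuilder – the function names and the scoring rule are hypothetical placeholders – but the loop structure mirrors the real workflow:

```python
import random

random.seed(7)

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def design_cdr(best_seq):
    """Design step: propose a variant by mutating one CDR position.
    Stand-in for a structure-conditioned generator like AntiFold."""
    pos = random.randrange(len(best_seq))
    return best_seq[:pos] + random.choice(ALPHABET) + best_seq[pos + 1:]

def evaluate(seq):
    """Evaluate step: toy score (fraction of aromatic/charged residues),
    a stand-in for a binding score computed on a predicted structure."""
    return sum(seq.count(a) for a in "FWYDEKR") / len(seq)

best = "GYTFTSYG"          # hypothetical starting CDR sequence
best_score = evaluate(best)

for cycle in range(30):
    candidate = design_cdr(best)   # design a variant
    score = evaluate(candidate)    # predict structure + score it
    if score > best_score:         # keep only improvements
        best, best_score = candidate, score

print(f"best sequence: {best}  score: {best_score:.2f}")
```

Each pass through the loop is one step across the terrain; the real workflow differs mainly in that the design and evaluate steps are API calls to structure-aware models rather than toy functions.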

The Future: Unified Biological Latent Spaces

Today, molecular latent spaces and protein latent spaces are learned separately. Small molecule models train on SMILES strings from ChEMBL. Protein models train on sequences from UniProt. These are separate maps for separate continents of biology.

The future is unified latent spaces that encode molecules, proteins, nucleic acids, and their interactions in a single representation. Early examples of this are already emerging. Models like Chai-1 and Boltz-2 predict protein-ligand complexes by learning joint representations of proteins and small molecules. ESM3, the latest evolution of protein language models, jointly models sequence, structure, and function in a unified latent space.

A truly unified biological latent space would let you ask questions like: "Given this protein target and this desired mechanism of action, what molecule should I make?" The answer would be a point in the joint latent space that simultaneously specifies the protein conformation, the ligand structure, and the binding mode. We are not there yet, but the trajectory is clear, and each generation of models moves closer.

For drug discovery, unified latent spaces would collapse the sequential pipeline (target identification, then lead discovery, then lead optimization, then ADMET profiling) into a single generative step. Instead of running five separate models and passing results between them, you would query one model that understands the entire biological context. This is probably five to ten years away for production use, but the foundational research is happening now.

Hands-On: Exploring Latent Spaces with SciRouter

You do not need to build your own neural networks to work with latent spaces. SciRouter exposes the practical outputs of latent space models through simple API calls. Here is how each tool connects to the latent space concepts in this article:

  • Molecule Generator (REINVENT4): Navigate molecular latent space to generate novel compounds with specified properties. You define the destination (your objectives), and the model finds the path.
  • ESMFold: Uses ESM-2's protein latent space to predict 3D structure from sequence. The accuracy of structure prediction is a direct consequence of the quality of the learned latent representation.
  • ProteinMPNN: Navigates protein sequence space to find sequences that fold into a desired backbone structure. This is inverse folding – going from structure to sequence by traversing the latent space in the reverse direction.

Here is a complete example that chains these tools together to explore both protein and molecular latent spaces in a single pipeline:

Multi-space exploration: protein folding + molecule generation + docking
import os, requests, time

API_KEY = os.environ["SCIROUTER_API_KEY"]
BASE = "https://api.scirouter.ai/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# --- Step 1: Protein latent space ---
# Fold a small GTPase domain (a KRAS fragment) to get the target structure
print("Step 1: Folding target protein via ESMFold...")
fold_result = requests.post(f"{BASE}/proteins/fold", headers=HEADERS, json={
    "sequence": "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSY"
        "RKQVVIDGETCLLDILDTAGQEEYSAMRDQYMRTGEGFLCVFA",
    "model": "esmfold",
}).json()

# Wait for fold
while True:
    status = requests.get(
        f"{BASE}/proteins/fold/{fold_result['job_id']}", headers=HEADERS
    ).json()
    if status["status"] == "completed":
        protein_pdb = status["pdb"]
        print(f"  Fold complete. Mean pLDDT: {status['mean_plddt']:.1f}")
        break
    if status["status"] == "failed":
        raise RuntimeError(status.get("error", "Fold failed"))
    time.sleep(3)

# --- Step 2: Molecular latent space ---
# Generate drug candidates targeting this protein
print("\nStep 2: Generating drug candidates via REINVENT4...")
gen_job = requests.post(f"{BASE}/chemistry/generate", headers=HEADERS, json={
    "model": "reinvent4",
    "num_molecules": 20,
    "objectives": {
        "drug_likeness": {"weight": 1.0, "method": "lipinski"},
        "synthetic_accessibility": {"weight": 0.8, "max_sa_score": 4.0},
        "molecular_weight": {"weight": 0.5, "min": 250, "max": 500},
    },
}).json()

while True:
    gen_result = requests.get(
        f"{BASE}/chemistry/generate/{gen_job['job_id']}", headers=HEADERS
    ).json()
    if gen_result["status"] == "completed":
        print(f"  Generated {len(gen_result['molecules'])} molecules")
        break
    if gen_result["status"] == "failed":
        raise RuntimeError(gen_result.get("error", "Generation failed"))
    time.sleep(5)

# --- Step 3: Bridge the two spaces ---
# Dock the top 3 generated molecules to the folded protein
print("\nStep 3: Docking top candidates to target protein...")
for i, mol in enumerate(gen_result["molecules"][:3]):
    dock_job = requests.post(f"{BASE}/docking/diffdock", headers=HEADERS, json={
        "protein_pdb": protein_pdb,
        "ligand_smiles": mol["smiles"],
        "num_poses": 3,
    }).json()

    while True:
        dock_result = requests.get(
            f"{BASE}/docking/diffdock/{dock_job['job_id']}", headers=HEADERS
        ).json()
        if dock_result["status"] in ("completed", "failed"):
            break
        time.sleep(3)

    if dock_result["status"] == "completed":
        conf = dock_result["poses"][0]["confidence"]
        print(f"  Molecule {i+1}: {mol['smiles'][:40]}...")
        print(f"    Docking confidence: {conf:.3f}")
    else:
        print(f"  Molecule {i+1}: docking failed")

print("\nPipeline complete. Three latent spaces explored in one script.")
Tip
This pipeline crosses three latent spaces: the ESM-2 protein latent space (for folding), the REINVENT4 molecular latent space (for generation), and the DiffDock interaction space (for docking). Each model has its own internal representation, and the SciRouter API bridges them by converting between biological formats (sequences, SMILES, PDB structures) at each step.

Key Takeaways for Biologists

If you take away three things from this article, let them be these:

  • Latent spaces are learned maps of biological possibility. They are not hand-designed feature sets. They emerge from training on large biological datasets and encode deep patterns that humans cannot easily specify.
  • Navigation is the key concept. Generating a new molecule or protein is not random sampling – it is directed navigation through a structured space toward a region with desired properties. Your design objectives define the destination. The model finds the path.
  • You do not need to build the models yourself. APIs like SciRouter give you access to state-of-the-art latent space models through simple function calls. Understanding the concepts helps you set better objectives and interpret results, but you can start using these tools today without any machine learning expertise.

Next Steps

To start exploring latent spaces in practice, sign up for a free SciRouter API key and start navigating biological latent spaces today. The map of everything possible is waiting to be explored.

Frequently Asked Questions

What is a latent space in biology?

A latent space is a compressed mathematical representation that an AI model learns from biological data. Think of it as a map where similar molecules or proteins are placed near each other. Each point in the map corresponds to a real or potential molecule, and moving through the space lets you smoothly transition between molecular structures. It is how AI models 'imagine' new biology.

Do I need to understand machine learning math to use latent space tools?

No. Latent space models are accessed through APIs that accept biological inputs (SMILES strings, protein sequences) and return biological outputs (new molecules, designed sequences, predicted structures). You do not need to understand the underlying neural network architecture to use these tools productively. This article explains the concepts so you can make informed decisions about which tools to apply.

What is the difference between a VAE and a diffusion model for molecule generation?

A variational autoencoder (VAE) compresses molecules into a smooth latent space and generates new molecules by sampling from that space. A diffusion model starts with random noise and iteratively refines it into a valid molecule. VAEs are faster and give you direct control over the latent space. Diffusion models tend to produce higher-quality and more diverse outputs but are slower and the latent space is less interpretable.

Can latent space models generate molecules that actually work as drugs?

Yes, with caveats. Generative models can produce molecules with desired computed properties like drug-likeness, synthetic accessibility, and predicted binding affinity. Several AI-designed molecules have entered clinical trials. However, computed properties are predictions, not guarantees. Every AI-generated candidate must still be synthesized and tested experimentally to confirm activity, safety, and manufacturability.

What is latent space arithmetic and does it really work?

Latent space arithmetic means adding and subtracting property vectors in the latent space to create molecules with combined properties. For example, encoding a potent molecule and a non-toxic molecule, then combining their latent vectors, may yield a molecule that is both potent and non-toxic. It works in principle and has been demonstrated in research, but the results are approximate and require experimental validation.

How do protein latent spaces differ from molecular latent spaces?

Molecular latent spaces are learned from small molecules (drugs, fragments) represented as SMILES strings or molecular graphs. Protein latent spaces are learned from amino acid sequences and capture evolutionary, structural, and functional relationships. Protein language models like ESM-2 learn latent representations from hundreds of millions of protein sequences, encoding deep biological knowledge about folding, function, and fitness.

Try this yourself

500 free credits. No credit card required.