What Is DiffDock?
Molecular docking is the computational method of predicting how a small molecule (ligand) binds to a protein target. It sits at the heart of structure-based drug discovery, virtual screening, and lead optimization. For decades, tools like AutoDock Vina have dominated this space using physics-based scoring functions and grid-based search algorithms. They work well, but they come with a significant constraint: you need to tell the software where to look by defining a 3D search box around the suspected binding site.
DiffDock, developed by Gabriele Corso and colleagues at MIT, takes a fundamentally different approach. It is a generative model built on the diffusion framework – the same class of models behind image generators like Stable Diffusion, but adapted for the geometry of molecular interactions. Instead of sampling poses within a predefined box, DiffDock learns the distribution of ligand poses from thousands of experimentally determined protein-ligand complexes in the Protein Data Bank (PDB). During inference, it starts from a random placement and iteratively refines the ligand's position, orientation, and torsion angles through learned denoising steps.
The result is blind docking: you provide a protein structure and a ligand, and DiffDock explores the entire protein surface to find plausible binding sites and poses. No search box definition, no grid calculation, no manual pocket identification. A separate confidence model then ranks the generated poses by predicted geometric accuracy.
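To make the idea of iterative refinement concrete, here is a toy sketch in plain Python. It is not DiffDock's model. The learned score network is replaced by a hypothetical hand-written function (`toy_score`) that simply points toward a known pocket center, and only the ligand's translation is refined, whereas DiffDock also updates rotation and torsion angles. All names and constants below are illustrative assumptions.

```python
import math
import random

def toy_score(position, pocket_center):
    """Stand-in for a learned score model: a vector pointing at the pocket.
    (Assumption for illustration; the real model is learned from PDB complexes.)"""
    return [c - p for p, c in zip(position, pocket_center)]

def toy_reverse_diffusion(pocket_center, steps=20, seed=0):
    """Refine a random starting position over a fixed number of denoising steps."""
    rng = random.Random(seed)
    # Start from a random placement, typically far from the pocket
    position = [rng.uniform(-30.0, 30.0) for _ in range(3)]
    for step in range(steps):
        # Injected noise shrinks linearly and reaches 0 at the final step
        noise_scale = 1.0 - (step + 1) / steps
        direction = toy_score(position, pocket_center)
        position = [
            p + 0.5 * d + rng.gauss(0.0, noise_scale)
            for p, d in zip(position, direction)
        ]
    return position

def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

pocket = [10.0, -5.0, 3.0]
final = toy_reverse_diffusion(pocket)
print(f"Distance to pocket after refinement: {distance(final, pocket):.2f} Å")
```

Running several trajectories with different seeds loosely mirrors how multiple candidate poses are produced: each run starts somewhere else and follows its own noisy path.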
Prerequisites
Before running your first DiffDock job, you need three things:
- A ligand in SMILES format: SMILES is a text representation of molecular structure. For example, aspirin is CC(=O)Oc1ccccc1C(=O)O. If you are unfamiliar with SMILES, see our complete SMILES notation guide.
- A protein target: Either a PDB ID (e.g., 7L10) that the server will fetch from the RCSB Protein Data Bank, or the raw text of a PDB file. The protein should be a single chain or a complex that matches the relevant biological assembly.
- A SciRouter API key: Sign up at scirouter.ai to get a free API key with 500 credits. DiffDock jobs are available on the free tier.
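Before spending credits, it can help to sanity-check these inputs locally. The sketch below uses only the standard library; the helper names and the `SCIROUTER_API_KEY` environment variable are our own assumptions, not part of the SciRouter API. The SMILES check is purely syntactic (bracket pairing) and does not prove the molecule is chemically valid.

```python
import os
import re

# Classic PDB IDs are 4 characters: a digit followed by three alphanumerics
PDB_ID_PATTERN = re.compile(r"^[0-9][A-Za-z0-9]{3}$")

def looks_like_pdb_id(text):
    """Return True if the text has the shape of a classic 4-character PDB ID."""
    return bool(PDB_ID_PATTERN.match(text.strip()))

def smiles_brackets_balanced(smiles):
    """Cheap syntax check: parentheses and square brackets must pair up."""
    stack = []
    pairs = {")": "(", "]": "["}
    for ch in smiles:
        if ch in "([":
            stack.append(ch)
        elif ch in pairs:
            if not stack or stack.pop() != pairs[ch]:
                return False
    return not stack

def load_api_key(env_var="SCIROUTER_API_KEY"):
    """Read the API key from an environment variable (the name is our choice)."""
    key = os.environ.get(env_var, "")
    if not key:
        raise RuntimeError(f"Set {env_var} before running the tutorial code")
    return key

print(looks_like_pdb_id("7L10"))
print(smiles_brackets_balanced("CC(=O)Oc1ccccc1C(=O)O"))
```

Keeping the key in an environment variable rather than in the script avoids committing it to version control by accident.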
Step 1: Set Up Your Environment
All you need is Python with the requests library. No GPU, no conda environment, no DiffDock installation.
import requests
import time

API_KEY = "sk-sci-your-api-key"  # Replace with your own key
BASE = "https://api.scirouter.ai/v1"
headers = {"Authorization": f"Bearer {API_KEY}"}

Step 2: Submit a Docking Job
Let's dock imatinib (Gleevec), a well-known tyrosine kinase inhibitor, against the ABL1 kinase domain. Imatinib's SMILES string encodes its full molecular structure, and we can reference the protein by its PDB ID.
# Imatinib (Gleevec) docked against ABL1 kinase
response = requests.post(f"{BASE}/docking/diffdock", headers=headers, json={
    "ligand_smiles": "Cc1ccc(NC(=O)c2ccc(CN3CCN(C)CC3)cc2)cc1Nc1nccc(-c2cccnc2)n1",
    "protein_pdb_id": "2HYY",       # ABL1 kinase domain crystal structure
    "num_poses": 10,                # Generate 10 candidate binding poses
    "num_inference_steps": 20       # Diffusion denoising steps (default: 20)
})
response.raise_for_status()  # Stop early on HTTP errors such as a rejected API key
job = response.json()
print(f"Job ID: {job['job_id']}")
print(f"Status: {job['status']}")
# Example output:
# Job ID: dock_abc123...
# Status: pending

The num_poses parameter controls how many distinct binding poses DiffDock generates. More poses increase the chance of finding the correct binding mode, but add compute time. For most use cases, 5 to 10 poses strike a good balance. The num_inference_steps parameter controls the number of diffusion denoising steps; the default of 20 works well for most targets.
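If you submit many jobs, it may be convenient to assemble the request body in one place. The helper below is a sketch, not part of any SciRouter client library: the function name is hypothetical, the field names are the ones used in this tutorial (including protein_pdb, which appears later for custom structures), and the rule that exactly one protein source is given is our assumption about sensible usage.

```python
def build_docking_payload(ligand_smiles, protein_pdb_id=None, protein_pdb=None,
                          num_poses=10, num_inference_steps=20):
    """Assemble the JSON body for a docking request with basic sanity checks."""
    if not ligand_smiles or not ligand_smiles.strip():
        raise ValueError("ligand_smiles must be a non-empty string")
    # Assumption: a request names exactly one protein source
    if (protein_pdb_id is None) == (protein_pdb is None):
        raise ValueError("Provide exactly one of protein_pdb_id or protein_pdb")
    if num_poses < 1:
        raise ValueError("num_poses must be at least 1")
    if num_inference_steps < 1:
        raise ValueError("num_inference_steps must be at least 1")
    payload = {
        "ligand_smiles": ligand_smiles.strip(),
        "num_poses": num_poses,
        "num_inference_steps": num_inference_steps,
    }
    if protein_pdb_id is not None:
        payload["protein_pdb_id"] = protein_pdb_id
    else:
        payload["protein_pdb"] = protein_pdb
    return payload

payload = build_docking_payload(
    "Cc1ccc(NC(=O)c2ccc(CN3CCN(C)CC3)cc2)cc1Nc1nccc(-c2cccnc2)n1",
    protein_pdb_id="2HYY",
)
print(payload["protein_pdb_id"], payload["num_poses"])
```

The returned dictionary can be passed as the json argument of requests.post, exactly like the inline dictionary in this step.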
Step 3: Poll for Results
DiffDock runs asynchronously on GPU. A typical job completes in 30 seconds to 2 minutes. Poll the job endpoint until the status changes to completed.
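For unattended scripts, it may be worth bounding the wait so that a stalled job cannot block forever. The sketch below shows one way to poll with a timeout. Here `fetch_status` is a hypothetical zero-argument callable standing in for the GET request used in this step, which lets the logic be tried offline; the timeout and interval defaults are arbitrary choices.

```python
import time

def wait_for_job(fetch_status, timeout_s=300.0, interval_s=5.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until the job completes, fails, or the timeout passes.

    fetch_status returns the job's JSON as a dict with a "status" key.
    """
    deadline = clock() + timeout_s
    while True:
        result = fetch_status()
        status = result.get("status")
        if status == "completed":
            return result
        if status == "failed":
            raise RuntimeError(f"Job failed: {result.get('error', 'Unknown error')}")
        if clock() >= deadline:
            raise TimeoutError(f"Job still '{status}' after {timeout_s} seconds")
        sleep(interval_s)

# Offline demonstration with a fake status source and no real sleeping
_responses = iter([
    {"status": "pending"},
    {"status": "pending"},
    {"status": "completed", "poses": []},
])
done = wait_for_job(lambda: next(_responses), sleep=lambda seconds: None)
print(done["status"])
```

In real use, `fetch_status` would wrap the same requests.get call as the basic loop in this step and return its parsed JSON.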
# Poll until the job finishes
while True:
    result = requests.get(
        f"{BASE}/docking/diffdock/{job['job_id']}",
        headers=headers
    ).json()
    if result["status"] == "completed":
        break
    elif result["status"] == "failed":
        # Stop here: a failed job has no poses to report
        raise RuntimeError(f"Job failed: {result.get('error', 'Unknown error')}")
    print(f"Status: {result['status']}... waiting")
    time.sleep(5)
print(f"Generated {len(result['poses'])} poses")

Step 4: Interpret the Results
Each pose in the response includes a confidence score and the docked ligand coordinates in SDF format. The confidence score is the key metric – it predicts how close the generated pose is to the true binding geometry.
# Sort poses by confidence (highest first)
poses = sorted(result["poses"], key=lambda p: p["confidence"], reverse=True)
for i, pose in enumerate(poses):
    print(f"Pose {i+1}:")
    print(f"  Confidence: {pose['confidence']:.3f}")
    print(f"  Ligand atoms: {pose['num_atoms']}")
    print()
# Save the top pose as an SDF file for visualization
with open("top_pose.sdf", "w") as f:
    f.write(poses[0]["ligand_sdf"])
# Save the protein with docked ligand for PyMOL/ChimeraX
with open("complex.pdb", "w") as f:
    f.write(result["protein_pdb"])

Understanding Confidence Scores
DiffDock's confidence model is trained to predict the probability that a generated pose falls within 2 angstroms RMSD of the experimentally determined binding mode. Here is a practical guide for interpreting scores:
- Above 0.7: High confidence. The predicted pose is very likely close to the true binding mode. Use this for downstream analysis with high reliability.
- 0.4 to 0.7: Moderate confidence. The overall binding site is probably correct, but fine details of the pose (specific hydrogen bonds, rotamer states) may be approximate.
- 0.2 to 0.4: Low confidence. The binding site may be correct, but the pose geometry is uncertain. Consider generating more poses or validating with an orthogonal method.
- Below 0.2: Very low confidence. The ligand may not bind at the predicted location. This could indicate a genuinely weak interaction, or the protein structure may need refinement.
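These bands are easy to encode as a small helper for reports or filters. The function below is a sketch with a name of our choosing. How the exact boundary values (0.2, 0.4, 0.7) are assigned is an assumption, since the ranges above overlap at their edges.

```python
def confidence_label(score):
    """Map a confidence score to the interpretation bands described above."""
    if score > 0.7:
        return "high"
    if score >= 0.4:
        return "moderate"   # assumption: exactly 0.4 and 0.7 count as moderate
    if score >= 0.2:
        return "low"        # assumption: exactly 0.2 counts as low
    return "very low"

for example in (0.85, 0.55, 0.30, 0.05):
    print(f"{example:.2f} -> {confidence_label(example)}")
```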
Step 5: Batch Docking Multiple Ligands
In a real drug discovery workflow, you typically dock multiple candidate ligands against the same target. Here is how to run batch docking efficiently:
# Define a panel of kinase inhibitors to dock against ABL1
ligands = {
    "imatinib": "Cc1ccc(NC(=O)c2ccc(CN3CCN(C)CC3)cc2)cc1Nc1nccc(-c2cccnc2)n1",
    "dasatinib": "Cc1nc(Nc2ncc(C(=O)Nc3c(C)cccc3Cl)s2)cc(N2CCN(CCO)CC2)n1",
    "nilotinib": "Cc1cn(-c2cc(NC(=O)c3ccc(C)c(Nc4nccc(-c5cccnc5)n4)c3)cc(C(F)(F)F)c2)cn1",
    "ponatinib": "Cc1ccc(C(=O)Nc2ccc(CN3CCN(C)CC3)c(C(F)(F)F)c2)cc1C#Cc1cnc2cccnn12",
}
# Submit all jobs
jobs = {}
for name, smiles in ligands.items():
    resp = requests.post(f"{BASE}/docking/diffdock", headers=headers, json={
        "ligand_smiles": smiles,
        "protein_pdb_id": "2HYY",
        "num_poses": 5
    }).json()
    jobs[name] = resp["job_id"]
    print(f"Submitted {name}: {resp['job_id']}")
# Collect results
results = {}
for name, job_id in jobs.items():
    while True:
        r = requests.get(f"{BASE}/docking/diffdock/{job_id}", headers=headers).json()
        if r["status"] in ("completed", "failed"):
            results[name] = r
            break
        time.sleep(3)

# Best confidence per ligand; failed jobs or jobs without poses count as 0
def top_confidence(result):
    return max((p["confidence"] for p in result.get("poses") or []), default=0.0)

# Rank by top confidence score
for name in sorted(results, key=lambda n: top_confidence(results[n]), reverse=True):
    print(f"{name}: top confidence = {top_confidence(results[name]):.3f}")

DiffDock vs AutoDock Vina: When to Use Which
DiffDock and Vina are complementary tools rather than direct replacements for each other. Here is a practical decision framework:
- Unknown binding site: Use DiffDock. Its blind docking capability explores the entire protein surface without requiring you to guess where the ligand binds.
- Known binding pocket: Use AutoDock Vina. With a well-defined search box, Vina is faster and provides binding affinity estimates in kcal/mol.
- Large-scale virtual screening (10,000+ compounds): Use Vina. It docks a ligand in seconds on CPU, making it practical for screening large libraries.
- Lead optimization (tens of compounds): Use DiffDock for higher-quality pose predictions, then optionally rescore with Vina for affinity ranking.
- Novel targets with no known ligands: Use DiffDock. Its ability to discover cryptic and allosteric binding sites is valuable for targets without established pharmacology.
For a detailed head-to-head comparison with benchmarks, see our DiffDock vs AutoDock Vina comparison.
Visualizing Results
After saving the top pose SDF and protein PDB files, you can visualize the predicted complex in any molecular viewer. Here are quick commands for popular tools:
# Load protein and docked ligand in PyMOL
pymol complex.pdb top_pose.sdf
# In PyMOL console:
# show cartoon, complex
# show sticks, top_pose
# zoom top_pose

You can also use UCSF ChimeraX, Mol* (web-based), or any tool that reads PDB and SDF formats. Look for hydrogen bonds, hydrophobic contacts, and pi-stacking interactions between the ligand and nearby residues to assess whether the predicted pose makes chemical sense.
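A quick programmatic check can complement visual inspection by listing which residues sit close to the docked ligand. The sketch below uses only the standard library and relies on two assumptions: the ligand SDF uses the common V2000 layout, and the protein file follows the fixed-column PDB format. The function names, the 4 Å cutoff, and the demo coordinates are illustrative, not real docking output.

```python
import math

def parse_sdf_coords(sdf_text):
    """Read atom coordinates from the first molecule of a V2000 SDF block."""
    lines = sdf_text.splitlines()
    num_atoms = int(lines[3][0:3])  # counts line: first 3 columns hold the atom count
    coords = []
    for line in lines[4:4 + num_atoms]:
        # V2000 atom lines store x, y, z in three 10-character columns
        coords.append((float(line[0:10]), float(line[10:20]), float(line[20:30])))
    return coords

def parse_pdb_atoms(pdb_text):
    """Yield (chain, residue name, residue number, xyz) for each ATOM record."""
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            yield line[21], line[17:20].strip(), int(line[22:26]), xyz

def residues_near_ligand(pdb_text, sdf_text, cutoff=4.0):
    """Return sorted (chain, number, name) for residues with any atom within cutoff Å."""
    ligand = parse_sdf_coords(sdf_text)
    near = set()
    for chain, resname, resnum, xyz in parse_pdb_atoms(pdb_text):
        if any(math.dist(xyz, atom) <= cutoff for atom in ligand):
            near.add((chain, resnum, resname))
    return sorted(near)

def _pdb_atom(serial, name, resname, chain, resnum, x, y, z):
    """Format one ATOM line in PDB fixed columns (used for the demo data only)."""
    return (f"ATOM  {serial:5d} {name:<4s} {resname:>3s} {chain}{resnum:4d}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")

# Tiny synthetic example with made-up coordinates
demo_sdf = "\n".join([
    "demo", "", "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    1.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END", "$$$$",
])
demo_pdb = "\n".join([
    _pdb_atom(1, "CA", "ALA", "A", 10, 2.0, 0.0, 0.0),   # 1 Å from the ligand atom
    _pdb_atom(2, "CA", "GLY", "A", 50, 20.0, 0.0, 0.0),  # 19 Å away
])
print(residues_near_ligand(demo_pdb, demo_sdf))
```

With the files saved earlier, the same function could be called on the contents of complex.pdb and top_pose.sdf. Residues known to line the expected pocket should appear in the list if the pose is plausible.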
Advanced Options
Using a Custom PDB File
If your protein is not in the RCSB PDB – for example, a homology model, an AlphaFold prediction, or an ESMFold output – pass the PDB file contents directly:
# Read your custom protein structure
with open("my_model.pdb", "r") as f:
    pdb_content = f.read()
response = requests.post(f"{BASE}/docking/diffdock", headers=headers, json={
    "ligand_smiles": "CC(=O)Oc1ccccc1C(=O)O",  # Aspirin
    "protein_pdb": pdb_content,                # Raw PDB text
    "num_poses": 10
}).json()

Controlling Pose Diversity
DiffDock generates diverse poses by running multiple independent diffusion trajectories. If you find that the returned poses are too similar (clustered at one site), increasing num_poses to 20 or more can help explore alternative binding pockets. Keep in mind that more poses increase job time roughly linearly.
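One way to judge whether poses are clustered at a single site is to group them by the distance between ligand centroids. The sketch below assumes you already have each pose's atom coordinates as a list of (x, y, z) tuples, for example parsed from its ligand_sdf field. The greedy grouping scheme and the 5 Å cutoff are our own illustrative choices, not something DiffDock or the API prescribes.

```python
import math

def centroid(coords):
    """Mean position of a list of (x, y, z) tuples."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))

def group_poses_by_site(pose_coords, cutoff=5.0):
    """Greedy grouping: a pose joins the first group whose founding pose has a
    centroid within `cutoff` Å; otherwise it starts a new group."""
    groups = []  # list of (founder_centroid, [pose indices])
    for index, coords in enumerate(pose_coords):
        center = centroid(coords)
        for founder, members in groups:
            if math.dist(center, founder) <= cutoff:
                members.append(index)
                break
        else:
            groups.append((center, [index]))
    return [members for _, members in groups]

# Made-up coordinates: two poses near the origin, one about 30 Å away
demo_poses = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    [(0.5, 1.0, 0.0), (2.0, 1.0, 0.0)],
    [(30.0, 0.0, 0.0), (31.5, 0.0, 0.0)],
]
print(group_poses_by_site(demo_poses))
```

If every pose lands in one group, raising num_poses is a reasonable next step; several groups suggest the model is already proposing alternative sites.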
Common Pitfalls and Tips
- Large proteins slow things down: DiffDock processes the full protein surface. If your structure has many chains or is a large complex, consider extracting just the relevant chain or domain to speed up inference.
- Validate SMILES first: Invalid SMILES strings will cause the job to fail. Use SciRouter's molecular properties endpoint to validate and get a canonical SMILES before docking.
- Confidence is relative, not absolute: Compare confidence scores within a single docking run (same protein, different poses) rather than across different protein targets.
- Remove water and ligands from PDB: If using a crystal structure, the PDB file may contain crystallographic waters and co-crystallized ligands. For best results, clean the PDB to contain only protein atoms before submitting.
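The cleanup and chain-extraction tips above can be approximated with a few lines of standard-library Python. This is a minimal sketch under stated assumptions: it keeps only ATOM records (plus TER and END), so everything stored as HETATM is dropped, including waters, co-crystallized ligands, metal ions, cofactors, and modified residues such as selenomethionine. The function name and the demo records are hypothetical.

```python
def clean_pdb(pdb_text, keep_chains=None):
    """Keep protein ATOM records (plus TER/END), optionally for selected chains only."""
    kept = []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record == "ATOM":
            # Column 22 (index 21) holds the chain identifier
            if keep_chains is None or line[21] in keep_chains:
                kept.append(line)
        elif record == "TER":
            # Keep chain terminators unless they belong to a removed chain
            if keep_chains is None or len(line) <= 21 or line[21] in keep_chains:
                kept.append(line)
        elif record == "END":
            kept.append(line)
    return "\n".join(kept) + "\n"

def _record(kind, serial, name, resname, chain, resnum):
    """Format one coordinate line in PDB fixed columns (demo data only)."""
    return (f"{kind:<6s}{serial:5d} {name:<4s} {resname:>3s} {chain}{resnum:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}")

demo = "\n".join([
    _record("ATOM", 1, "CA", "ALA", "A", 1),
    _record("ATOM", 2, "CA", "GLY", "B", 1),
    _record("HETATM", 3, "O", "HOH", "A", 101),   # crystallographic water
    _record("HETATM", 4, "C1", "LIG", "A", 201),  # placeholder ligand record
    "END",
])
cleaned = clean_pdb(demo)
chain_a_only = clean_pdb(demo, keep_chains={"A"})
print(cleaned)
```

If your target depends on a bound metal or cofactor, filter more selectively than this sketch does, since removing those atoms changes the binding site.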
What's Next
You now know how to run AI-powered molecular docking with DiffDock through SciRouter's API. From here, you might want to:
- Screen a compound library by combining DiffDock with AutoDock Vina in a staged pipeline
- Calculate molecular properties for your top hits to assess drug-likeness
- Fold a protein target from sequence using ESMFold, then dock against the predicted structure
All of these tools are available through the same API key. Sign up for free to get started, or explore the DiffDock tool page for full parameter documentation.