What Is Virtual Screening?
Drug discovery begins with a target – a protein implicated in disease – and a question: which small molecules bind to it? The brute-force answer is high-throughput screening (HTS): physically testing 500,000 compounds in robotic assays at a cost of $500,000 to $2,000,000 per campaign. The computational answer is virtual screening: using algorithms to predict which compounds are most likely to bind, then testing only the top candidates experimentally.
Virtual screening reduces the experimental burden by 100 to 1,000-fold. Instead of testing 500,000 compounds, you computationally evaluate them and synthesize only the top 50 to 500 for experimental validation. Hit rates from virtual screening (5 to 20% of tested compounds show activity) dramatically outperform random screening (0.01 to 0.1%), because the computational filters concentrate your experimental resources on the most promising chemical matter.
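The arithmetic behind those numbers is worth making concrete. A quick back-of-the-envelope comparison using the hit-rate ranges quoted above (illustrative figures, not a benchmark):

```python
# Compare random HTS against a virtual-screening funnel, using the
# hit-rate ranges quoted above (illustrative numbers only).

def expected_hits(n_tested, hit_rate):
    """Expected number of active compounds among n_tested."""
    return n_tested * hit_rate

# Random HTS: 500,000 compounds at a 0.05% hit rate
hts_hits = expected_hits(500_000, 0.0005)

# Virtual screening: test only the top 500 at a 10% hit rate
vs_hits = expected_hits(500, 0.10)

print(f"HTS: {hts_hits:.0f} hits from 500,000 experiments")
print(f"VS:  {vs_hits:.0f} hits from 500 experiments")
# Comparable hit counts, with 1,000-fold fewer experiments
```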
There are three fundamental approaches to virtual screening, each with different requirements and strengths.
High-Throughput Screening (HTS)
HTS is the experimental gold standard: physical testing of large compound libraries against the target in miniaturized assays (384-well or 1536-well plates). It is unbiased – it can discover scaffolds that no computational method would predict – but expensive, slow (3 to 6 months per campaign), and limited by the physical compound collection. Most pharmaceutical companies maintain libraries of 1 to 3 million compounds, a tiny fraction of drug-like chemical space (estimated at 10^60 molecules).
Structure-Based Virtual Screening (SBVS)
SBVS uses the three-dimensional structure of the target protein to computationally dock candidate molecules into the binding site. Each molecule is placed in the binding pocket in multiple orientations and conformations, and a scoring function estimates the binding affinity. Compounds with favorable docking scores are predicted to be binders. SBVS requires a protein structure (from X-ray crystallography, cryo-EM, or computational prediction) but can discover entirely novel scaffolds that have no structural similarity to known actives.
Ligand-Based Virtual Screening (LBVS)
LBVS uses the properties of known active molecules (ligands) to find similar compounds in large libraries. If you know that compound X binds the target with 100 nM affinity, you search for molecules with similar fingerprints, shape, or pharmacophore features. LBVS does not require a protein structure and is computationally cheaper than docking, but it is biased toward scaffolds similar to known actives and may miss novel chemotypes.
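In practice, "similar fingerprints" usually means Tanimoto similarity over bit-vector fingerprints. The metric itself is simple; here is a minimal pure-Python sketch with fingerprints represented as sets of on-bit indices (a real pipeline would compute Morgan or ECFP fingerprints with a cheminformatics toolkit such as RDKit, which this toy example only imitates):

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprints,
    each represented as a set of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

# Toy fingerprints: a known active and two library compounds
active = {1, 4, 7, 9, 12}
close_analog = {1, 4, 7, 9, 15}  # shares 4 of 6 distinct bits
unrelated = {2, 3, 5}            # shares none

print(tanimoto(active, close_analog))  # 4/6 ~ 0.667 -> likely analog
print(tanimoto(active, unrelated))     # 0.0 -> discard
```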
Building Your Compound Library
The starting point for any virtual screening campaign is a compound library. The quality and diversity of your library directly determine the quality of your hits. Here are the major sources.
Public Databases
- ZINC20/ZINC22: Over 1.4 billion purchasable compounds with SMILES and vendor information. The "in-stock" subset (~10 million compounds) contains molecules available for immediate purchase from vendors like Enamine, ChemBridge, and MolPort. Ideal for hit discovery campaigns where you want to test purchased compounds rather than synthesizing new ones.
- Enamine REAL: Over 6 billion make-on-demand compounds that can be synthesized in 2 to 4 weeks using validated one-step or two-step reactions from available building blocks. The largest accessible chemical space database, but compounds must be custom-synthesized after virtual screening identifies them as hits.
- ChEMBL: Over 2.4 million compounds with measured bioactivity data. Useful for assembling focused libraries around specific target classes (for example, all compounds with measured activity against kinases).
- PubChem: Over 110 million compounds from academic screening campaigns. Broader coverage than ChEMBL but with more variable data quality.
Preparing a 1,000-Compound Library
For this tutorial, we will build a focused library of 1,000 compounds targeting EGFR (Epidermal Growth Factor Receptor), a validated oncology target with approved drugs (erlotinib, gefitinib, osimertinib). We start from a ChEMBL extract of known EGFR-active scaffolds and generate diverse analogs.
import os, requests, time
API_KEY = os.environ["SCIROUTER_API_KEY"]
BASE = "https://api.scirouter.ai/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
# Seed compounds: known EGFR inhibitor scaffolds
seed_compounds = [
    # Erlotinib (quinazoline with bis-methoxyethoxy side chains)
    "COCCOc1cc2ncnc(Nc3cccc(C#C)c3)c2cc1OCCOC",
    # Gefitinib-like (4-anilinoquinazoline)
    "COc1cc2ncnc(Nc3ccc(F)c(Cl)c3)c2cc1OC",
    # Lapatinib-like (4-anilinoquinazoline with furanylmethyl side chain)
    "CS(=O)(=O)CCNCc1ccc(-c2ccc3ncnc(Nc4ccc(OCc5cccc(F)c5)c(Cl)c4)c3c2)o1",
]
# Generate diverse analogs from each seed
library = []
for seed in seed_compounds:
    job = requests.post(f"{BASE}/chemistry/generate", headers=HEADERS, json={
        "model": "reinvent4",
        "num_molecules": 350,
        "objectives": {
            "similarity": {
                "weight": 0.5,
                "reference_smiles": seed,
                "min_similarity": 0.2,
                "max_similarity": 0.7,
            },
            "drug_likeness": {"weight": 1.0, "method": "qed"},
            "synthetic_accessibility": {"weight": 0.6, "max_sa_score": 5.5},
        },
    }).json()
    while True:
        result = requests.get(
            f"{BASE}/chemistry/generate/{job['job_id']}", headers=HEADERS
        ).json()
        if result["status"] in ("completed", "failed"):
            break
        time.sleep(10)
    if result["status"] == "completed":
        library.extend(result["molecules"])
# Deduplicate by SMILES
seen = set()
unique_library = []
for mol in library:
    if mol["smiles"] not in seen:
        seen.add(mol["smiles"])
        unique_library.append(mol)
# Trim to exactly 1,000
library = unique_library[:1000]
print(f"Library size: {len(library)} unique compounds from "
      f"{len(seed_compounds)} seed scaffolds")

Target Preparation with Pocket Detection
Structure-based virtual screening requires a 3D protein structure with a defined binding site. For EGFR, multiple crystal structures are available in the PDB (for example, PDB 1M17 with erlotinib bound). If no experimental structure is available, you can predict one using ESMFold or AlphaFold2 via the SciRouter API.
Even with a known structure, binding pocket detection is valuable for identifying the exact coordinates and residues that define the docking region. SciRouter's pocket detection endpoint analyzes the protein surface geometry to identify druggable cavities and their properties.
# EGFR kinase domain sequence (for structure prediction if needed)
egfr_sequence = (
    "FKKIKVLGSGAFGTVYKGLWIPEGEKVKIPVAIKELREATSPKANKEILDEAY"
    "VMASVDNPHVCRLLGICLTSTVQLITQLMPFGCLLDYVREHKDNIGSQYLLNW"
    "CVQIAKGMNYLEDRRLVHRDLAARNVLVKTPQHVKITDFGLAKLLGAEEKEYHAE"
    "GGKVPIKWMALESILHRIYTHQSDVWSYGVTVWELMTFGSKPYDGIPASEISSILE"
    "KGERLPQPPICTIDVYMIMVKCWMIDADSRPKFRELIIEFSKMARDPQRYLVIQGDERM"
)
# If no crystal structure, predict with ESMFold
fold_job = requests.post(f"{BASE}/proteins/fold", headers=HEADERS,
                         json={"sequence": egfr_sequence, "model": "esmfold"}).json()
# Poll for completion
while True:
    fold_result = requests.get(
        f"{BASE}/proteins/fold/{fold_job['job_id']}", headers=HEADERS
    ).json()
    if fold_result["status"] in ("completed", "failed"):
        break
    time.sleep(5)
pdb_string = fold_result["pdb"]
print(f"Structure predicted: {len(egfr_sequence)} residues")
print(f"Average pLDDT: {fold_result['metrics']['plddt_mean']:.1f}")
# Detect binding pockets
pockets = requests.post(f"{BASE}/proteins/pocket-detection",
                        headers=HEADERS, json={"pdb": pdb_string}).json()
print(f"\nDetected {len(pockets['pockets'])} pockets:")
for i, pocket in enumerate(pockets["pockets"]):
    print(f"  Pocket {i+1}: Volume={pocket['volume']:.0f} A^3, "
          f"Druggability={pocket['druggability_score']:.2f}, "
          f"Residues={len(pocket['residues'])}")
# Use the most druggable pocket for docking
best_pocket = max(pockets["pockets"], key=lambda p: p["druggability_score"])
print(f"\nSelected pocket: Volume={best_pocket['volume']:.0f} A^3, "
      f"Center=({best_pocket['center_x']:.1f}, "
      f"{best_pocket['center_y']:.1f}, "
      f"{best_pocket['center_z']:.1f})")

Running the Batch Screen
With the library prepared and the target structure ready, run the virtual screening pipeline. The strategy is a multi-stage funnel: first, filter by drug-like properties (fast, CPU-based); then dock the filtered set against the target (slower, GPU-based); finally, apply ADMET filters to the top docking hits.
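The funnel pattern itself is worth factoring out before walking through the stages: each stage is just a predicate applied to the survivors of the previous stage. A minimal sketch, with hypothetical property values and thresholds standing in for the real filters used below:

```python
def run_funnel(compounds, stages):
    """Apply a sequence of (name, predicate) stages, reporting attrition."""
    survivors = compounds
    for name, predicate in stages:
        survivors = [c for c in survivors if predicate(c)]
        print(f"{name}: {len(survivors)} remaining")
    return survivors

# Toy compounds as dicts of precomputed values (hypothetical numbers)
compounds = [
    {"mw": 450, "docking": 0.85, "herg": "low"},
    {"mw": 720, "docking": 0.90, "herg": "low"},   # fails property stage
    {"mw": 380, "docking": 0.40, "herg": "low"},   # fails docking stage
    {"mw": 410, "docking": 0.80, "herg": "high"},  # fails ADMET stage
]
stages = [
    ("Property filter", lambda c: c["mw"] <= 600),
    ("Docking filter",  lambda c: c["docking"] >= 0.7),
    ("ADMET filter",    lambda c: c["herg"] != "high"),
]
hits = run_funnel(compounds, stages)
# Only the first compound survives all three stages
```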
Stage 1: Property Pre-Filter
Remove compounds that violate basic drug-likeness criteria before investing computational resources in docking. This step is fast (sub-second per compound) and typically eliminates 10 to 30% of the library.
# Stage 1: Property pre-filter
filtered_library = []
for mol in library:
    props = requests.post(f"{BASE}/chemistry/properties",
                          headers=HEADERS, json={"smiles": mol["smiles"]}).json()
    # Lipinski-like filters (relaxed for kinase inhibitors)
    passes = True
    if props["molecular_weight"] > 600:
        passes = False
    if props["logp"] > 6.0 or props["logp"] < -1.0:
        passes = False
    if props["h_bond_donors"] > 5:
        passes = False
    if props["h_bond_acceptors"] > 10:
        passes = False
    if props["rotatable_bonds"] > 12:
        passes = False
    if passes:
        mol["properties"] = props
        filtered_library.append(mol)
print(f"Stage 1: {len(filtered_library)}/{len(library)} pass "
      f"property filters ({len(filtered_library)/len(library)*100:.0f}%)")

Stage 2: Molecular Docking with DiffDock
Dock the filtered compounds against the EGFR binding pocket using DiffDock. DiffDock is a diffusion-based molecular docking method that predicts binding poses and confidence scores without requiring a predefined binding box. For large screening campaigns, submit molecules in parallel using concurrent API calls.
from concurrent.futures import ThreadPoolExecutor, as_completed
def dock_molecule(mol):
    """Dock a single molecule against EGFR and return the result."""
    try:
        dock_result = requests.post(f"{BASE}/docking/diffdock",
                                    headers=HEADERS, json={
                                        "protein_pdb": pdb_string,
                                        "ligand_smiles": mol["smiles"],
                                        "num_poses": 5,
                                    }, timeout=120).json()
        # Poll for the async result
        job_id = dock_result.get("job_id")
        if job_id:
            for _ in range(30):
                status = requests.get(
                    f"{BASE}/docking/diffdock/{job_id}",
                    headers=HEADERS
                ).json()
                if status["status"] == "completed":
                    return {
                        "smiles": mol["smiles"],
                        "properties": mol["properties"],
                        "top_confidence": status["poses"][0]["confidence"],
                        "top_pose_score": status["poses"][0]["score"],
                        "num_poses": len(status["poses"]),
                    }
                if status["status"] == "failed":
                    return None
                time.sleep(5)
        return None
    except Exception:
        return None
# Run docking in parallel (10 concurrent workers)
docking_results = []
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(dock_molecule, mol): mol
               for mol in filtered_library}
    completed = 0
    for future in as_completed(futures):
        completed += 1
        result = future.result()
        if result:
            docking_results.append(result)
        if completed % 50 == 0:
            print(f"Docked {completed}/{len(filtered_library)} "
                  f"({len(docking_results)} successful)")
# Sort by docking confidence
docking_results.sort(key=lambda x: x["top_confidence"], reverse=True)
print(f"\nStage 2: {len(docking_results)} docking results")
print("Top 10 docking scores:")
for i, r in enumerate(docking_results[:10]):
    print(f"  {i+1}. Confidence: {r['top_confidence']:.3f} | "
          f"Score: {r['top_pose_score']:.2f} | "
          f"{r['smiles'][:50]}")

Stage 3: Consensus Scoring
To improve hit rates, run a second docking method (AutoDock Vina) on the top 100 DiffDock hits and select compounds that score well by both methods. Consensus scoring reduces false positives because different methods have different failure modes – a compound that happens to score well due to a scoring function artifact in one method is unlikely to score well in the other.
# Take top 100 from DiffDock for consensus scoring
top_100 = docking_results[:100]
def vina_dock(mol):
    """Dock with AutoDock Vina for consensus scoring."""
    try:
        result = requests.post(f"{BASE}/docking/autodock-vina",
                               headers=HEADERS, json={
                                   "protein_pdb": pdb_string,
                                   "ligand_smiles": mol["smiles"],
                                   "center_x": best_pocket["center_x"],
                                   "center_y": best_pocket["center_y"],
                                   "center_z": best_pocket["center_z"],
                                   "box_size": 25.0,
                                   "exhaustiveness": 16,
                               }, timeout=120).json()
        job_id = result.get("job_id")
        if job_id:
            for _ in range(20):
                status = requests.get(
                    f"{BASE}/docking/autodock-vina/{job_id}",
                    headers=HEADERS
                ).json()
                if status["status"] == "completed":
                    mol["vina_score"] = status["poses"][0]["affinity_kcal"]
                    return mol
                if status["status"] == "failed":
                    return None
                time.sleep(5)
        return None
    except Exception:
        return None
# Run Vina docking in parallel
consensus_results = []
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {executor.submit(vina_dock, mol): mol for mol in top_100}
    for future in as_completed(futures):
        result = future.result()
        if result and "vina_score" in result:
            consensus_results.append(result)
print(f"Consensus scoring: {len(consensus_results)} compounds docked "
      f"with both DiffDock and Vina")
# Rank by consensus: normalize both scores and average
if consensus_results:
    max_conf = max(r["top_confidence"] for r in consensus_results)
    min_vina = min(r["vina_score"] for r in consensus_results)
    max_vina = max(r["vina_score"] for r in consensus_results)
    for r in consensus_results:
        norm_diffdock = r["top_confidence"] / max_conf
        # Vina scores are negative (more negative = better)
        norm_vina = (r["vina_score"] - max_vina) / (min_vina - max_vina)
        r["consensus_score"] = (norm_diffdock + norm_vina) / 2
consensus_results.sort(key=lambda x: x["consensus_score"], reverse=True)
print("\nTop 20 by consensus score:")
for i, r in enumerate(consensus_results[:20]):
    print(f"  {i+1}. Consensus: {r['consensus_score']:.3f} | "
          f"DiffDock: {r['top_confidence']:.3f} | "
          f"Vina: {r['vina_score']:.1f} kcal/mol")
    print(f"     {r['smiles'][:60]}")

Post-Screening ADMET Filter
The final stage applies ADMET prediction to the top consensus hits. A compound with excellent docking scores but poor pharmacokinetics or safety liabilities is not a viable drug candidate. ADMET filtering after docking (rather than before) ensures that you do not prematurely eliminate compounds that might have acceptable ADMET profiles despite borderline property values.
# ADMET filter on top 50 consensus hits
top_50 = consensus_results[:50]
final_candidates = []
for mol in top_50:
    admet = requests.post(f"{BASE}/chemistry/admet",
                          headers=HEADERS, json={"smiles": mol["smiles"]}).json()
    synth = requests.post(f"{BASE}/chemistry/synthesis-check",
                          headers=HEADERS, json={"smiles": mol["smiles"]}).json()
    # ADMET hard filters
    passes = True
    if admet["herg_inhibition"] == "high":
        passes = False
    if admet["hepatotoxicity"] == "high":
        passes = False
    if synth["sa_score"] > 6.0:
        passes = False
    if passes:
        mol["admet"] = admet
        mol["sa_score"] = synth["sa_score"]
        final_candidates.append(mol)
print(f"Stage 4: {len(final_candidates)}/{len(top_50)} pass ADMET filters")
print("\n=== Final Candidates for Experimental Validation ===")
for i, c in enumerate(final_candidates[:15]):
    print(f"\n{i+1}. {c['smiles']}")
    print(f"   Consensus: {c['consensus_score']:.3f} | "
          f"Vina: {c['vina_score']:.1f} kcal/mol")
    print(f"   MW: {c['properties']['molecular_weight']:.0f} | "
          f"LogP: {c['properties']['logp']:.1f} | "
          f"SA: {c['sa_score']:.1f}")
    print(f"   hERG: {c['admet']['herg_inhibition']} | "
          f"Hepatotox: {c['admet']['hepatotoxicity']} | "
          f"Oral F: {c['admet']['oral_bioavailability']}")

Case Study: EGFR Kinase Virtual Screen
Let us walk through the expected results for an EGFR kinase virtual screening campaign using the pipeline described above.
Starting library: 1,000 compounds generated from three seed scaffolds (erlotinib-, gefitinib-, and lapatinib-like 4-anilinoquinazolines) with Tanimoto similarity 0.2 to 0.7 to the seeds. This produces a focused but diverse library spanning the known EGFR inhibitor chemical space plus novel variations.
Stage 1 (property filter): 847 of 1,000 compounds pass Lipinski-relaxed filters (15% attrition). The failures are mostly high-molecular-weight analogs from the lapatinib seed with excessive rotatable bonds or LogP values.
Stage 2 (DiffDock): 847 compounds are docked against the EGFR kinase domain (PDB 1M17 or predicted structure). The top 100 by confidence score include compounds from all three seed scaffolds, with the quinazoline derivatives showing the highest average confidence (they most closely resemble erlotinib, which crystallizes well in the EGFR pocket).
Stage 3 (consensus scoring): The top 100 DiffDock hits are re-docked with AutoDock Vina. Of these, 73 receive Vina scores better than -8.0 kcal/mol (a typical threshold for kinase inhibitors). The top 20 by consensus score all have DiffDock confidence above 0.7 and Vina affinity below -9.5 kcal/mol.
Stage 4 (ADMET filter): Of the top 50 consensus hits, 31 pass all ADMET hard filters (no high hERG risk, no high hepatotoxicity, SA score below 6.0). The most common failure mode is hERG liability (11 compounds) driven by high lipophilicity in the lapatinib-derived analogs. The final 31 candidates are ready for procurement or synthesis and experimental testing in an EGFR kinase assay.
Expected hit rate: Based on benchmark studies, 15 to 25% of the final candidates (5 to 8 compounds) would be expected to show measurable EGFR inhibition (IC50 below 10 micromolar) in a biochemical kinase assay. Of these, 1 to 3 would show sub-micromolar potency suitable for lead optimization.
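These numbers correspond to a large enrichment factor over random screening. The enrichment factor is simply the screen's hit rate divided by the baseline hit rate; plugging in the figures quoted earlier in this tutorial:

```python
def enrichment_factor(screen_hit_rate, baseline_hit_rate):
    """Fold-enrichment of a screen over random compound selection."""
    return screen_hit_rate / baseline_hit_rate

# Virtual screen: ~20% hit rate; random HTS baseline: ~0.05%
ef = enrichment_factor(0.20, 0.0005)
print(f"Enrichment factor: {ef:.0f}x")  # 400x
```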
Optimizing Your Screening Campaign
The pipeline described above is a starting point. Here are practical optimizations for different scenarios.
- Increase throughput with batch endpoints: SciRouter's property and ADMET endpoints accept batch requests of up to 1,000 SMILES per call. Use batch mode instead of individual requests to reduce API overhead. A batch property calculation for 1,000 compounds completes in under 30 seconds.
- Use pocket-focused docking for speed: If you know the binding pocket location (from a co-crystal structure or pocket detection), specify the pocket center and box size in the Vina endpoint. This reduces the search space and speeds up docking by 3 to 5-fold compared to blind docking.
- Iterative enrichment: After the first screen, analyze the SAR of the top hits. If a specific scaffold or substitution pattern is over-represented in the hits, generate 500 focused analogs of that scaffold and run a second screening round. This iterative enrichment narrows the chemical space toward the most productive region.
- Fragment-based screening: For novel targets without known actives, start with a fragment library (MW 150 to 300, 1,000 to 5,000 fragments). Fragments have simpler binding modes and higher hit rates (typically 3 to 5% in experimental screens). Dock fragments, identify the top binders, and then grow them into drug-like molecules using the molecule generator.
- Selectivity counter-screens: After identifying hits for your target, dock the same compounds against known off-targets (related family members, hERG channel structure). Select compounds that dock well to the target but poorly to off-targets. This builds selectivity into the screening campaign from the start.
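The batch-endpoint tip above reduces to a simple chunking pattern. The sketch below keeps the HTTP transport as an injected callable so the batching logic stands alone; the batch payload shape (`{"smiles": [...]}`) is an assumption modeled on the single-compound calls used earlier, so check the API reference for the actual contract:

```python
def chunked(items, size):
    """Yield successive chunks of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def batch_properties(smiles_list, post_json, chunk_size=1000):
    """Fetch properties for many SMILES in batches of up to 1,000 per call.

    `post_json` is a callable taking the request payload and returning the
    parsed response list, e.g.
        lambda payload: requests.post(f"{BASE}/chemistry/properties",
                                      headers=HEADERS, json=payload).json()
    """
    results = []
    for chunk in chunked(smiles_list, chunk_size):
        results.extend(post_json({"smiles": chunk}))
    return results

# Example with a stand-in transport: 2,500 SMILES -> 3 calls, not 2,500
calls = []
fake_post = lambda payload: (calls.append(len(payload["smiles"])) or
                             [{"smiles": s} for s in payload["smiles"]])
out = batch_properties([f"C{'C' * i}" for i in range(2500)], fake_post)
print(calls)     # [1000, 1000, 500]
print(len(out))  # 2500
```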
Putting It All Together
Virtual screening is the computational foundation of modern drug discovery. The multi-stage funnel described in this tutorial – property filter, molecular docking with DiffDock, consensus scoring with AutoDock Vina, and ADMET post-filtering – transforms a library of 1,000 diverse compounds into a shortlist of 15 to 30 high-confidence candidates for experimental validation.
SciRouter's API provides every component of this pipeline through a unified interface: DiffDock and AutoDock Vina for molecular docking, pocket detection for binding site identification, and ADMET-AI for safety and pharmacokinetics prediction. The entire campaign – from library generation to final candidate selection – can be run programmatically in a single Python script, completing in 10 to 30 minutes depending on library size and API tier.
The traditional alternative – setting up DiffDock on a local GPU, configuring AutoDock Vina, running RDKit for property calculations, and integrating ADMET models – takes days of infrastructure setup before you can run your first screen. With SciRouter, the first API call can go out in under a minute. The infrastructure is the API; the science is your focus.