What Is Virtual Screening?
Drug discovery starts with a simple but expensive problem: you have a protein target implicated in a disease, and you need to find small molecules that bind to it. The brute-force approach – synthesizing and testing every candidate in the lab – is prohibitively slow and costly. A typical high-throughput screen tests hundreds of thousands of compounds over weeks at a cost of hundreds of thousands of dollars.
Virtual screening flips the economics. Instead of testing molecules physically, you evaluate them computationally – filtering by drug-likeness, predicting binding affinity through molecular docking, and assessing safety with ADMET models. The result is a short list of high-probability candidates that you then validate experimentally. A well-designed virtual screen can achieve hit rates of 5–20%, compared to roughly 0.01–0.1% for random experimental screening.
This tutorial walks you through a complete virtual screening pipeline using SciRouter's API. By the end, you will have working Python code that screens 1,000 molecules through property filters, molecular docking, and ADMET prediction – all in about 10 minutes.
Two Approaches: Ligand-Based vs Structure-Based
Before diving into code, it helps to understand the two main strategies for virtual screening:
Ligand-Based Screening
If you know molecules that are active against your target, you can use their properties to find similar compounds. This approach compares molecular fingerprints, physicochemical properties, and pharmacophore features to rank candidates by similarity to known actives. It does not require a 3D protein structure.
- Strengths: Fast, simple, works without a protein structure
- Weaknesses: Biased toward known chemical scaffolds, may miss novel chemotypes
- SciRouter tools: Molecular Properties for filtering, similarity search for ranking
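As a rough illustration of similarity ranking, the sketch below scores candidates against a known active using the Tanimoto coefficient on fingerprint bit sets. The fingerprints are hand-written toy sets of "on" bit indices, and `tanimoto` and `rank_by_similarity` are hypothetical helper names; in a real screen the bits would come from a cheminformatics toolkit or a similarity endpoint.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def rank_by_similarity(reference_fp, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(reference_fp, fp)) for name, fp in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data: bit indices are made up for illustration only
known_active = {1, 4, 7, 9, 12}
candidates = {
    "cand_A": {1, 4, 7, 9, 13},
    "cand_B": {2, 3, 5},
    "cand_C": {1, 4, 7, 9, 12},
}

ranking = rank_by_similarity(known_active, candidates)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Candidates that share most bits with the known active rise to the top, which is also why this approach tends to favor familiar scaffolds.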
Structure-Based Screening
If you have a 3D structure of the target protein (experimental or predicted), you can computationally dock each candidate molecule into the binding site and score the predicted interaction. This can discover entirely novel scaffolds because it evaluates the physical complementarity between molecule and target.
- Strengths: Can find novel scaffolds, physically motivated scoring
- Weaknesses: Requires a protein structure, more computationally expensive
- SciRouter tools: DiffDock for AI-powered docking
The Pipeline: Five Steps to Screen 1,000 Molecules
Our screening funnel has five stages, each progressively more expensive but more informative:
- Step 1: Prepare your molecule library as SMILES strings
- Step 2: Filter by drug-likeness using molecular properties
- Step 3: Dock surviving candidates against the target protein
- Step 4: Rank by predicted binding affinity
- Step 5: Run ADMET prediction on top hits
Each stage eliminates compounds, so by the time you reach the expensive docking step, your library is already significantly reduced. This cascading filter design is what makes the entire pipeline fast.
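To make the funnel concrete, the following back-of-the-envelope sketch counts how many compounds, and therefore API calls, each stage sees. The 40% rejection rate is an assumed midpoint of the 30–50% range this tutorial quotes for the drug-likeness filter, the sketch assumes every docking call succeeds, and `funnel_counts` is a hypothetical helper rather than part of any API.

```python
def funnel_counts(library_size, property_reject_rate=0.40, top_n=50):
    """Estimate how many compounds reach each stage of the screening funnel."""
    n_properties = library_size                                   # every compound is property-checked
    n_docking = round(library_size * (1 - property_reject_rate))  # survivors of the filter
    n_admet = min(top_n, n_docking)                               # only the top-ranked hits
    return {"properties": n_properties, "docking": n_docking, "admet": n_admet}


counts = funnel_counts(1000)
total_calls = sum(counts.values())
print(counts)
print(f"Total API calls: {total_calls}")
```

Under these assumptions, the expensive docking stage handles about 600 of the 1,000 compounds, and ADMET prediction only 50.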
Step 1: Prepare Your Molecule Library
Your starting point is a list of SMILES strings representing the molecules you want to screen. These can come from public databases like ZINC (over 200 million purchasable compounds), ChEMBL (bioactive molecules with assay data), or your own proprietary library.
import csv

def load_library(filepath):
    """Load SMILES from a CSV file with columns: name, smiles"""
    molecules = []
    with open(filepath) as f:
        reader = csv.DictReader(f)
        for row in reader:
            molecules.append({
                "name": row["name"],
                "smiles": row["smiles"]
            })
    return molecules

# Load your library
library = load_library("compound_library.csv")
print(f"Loaded {len(library)} molecules")

If your compounds are stored in another format, use the /v1/chemistry/convert endpoint to convert to SMILES first. For details on SMILES syntax, see our DiffDock tutorial, which covers input preparation.
Step 2: Filter by Drug-Likeness
The first filter eliminates molecules that are unlikely to become oral drugs. We use Lipinski's Rule of Five and related property filters via the molecular properties endpoint. This step is fast (sub-second per molecule) and typically eliminates 30–50% of a random library.
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

API_KEY = "sk-sci-your-api-key"
BASE = "https://api.scirouter.ai/v1"

def check_properties(molecule):
    """Calculate properties and check drug-likeness."""
    resp = requests.post(
        f"{BASE}/chemistry/properties",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"smiles": molecule["smiles"]}
    )
    if resp.status_code != 200:
        return None
    props = resp.json()
    molecule["properties"] = props
    # Lipinski's Rule of Five + extra filters
    passes = (
        props["molecular_weight"] <= 500
        and props["logp"] <= 5
        and props["h_bond_donors"] <= 5
        and props["h_bond_acceptors"] <= 10
        and props["tpsa"] <= 140
        and props["rotatable_bonds"] <= 10
    )
    molecule["passes_druglikeness"] = passes
    return molecule

# Screen in parallel (20 concurrent requests)
druglike = []
with ThreadPoolExecutor(max_workers=20) as executor:
    futures = {
        executor.submit(check_properties, mol): mol
        for mol in library
    }
    for future in as_completed(futures):
        result = future.result()
        if result and result["passes_druglikeness"]:
            druglike.append(result)

print(f"Drug-like compounds: {len(druglike)}/{len(library)} "
      f"({100*len(druglike)/len(library):.0f}%)")

With 20 concurrent threads, this step processes 1,000 molecules in under a minute. The property calculation is CPU-bound and returns in milliseconds per compound.
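The filter above rejects a compound on any single failed criterion. Lipinski's rule as commonly stated is more lenient: a compound may violate one of the four core criteria and still count as drug-like. A minimal sketch of that variant follows, assuming the same property keys the code above reads from the response (`molecular_weight`, `logp`, `h_bond_donors`, `h_bond_acceptors`); `lipinski_violations` and `passes_ro5` are hypothetical helper names, and the example values are invented.

```python
def lipinski_violations(props):
    """Return the names of the Rule of Five criteria that a compound violates."""
    checks = {
        "molecular_weight": props["molecular_weight"] > 500,
        "logp": props["logp"] > 5,
        "h_bond_donors": props["h_bond_donors"] > 5,
        "h_bond_acceptors": props["h_bond_acceptors"] > 10,
    }
    return [name for name, violated in checks.items() if violated]


def passes_ro5(props, max_violations=1):
    """Accept compounds with at most max_violations Rule of Five violations."""
    return len(lipinski_violations(props)) <= max_violations


# Illustrative values only, not output from any API
example = {
    "molecular_weight": 530.0,
    "logp": 4.2,
    "h_bond_donors": 2,
    "h_bond_acceptors": 8,
}
violations = lipinski_violations(example)
print(violations, passes_ro5(example))
```

Setting `max_violations=0` reproduces the stricter behavior for these four properties, which trades a smaller docking workload for the risk of discarding borderline compounds.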
Step 3: Dock Candidates Against the Target
Now we dock the surviving compounds against our protein target using DiffDock. Unlike traditional docking tools that require you to define a search box, DiffDock uses a diffusion model to explore the entire protein surface – finding binding sites automatically.
def dock_molecule(molecule, protein_pdb_path):
    """Dock a single molecule against the target protein."""
    with open(protein_pdb_path) as f:
        pdb_content = f.read()
    resp = requests.post(
        f"{BASE}/docking/diffdock",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "protein_pdb": pdb_content,
            "ligand_smiles": molecule["smiles"],
            "num_poses": 5
        }
    )
    if resp.status_code != 200:
        return None
    result = resp.json()
    molecule["docking"] = result
    # Best pose score. This tutorial treats lower as better; the open-source
    # DiffDock model reports a confidence where higher means a more reliable
    # pose, so check which convention the response uses before ranking.
    molecule["best_score"] = min(
        pose["confidence"] for pose in result["poses"]
    )
    return molecule

# Dock all drug-like compounds (10 concurrent to manage GPU load)
target_pdb = "target_protein.pdb"
docked = []
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {
        executor.submit(dock_molecule, mol, target_pdb): mol
        for mol in druglike
    }
    for future in as_completed(futures):
        result = future.result()
        if result is not None:
            docked.append(result)

print(f"Successfully docked: {len(docked)}/{len(druglike)}")

Adjust max_workers based on your API tier rate limits.
Step 4: Rank by Binding Affinity
With docking complete, we rank all compounds by their predicted binding score and select the top candidates for the final ADMET assessment:
# Sort by best docking score (lower confidence = better binding)
docked.sort(key=lambda x: x["best_score"])

# Take top 50 candidates
top_candidates = docked[:50]

print("Top 10 candidates by docking score:")
print(f"{'Rank':<6}{'Name':<25}{'Score':<10}{'MW':<8}{'LogP':<6}")
print("-" * 55)
for i, mol in enumerate(top_candidates[:10], 1):
    print(f"{i:<6}{mol['name']:<25}"
          f"{mol['best_score']:<10.3f}"
          f"{mol['properties']['molecular_weight']:<8.1f}"
          f"{mol['properties']['logp']:<6.2f}")
Step 5: ADMET Prediction on Top Hits
The final filter checks whether your top-ranked binders have acceptable pharmacokinetic and safety profiles. A compound that binds perfectly but is toxic or not absorbed is useless as a drug. We use the ADMET prediction endpoint to flag potential liabilities.
def predict_admet(molecule):
    """Get ADMET profile for a compound."""
    resp = requests.post(
        f"{BASE}/pharma/adme",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"smiles": molecule["smiles"]}
    )
    if resp.status_code != 200:
        return None
    admet = resp.json()
    molecule["admet"] = admet
    # Flag critical safety issues
    molecule["safety_flags"] = []
    if admet["toxicity"]["herg_inhibitor"]:
        molecule["safety_flags"].append("hERG")
    if admet["toxicity"]["ames_mutagenicity"]:
        molecule["safety_flags"].append("Ames")
    if admet["toxicity"]["hepatotoxicity"] == "high_risk":
        molecule["safety_flags"].append("hepatotox")
    if admet["absorption"]["caco2_class"] == "low":
        molecule["safety_flags"].append("low_absorption")
    return molecule

# Run ADMET on top 50 candidates
admet_results = []
with ThreadPoolExecutor(max_workers=20) as executor:
    futures = {
        executor.submit(predict_admet, mol): mol
        for mol in top_candidates
    }
    for future in as_completed(futures):
        result = future.result()
        if result is not None:
            admet_results.append(result)

# Filter: keep compounds with no critical safety flags
clean_hits = [m for m in admet_results if len(m["safety_flags"]) == 0]
# as_completed yields results in completion order, so restore the docking rank
clean_hits.sort(key=lambda m: m["best_score"])

print(f"\nFinal hits (no safety flags): {len(clean_hits)}/{len(admet_results)}")
print("\nFinal shortlist:")
for i, mol in enumerate(clean_hits[:10], 1):
    print(f"  {i}. {mol['name']} "
          f"(dock: {mol['best_score']:.3f}, "
          f"bioavail: {mol['admet']['absorption']['bioavailability']:.2f})")
Putting It All Together
Here is the complete pipeline assembled as a single reusable script:
import requests
import csv
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

API_KEY = "sk-sci-your-api-key"
BASE = "https://api.scirouter.ai/v1"
HEADERS = {"Authorization": f"Bearer {API_KEY}"}

# Assumes load_library, check_properties, dock_molecule, and predict_admet
# from the steps above are defined in the same file.

def screen_library(library_path, protein_pdb_path, top_n=50):
    """Full virtual screening pipeline."""
    start = time.time()

    # Step 1: Load library
    library = load_library(library_path)
    print(f"[1/5] Loaded {len(library)} molecules")

    # Step 2: Drug-likeness filter
    druglike = parallel_map(check_properties, library, workers=20)
    druglike = [m for m in druglike if m and m["passes_druglikeness"]]
    print(f"[2/5] Drug-like: {len(druglike)}")

    # Step 3: Dock against target
    dock_fn = lambda mol: dock_molecule(mol, protein_pdb_path)
    docked = parallel_map(dock_fn, druglike, workers=10)
    docked = [m for m in docked if m is not None]
    print(f"[3/5] Docked: {len(docked)}")

    # Step 4: Rank by binding score
    docked.sort(key=lambda x: x["best_score"])
    top = docked[:top_n]
    print(f"[4/5] Top {len(top)} selected")

    # Step 5: ADMET filter
    admet_done = parallel_map(predict_admet, top, workers=20)
    hits = [m for m in admet_done if m and len(m["safety_flags"]) == 0]
    # parallel_map returns results in completion order, so re-sort by score
    hits.sort(key=lambda m: m["best_score"])
    print(f"[5/5] Clean hits: {len(hits)}")

    elapsed = time.time() - start
    print(f"\nDone in {elapsed/60:.1f} minutes")
    return hits

def parallel_map(fn, items, workers=10):
    """Run fn on items in parallel and collect results."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {executor.submit(fn, item): item for item in items}
        for future in as_completed(futures):
            try:
                results.append(future.result())
            except Exception:
                # Items whose call raised are dropped silently
                pass
    return results

# Run the pipeline
hits = screen_library(
    "compound_library.csv",
    "target_protein.pdb",
    top_n=50
)

With a 1,000-molecule library, this pipeline typically completes in 8–12 minutes depending on molecule sizes and API tier. The bottleneck is the docking step; the property and ADMET calculations are fast.
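Once the pipeline returns, it is usually worth persisting the shortlist. The sketch below writes hits to CSV, assuming each hit carries the `name`, `smiles`, and `best_score` keys set by the code above. `write_hits` is a hypothetical helper, the demo records are made up, and the ordering follows this tutorial's lower-is-better convention.

```python
import csv
import io


def write_hits(hits, handle):
    """Write hits to an open text handle as CSV, lowest score first."""
    fieldnames = ["rank", "name", "smiles", "best_score"]
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    ordered = sorted(hits, key=lambda mol: mol["best_score"])
    for rank, mol in enumerate(ordered, 1):
        writer.writerow({
            "rank": rank,
            "name": mol["name"],
            "smiles": mol["smiles"],
            "best_score": f"{mol['best_score']:.3f}",
        })
    return len(ordered)


# Demo with an in-memory buffer and invented hits
demo_hits = [
    {"name": "cmpd_002", "smiles": "c1ccccc1", "best_score": -0.5},
    {"name": "cmpd_001", "smiles": "CCO", "best_score": -1.25},
]
buffer = io.StringIO()
n_written = write_hits(demo_hits, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of a buffer, pass a file opened with `open("hits.csv", "w", newline="")`.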
Optimizing Your Screen
Several strategies can improve both the speed and quality of your virtual screening results:
- Pre-filter aggressively – The more compounds you eliminate before docking, the faster the pipeline runs. Consider adding filters for PAINS (pan-assay interference compounds), reactive functional groups, and synthetic accessibility.
- Use diversity selection – If your library has many similar molecules, cluster them by fingerprint similarity and pick representatives. This avoids docking redundant compounds.
- Adjust concurrency to your tier – The free tier supports 5 concurrent requests; the Pro tier supports 50. Scale max_workers accordingly to maximize throughput without hitting rate limits.
- Cache property results – If you screen the same library against multiple targets, cache the drug-likeness and ADMET results. Only the docking step changes with different targets.
- Consider predicted structures – If no experimental structure exists for your target, fold it first with ESMFold and use the predicted structure for docking. Check that binding site residues have pLDDT above 70.
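The caching suggestion above can be sketched with a dictionary keyed by SMILES string. `cached_call` is a hypothetical helper, and `fake_properties` stands in for a real API call such as the request made in `check_properties`. Note that two different SMILES strings for the same molecule are cached separately unless they are canonicalized first.

```python
def cached_call(fetch, cache):
    """Wrap fetch(smiles) so repeated SMILES strings are served from cache."""
    def wrapper(smiles):
        if smiles not in cache:
            cache[smiles] = fetch(smiles)
        return cache[smiles]
    return wrapper


calls = []  # records every time the underlying fetch actually runs


def fake_properties(smiles):
    """Stand-in for a network call; returns a dummy record."""
    calls.append(smiles)
    return {"smiles": smiles, "length": len(smiles)}


property_cache = {}
get_properties = cached_call(fake_properties, property_cache)

for smi in ["CCO", "c1ccccc1", "CCO"]:
    get_properties(smi)

print(f"Lookups: 3, underlying calls: {len(calls)}")
```

With a thread pool, two workers may occasionally fetch the same uncached key at once; the result is a duplicate call rather than a wrong answer. Serializing the dictionary to JSON between runs would let a second target reuse the same property results.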
What Comes After Virtual Screening
Virtual screening produces a ranked shortlist, not a confirmed drug. The next steps in a typical workflow are:
- Visual inspection – Examine the top binding poses in a molecular viewer (PyMOL, ChimeraX) to check whether the predicted interactions make chemical sense
- Compound procurement – Order the top hits from chemical vendors (most ZINC compounds are purchasable from Enamine, MolPort, or other suppliers)
- Experimental validation – Test binding experimentally with biophysical assays (SPR, ITC, thermal shift) or functional assays
- Hit expansion – For confirmed hits, search for analogs and repeat the screening cycle to optimize potency, selectivity, and drug-likeness
Next Steps
You now have a complete, working virtual screening pipeline that takes a molecular library from raw SMILES to a prioritized shortlist of drug candidates. The combination of property filters, AI-powered docking, and ADMET prediction creates a multi-layered funnel that catches different failure modes at each stage.
Start by screening your own compound library. If you do not have one, download a subset from ZINC15 or ChEMBL and use the code above to identify potential hits against your target of interest.
Ready to run your first screen? Sign up for a free SciRouter API key and start screening today – 5,000 API calls per month on the free tier.