Why Protein Structure Matters
Every protein in your body is a molecular machine, and its function is determined almost entirely by its three-dimensional shape. Hemoglobin carries oxygen because of the precise geometry of its heme-binding pocket. Antibodies recognize pathogens because their variable loops fold into complementary surfaces. Enzymes catalyze reactions because their active sites position substrates with sub-angstrom precision.
For decades, determining protein structure required experimental methods like X-ray crystallography or cryo-EM – techniques that cost thousands of dollars per structure and take months to years. The computational revolution that began with AlphaFold2 in 2020 changed this fundamentally. Today, you can predict a protein's structure from its amino acid sequence in seconds, and you can do it with a single API call.
This guide walks you through the landscape of protein structure prediction tools, explains the key concepts you need to understand, and shows you how to predict your first structure using SciRouter's API in about ten lines of Python.
The Protein Structure Prediction Landscape
Three tools dominate the current landscape. Each takes a fundamentally different approach to the same problem, and understanding those differences will help you choose the right tool for your work.
AlphaFold2: The MSA-Based Gold Standard
AlphaFold2, developed by DeepMind, uses multiple sequence alignments (MSAs) as its primary input signal. It searches large databases of protein sequences (UniRef90, MGnify, BFD) to find evolutionary relatives of your query protein, aligns them, and extracts co-evolutionary patterns. These patterns reveal which residues are spatially close in the 3D structure, because residues that contact each other tend to co-evolve.
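To make the co-evolution idea concrete, here is a minimal sketch that scores how strongly two alignment columns vary together using mutual information. This is a classic, simple statistic chosen for illustration; AlphaFold2 itself learns such patterns with a neural network rather than computing an explicit score like this. The toy alignment and the function name are invented for the example.

```python
from collections import Counter
from math import log2


def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    col_i = Counter(seq[i] for seq in msa)
    col_j = Counter(seq[j] for seq in msa)
    pairs = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi


# Toy alignment (invented): column 1 (K/R) and column 3 (E/D) always change
# together, while columns 0 and 2 are fully conserved.
toy_msa = ["AKLE", "AKLE", "ARLD", "ARLD", "AKLE", "ARLD"]

coupled = column_mutual_information(toy_msa, 1, 3)
conserved = column_mutual_information(toy_msa, 0, 1)
print(f"MI(col 1, col 3) = {coupled:.2f} bits")
print(f"MI(col 0, col 1) = {conserved:.2f} bits")
```

A high score between two columns hints that the corresponding residues constrain each other, which is the kind of signal that suggests spatial contact. A fully conserved column carries no such signal, because it never varies.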
The result is exceptional accuracy – AlphaFold2 achieved a median GDT_TS above 90 on CASP14 targets, a result widely regarded as solving structure prediction for single-chain proteins with known homologs. The trade-off is speed: MSA construction requires searching terabytes of sequence databases, which takes minutes to hours per protein.
ESMFold: Speed Through Language Models
ESMFold from Meta AI takes a radically different approach. Instead of building MSAs, it uses ESM-2, a protein language model trained on millions of protein sequences. The model learns evolutionary information implicitly during pre-training, so at inference time it needs only the single input sequence – no database search required.
This makes ESMFold dramatically faster: a typical prediction completes in 5 to 15 seconds rather than minutes to hours. Accuracy is within striking distance of AlphaFold2 for proteins with many homologs, though it drops off for orphan proteins where the language model has less implicit evolutionary context. For high-throughput screening and rapid prototyping, ESMFold is often the best first choice.
Boltz-2: Complex Prediction for the Real World
Boltz-2 from MIT addresses a limitation of both AlphaFold2 and ESMFold: predicting multi-chain complexes. Proteins rarely act alone. They bind other proteins, small-molecule ligands, DNA, and RNA. Boltz-2 can model all of these interactions in a single prediction.
Boltz-2 accepts multiple chains as input and predicts how they arrange in space relative to each other. It handles protein-protein interfaces, protein-ligand binding, and protein-nucleic acid complexes. Prediction time ranges from 30 seconds to several minutes depending on the number and size of chains. For a deeper comparison of all three tools, see our ESMFold vs AlphaFold2 vs Boltz-2 comparison.
Key Concepts Before You Start
Amino Acid Sequences
Protein structure prediction starts with a sequence of amino acids represented as a string of single-letter codes. For example, the first 20 residues of human hemoglobin subunit alpha are MVLSPADKTNVKAAWGKVGA. There are 20 standard amino acids, each with a one-letter code (A for alanine, M for methionine, and so on). Your input sequence must use this standard alphabet.
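Because the input must use the standard alphabet, it can help to check a sequence locally before submitting it. The following is a small sketch; the function name is hypothetical and is not part of any API described here.

```python
# The 20 standard amino acids, by one-letter code
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def find_invalid_residues(sequence):
    """Return (position, letter) pairs, 1-based, for letters outside the standard alphabet."""
    return [
        (position, letter)
        for position, letter in enumerate(sequence.upper(), start=1)
        if letter not in STANDARD_AMINO_ACIDS
    ]


hemoglobin_start = "MVLSPADKTNVKAAWGKVGA"
print(find_invalid_residues(hemoglobin_start))  # → []
print(find_invalid_residues("MVLSXADKU"))       # → [(5, 'X'), (9, 'U')]
```

An empty list means every letter is one of the 20 standard codes. Anything else tells you exactly which positions to fix first.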
pLDDT: Your Confidence Metric
Every prediction comes with a per-residue confidence score called pLDDT (predicted Local Distance Difference Test). This score ranges from 0 to 100 and tells you how reliable each part of the predicted structure is:
- Above 90: High confidence. The backbone and side-chain positions are likely accurate.
- 70 to 90: Moderate confidence. The backbone fold is probably correct, but side-chain details may vary.
- 50 to 70: Low confidence. The predicted structure in this region should be interpreted with caution.
- Below 50: Very low confidence. These regions are often intrinsically disordered and do not adopt a stable 3D structure, although a low score can also simply mean the model is uncertain.
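The bands above translate directly into a small helper. This is a sketch with invented example scores. How the exact boundary values (90, 70, 50) are assigned is a choice made here, since the list above does not specify it.

```python
from collections import Counter


def plddt_band(score):
    """Map a pLDDT score (0-100) to one of the four confidence bands."""
    if not 0 <= score <= 100:
        raise ValueError(f"pLDDT must be between 0 and 100, got {score}")
    if score > 90:
        return "high"
    if score >= 70:
        return "moderate"
    if score >= 50:
        return "low"
    return "very low"


# Invented per-residue scores, for illustration only
scores = [95.2, 88.0, 71.5, 64.3, 42.0, 91.1]
counts = Counter(plddt_band(s) for s in scores)
for band in ("high", "moderate", "low", "very low"):
    print(f"{band}: {counts[band]} residues")
```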
PDB Format
Predicted structures are returned in PDB (Protein Data Bank) format, a text-based format that lists the 3D coordinates of every atom in the protein. PDB files can be visualized in tools like PyMOL, ChimeraX, Mol*, or any molecular viewer. The API returns the PDB content as a string that you can save directly to a file.
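Since PDB is a fixed-column text format, you can pull coordinates out of the returned string with plain string slicing. The sketch below reads the alpha-carbon (CA) atom of each residue from ATOM records, using the column positions defined by the PDB format. The three sample lines use made-up coordinates and are not real ubiquitin data.

```python
import math


def parse_ca_atoms(pdb_text):
    """Extract residue name, chain, residue number, and xyz for each CA atom."""
    atoms = []
    for line in pdb_text.splitlines():
        # Columns 13-16 hold the atom name; Python slices are 0-based
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            atoms.append({
                "residue": line[17:20].strip(),
                "chain": line[21],
                "number": int(line[22:26]),
                "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
            })
    return atoms


# Made-up coordinates in valid PDB column layout
sample_pdb = "\n".join([
    "ATOM      1  N   MET A   1      -1.200   0.800   0.000  1.00 92.50",
    "ATOM      2  CA  MET A   1       0.000   0.000   0.000  1.00 92.50",
    "ATOM      3  CA  GLN A   2       3.800   0.000   0.000  1.00 88.10",
])

ca_atoms = parse_ca_atoms(sample_pdb)
distance = math.dist(ca_atoms[0]["xyz"], ca_atoms[1]["xyz"])
print(f"{len(ca_atoms)} CA atoms, first CA-CA distance: {distance:.1f} Å")
```

For anything beyond quick checks, a dedicated structure-parsing library is a safer choice than hand-rolled slicing.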
Your First Structure Prediction in 10 Lines of Python
Let's predict the structure of a real protein. We'll use the sequence of human ubiquitin, a small, well-characterized protein with 76 residues. This is a great test case because its structure is known experimentally, so you can verify the prediction.
import time

import requests

API_KEY = "sk-sci-your-api-key"
BASE = "https://api.scirouter.ai/v1"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Human ubiquitin sequence (76 residues)
sequence = "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG"

# Submit the folding job
job = requests.post(f"{BASE}/proteins/fold",
                    headers=headers,
                    json={"sequence": sequence, "model": "esmfold"}).json()
job_id = job["job_id"]

# Poll every 2 seconds until complete, giving up after 5 minutes
deadline = time.monotonic() + 300
while job["status"] != "completed":
    if time.monotonic() > deadline:
        raise TimeoutError(f"Job {job_id} did not complete within 5 minutes")
    time.sleep(2)
    job = requests.get(f"{BASE}/proteins/fold/{job_id}",
                       headers=headers).json()

# Save the predicted structure
with open("ubiquitin_predicted.pdb", "w") as f:
    f.write(job["result"]["pdb_string"])

print(f"Mean pLDDT: {job['result']['mean_plddt']:.1f}")
print("Structure saved to ubiquitin_predicted.pdb")

That's it. The response includes the predicted 3D coordinates in PDB format, a mean pLDDT score for overall confidence, and per-residue pLDDT values so you can identify which regions are well-predicted and which are uncertain.
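Polling logic like the loop above is easier to reuse and test when it is wrapped in a helper that takes the status-fetching step as a function. The sketch below is one way to do that. The helper name is hypothetical, and the "failed" and "error" status values are assumptions, since only "completed" is confirmed by the examples in this guide.

```python
import time


def wait_for_job(fetch_status, timeout_s=300.0, interval_s=2.0, sleep=time.sleep):
    """Call fetch_status() until the job completes, fails, or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_status()
        status = job.get("status")
        if status == "completed":
            return job
        if status in ("failed", "error"):  # assumed failure states
            raise RuntimeError(f"Job ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still {status!r} after {timeout_s} seconds")
        sleep(interval_s)


# Simulated responses stand in for real HTTP calls so the sketch runs offline
fake_responses = iter([
    {"status": "running"},
    {"status": "running"},
    {"status": "completed", "result": {"mean_plddt": 91.3}},
])
finished = wait_for_job(lambda: next(fake_responses), interval_s=0)
print(f"Mean pLDDT: {finished['result']['mean_plddt']:.1f}")
```

In real use, `fetch_status` would wrap the `requests.get(...)` call from the example above, for instance `lambda: requests.get(url, headers=headers).json()`.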
Understanding Your Results
Interpreting pLDDT Scores
For ubiquitin, you should see a mean pLDDT in the high 80s or 90s – this is a well-studied protein with many homologs, so ESMFold predicts it with high confidence. Here is how to extract and analyze the per-residue scores:
from itertools import groupby

# Continues from the previous snippet: `job` is the completed folding job
plddt_scores = job["result"]["plddt_per_residue"]

# Find high-confidence and low-confidence residues (1-based positions)
high_conf = [i + 1 for i, s in enumerate(plddt_scores) if s > 90]
low_conf = [i + 1 for i, s in enumerate(plddt_scores) if s < 50]
print(f"High-confidence residues (>90): {len(high_conf)} of {len(plddt_scores)}")
print(f"Potentially disordered residues (<50): {len(low_conf)}")

# Identify stretches of disorder by grouping consecutive low-confidence positions
if low_conf:
    # Consecutive positions share the same (position - index) offset
    for _, run in groupby(enumerate(low_conf), key=lambda pair: pair[1] - pair[0]):
        positions = [pos for _, pos in run]
        print(f"Low-confidence stretch: residues {positions[0]}-{positions[-1]}")
else:
    print("No disordered regions detected")

Visualizing the Structure
The PDB file you saved can be loaded into any molecular visualization tool. For quick inspection, web-based viewers like Mol* (used by the RCSB PDB) work well. For publication figures, PyMOL or ChimeraX give you more control over rendering. Color by pLDDT to see confidence mapped directly onto the structure – blue for high confidence, red for low.
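If you want to build your own coloring, one approach is to read a per-residue confidence value from the PDB file and map it to a color. AlphaFold and ESMFold conventionally write pLDDT into the B-factor column of their PDB output. Whether the files returned by this API do the same is an assumption you should verify against the per-residue scores in the response. The red-to-blue ramp below is a simple illustrative choice, not the exact palette any particular viewer uses.

```python
def bfactors_by_residue(pdb_text):
    """Map (chain, residue number) to the B-factor of that residue's CA atom."""
    values = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            # Columns 61-66 hold the B-factor (assumed here to store pLDDT)
            values[(line[21], int(line[22:26]))] = float(line[60:66])
    return values


def confidence_rgb(plddt):
    """Linear color ramp: 0 maps to red (1, 0, 0), 100 maps to blue (0, 0, 1)."""
    t = min(max(plddt, 0.0), 100.0) / 100.0
    return (1.0 - t, 0.0, t)


# Made-up lines in valid PDB column layout
sample_pdb = "\n".join([
    "ATOM      2  CA  MET A   1       0.000   0.000   0.000  1.00 90.00",
    "ATOM      3  CA  GLN A   2       3.800   0.000   0.000  1.00 30.00",
])

residue_scores = bfactors_by_residue(sample_pdb)
for (chain, number), score in residue_scores.items():
    r, g, b = confidence_rgb(score)
    print(f"{chain}{number}: pLDDT {score:.0f} -> RGB ({r:.2f}, {g:.2f}, {b:.2f})")
```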
Going Further: Batch Processing
One of the biggest advantages of API-based structure prediction is automation. Instead of submitting one sequence at a time through a web form, you can script batch predictions over hundreds or thousands of sequences:
import time

import requests

API_KEY = "sk-sci-your-api-key"
BASE = "https://api.scirouter.ai/v1"
headers = {"Authorization": f"Bearer {API_KEY}"}

sequences = {
    "ubiquitin": "MQIFVKTLTGKTITLEVEPSDTIENVKAKIQDKEGIPPDQQRLIFAGKQLEDGRTLSDYNIQKESTLHLVLRLRGG",
    "insulin_a": "GIVEQCCTSICSLYQLENYCN",
    "lysozyme_fragment": "KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAK",
}

# Submit every sequence before polling any of them
jobs = {}
for name, seq in sequences.items():
    resp = requests.post(f"{BASE}/proteins/fold", headers=headers,
                         json={"sequence": seq, "model": "esmfold"}).json()
    jobs[name] = resp["job_id"]
    print(f"Submitted {name}: job {resp['job_id']}")

# Poll each job in turn, giving up on any job that takes longer than 10 minutes
for name, job_id in jobs.items():
    deadline = time.monotonic() + 600
    while time.monotonic() < deadline:
        result = requests.get(f"{BASE}/proteins/fold/{job_id}",
                              headers=headers).json()
        if result["status"] == "completed":
            with open(f"{name}.pdb", "w") as f:
                f.write(result["result"]["pdb_string"])
            print(f"{name}: pLDDT = {result['result']['mean_plddt']:.1f}")
            break
        time.sleep(2)
    else:
        print(f"{name}: timed out waiting for job {job_id}")

When to Use Complex Prediction
If your protein interacts with other molecules – another protein chain, a small-molecule drug, or a nucleic acid – single-chain prediction only tells part of the story. The Boltz-2 endpoint on SciRouter lets you submit multiple chains and predict how they assemble into a complex:
import requests

API_KEY = "sk-sci-your-api-key"
BASE = "https://api.scirouter.ai/v1"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Predict a two-chain complex
job = requests.post(f"{BASE}/proteins/fold", headers=headers,
                    json={
                        "sequences": [
                            "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH",
                            "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST"
                        ],
                        "model": "boltz2"
                    }).json()
print(f"Complex prediction submitted: {job['job_id']}")

Practical Tips for Better Predictions
- Sequence length matters. ESMFold handles sequences up to about 1024 residues well. For longer proteins, consider splitting into domains.
- Remove signal peptides. If your sequence includes a signal peptide or transit peptide, remove it before prediction. These regions are cleaved in vivo and will produce low-confidence noise.
- Check for non-standard residues. Replace selenocysteine (U) with cysteine (C) and pyrrolysine (O) with lysine (K) before submitting.
- Use pLDDT for triage. If mean pLDDT is below 60, the prediction may not be reliable enough for downstream analysis like docking or active site characterization.
- Validate against known structures. When possible, compare predictions against experimental structures in the PDB to calibrate your expectations.
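Some of the tips above are easy to automate. The sketch below flags over-long sequences and non-standard letters before submission, and applies the pLDDT triage rule afterward. The function names are hypothetical, and the defaults of 1024 residues and a mean pLDDT of 60 simply mirror the rules of thumb in the list; adjust them to your own tolerance.

```python
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def preflight_warnings(sequence, max_length=1024):
    """Return human-readable warnings about a sequence before it is submitted."""
    warnings = []
    if len(sequence) > max_length:
        warnings.append(
            f"Length {len(sequence)} exceeds {max_length}; consider splitting into domains"
        )
    unusual = sorted(set(sequence.upper()) - STANDARD_AMINO_ACIDS)
    if unusual:
        warnings.append(f"Non-standard letters present: {', '.join(unusual)}")
    return warnings


def reliable_enough(mean_plddt, threshold=60.0):
    """Triage rule: treat predictions below the threshold as unreliable."""
    return mean_plddt >= threshold


print(preflight_warnings("GIVEQCCTSICSLYQLENYCN"))  # → []
print(preflight_warnings("GIVEQOCTSU"))
print(reliable_enough(87.4), reliable_enough(41.0))  # → True False
```

Signal peptides are not handled here. Detecting them reliably needs a dedicated predictor or curated annotation rather than a simple string check.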
Next Steps
Now that you can predict structures, consider what you can do with them. Predicted structures are inputs to molecular docking (finding how drugs bind), binding site analysis, protein engineering, and phylogenetic studies. Read our comparison of ESMFold, AlphaFold2, and Boltz-2 to understand which tool to use for different scenarios, or explore the ESMFold and Boltz-2 tool pages for detailed API documentation.
Sign up for a free API key and start predicting structures today. No GPU, no database downloads, no Docker containers – just send a sequence and get a structure back.