
ProteinMPNN Tutorial: Design New Protein Sequences for Any Backbone Structure

Step-by-step ProteinMPNN tutorial: design protein sequences via API, validate with ESMFold round-trip folding, and run full engineering workflows with the SciRouter Python SDK.

Ryan Bethencourt
April 8, 2026
10 min read

What Is Inverse Folding – and Why It Matters

Protein structure prediction answers the question: given a sequence of amino acids, what three-dimensional shape does the protein adopt? Tools like ESMFold and AlphaFold2 solve this problem with remarkable accuracy. But for protein engineering, you need the reverse: given a desired three-dimensional shape, what amino acid sequence will fold into it?

This is the inverse folding problem, and it is the foundation of computational protein design. If you can solve inverse folding reliably, you can engineer proteins with specific structures – enzymes with precisely shaped active sites, binders that dock onto a target surface, switches that change conformation in response to signals. You are no longer limited to the sequences that evolution happened to produce. You can design new ones from scratch.

ProteinMPNN, developed by Justas Dauparas and the David Baker lab at the University of Washington, is the current state of the art for inverse folding. Published in Science in 2022, it uses a message-passing neural network (MPNN) architecture that reads backbone atom coordinates and outputs amino acid probabilities at each position. In experimental validation, ProteinMPNN-designed sequences fold into the target structure roughly 50–70% of the time – a dramatic improvement over Rosetta's 10–30% success rate that had been the benchmark for two decades.

This tutorial walks through the complete ProteinMPNN workflow using SciRouter's API: obtaining a backbone structure, running ProteinMPNN with various parameters, interpreting the output, validating designs with round-trip ESMFold folding, and integrating everything into a Python engineering pipeline. By the end, you will have the tools to design protein sequences for any backbone structure.

ProteinMPNN vs RFDiffusion vs Chroma: Choosing the Right Tool

Before diving into the tutorial, it helps to understand where ProteinMPNN fits in the landscape of protein design tools. Three major approaches dominate computational protein design in 2026, and they solve different problems.

| Feature | ProteinMPNN | RFDiffusion | Chroma |
| --- | --- | --- | --- |
| Task | Sequence design for fixed backbone | Backbone generation (de novo) | Backbone + sequence co-design |
| Input | PDB backbone coordinates | Conditioning signals (hotspots, symmetry) | Natural language or constraints |
| Output | Amino acid sequences | Backbone coordinates (no sequence) | Full structure + sequence |
| Architecture | Message-passing neural network | Denoising diffusion (RoseTTAFold) | Denoising diffusion (custom GNN) |
| Speed | ~2-5 seconds per design | ~30-120 seconds per backbone | ~60-180 seconds per design |
| Experimental success rate | 50-70% fold correctly | Varies (needs ProteinMPNN for sequence) | ~40-60% (less validated) |
| Best for | Redesigning existing proteins | Creating novel backbones | Exploratory design with constraints |
| GPU requirement | A5000 / RTX 4090 class | A100 40GB+ | A100 40GB+ |
| API access via SciRouter | Yes | Coming soon | Not available |

The key insight: ProteinMPNN and RFDiffusion are complementary, not competing. A typical de novo design pipeline first uses RFDiffusion to generate a novel backbone, then ProteinMPNN to design a sequence for that backbone, then ESMFold to validate that the designed sequence folds back into the intended structure. ProteinMPNN is also used standalone when you want to redesign an existing protein – stabilizing it, changing its solubility, or creating variants while preserving the fold.

Chroma (from Generate Biomedicines) takes a different approach by co-designing backbone and sequence simultaneously, with natural language conditioning. It is powerful for exploratory design but currently less well-validated experimentally and not available via API. For production protein engineering workflows, the ProteinMPNN + ESMFold combination remains the most reliable and accessible approach.

Step 1: Getting a Backbone Structure

ProteinMPNN requires a protein backbone as input – specifically, the 3D coordinates of backbone atoms (N, CA, C, O) at each residue position. There are three main ways to obtain a backbone structure for your design project.
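
Before submitting a structure, it can help to confirm that every residue actually carries all four backbone atoms. The following is a minimal plain-Python sketch, assuming standard fixed-column PDB ATOM records. The function name missing_backbone_atoms is illustrative only; it is not part of ProteinMPNN or the SciRouter SDK.

```python
BACKBONE = ("N", "CA", "C", "O")

def missing_backbone_atoms(pdb_text):
    """Return {(chain, residue_number, insertion_code): [missing atom names]}.

    Hypothetical helper. Assumes fixed-column PDB ATOM records: atom name in
    columns 13-16, chain ID in column 22, residue number in columns 23-26.
    """
    seen = {}
    for line in pdb_text.splitlines():
        # Skip non-ATOM records and lines too short to hold coordinates
        if not line.startswith("ATOM") or len(line) < 54:
            continue
        atom_name = line[12:16].strip()
        key = (line[21], int(line[22:26]), line[26:27].strip())
        seen.setdefault(key, set()).add(atom_name)
    # Report only residues that lack at least one backbone atom
    return {
        key: [name for name in BACKBONE if name not in atoms]
        for key, atoms in seen.items()
        if any(name not in atoms for name in BACKBONE)
    }
```

An empty dictionary means every residue found in the file has N, CA, C, and O. Residues that are absent from the file altogether (unresolved loops, for example) are not detected by this check.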

Option A: Download from the PDB

If you want to redesign an existing protein, download its structure from the Protein Data Bank. For this tutorial, we will use two well-characterized proteins:

  • 1EMA – Green Fluorescent Protein (GFP), a 238-residue beta-barrel widely used as a reporter in biology
  • 4LYZ – Hen egg-white lysozyme, a 129-residue enzyme that is one of the most extensively studied proteins in biochemistry
python
import requests

# Download PDB structure
def download_pdb(pdb_id: str) -> str:
    """Download a PDB file and return as string."""
    url = f"https://files.rcsb.org/download/{pdb_id}.pdb"
    response = requests.get(url)
    response.raise_for_status()
    return response.text

# Get lysozyme structure
pdb_content = download_pdb("4LYZ")
print(f"Downloaded 4LYZ: {len(pdb_content)} bytes")

# Save locally
with open("4lyz.pdb", "w") as f:
    f.write(pdb_content)

Option B: Predict with ESMFold

If you have a sequence but no experimentally determined structure, use ESMFold to predict the backbone. This is common when working with natural protein variants, metagenomic sequences, or computationally designed starting points.

python
import time

import scirouter

client = scirouter.SciRouter(api_key="sk-sci-YOUR_KEY")

# Predict structure from sequence
result = client.proteins.fold(
    sequence="MSKGEELFTGVVPILVELDGDVNGHKFSVRGEGEGDATNGKLTLKFICTTGKLPVPWPTLVTTLTYGVQCFSRYPDHMKRHDFFKSAMPEGYVQERTISFKDDGTYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNFNSHNVYITADKQKNGIKANFKIRHNVEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSKLSKDPNEKRDHMVLLEFVTAAGITHGMDELYK",
    model="esmfold"
)

# Wait for the job to complete (same polling pattern as in Step 4)
job = result
while job.status in ("pending", "running"):
    time.sleep(3)
    job = client.proteins.get_job(result.job_id)
print(f"ESMFold job status: {job.status}")
print(f"Mean pLDDT: {job.mean_plddt:.1f}")

# Save the predicted PDB for ProteinMPNN input
with open("gfp_predicted.pdb", "w") as f:
    f.write(job.pdb_string)

Option C: De Novo Backbone from RFDiffusion

For entirely new protein designs, generate a novel backbone using RFDiffusion or similar tools, then use ProteinMPNN to design a sequence for it. This two-step approach – backbone generation followed by inverse folding – is the standard pipeline for de novo protein design. Since RFDiffusion outputs backbone coordinates without sequences, ProteinMPNN is an essential second step.

Step 2: Running ProteinMPNN via SciRouter

With a backbone structure in hand, you can now run ProteinMPNN. The SciRouter API exposes all of ProteinMPNN's key parameters: temperature, number of sequences, fixed residue positions, and chain selection.

Basic Design: Redesign the Entire Protein

python
import scirouter

client = scirouter.SciRouter(api_key="sk-sci-YOUR_KEY")

# Read the PDB file
with open("4lyz.pdb", "r") as f:
    pdb_string = f.read()

# Run ProteinMPNN - redesign all positions
result = client.design.proteinmpnn(
    pdb_string=pdb_string,
    num_sequences=8,       # Generate 8 sequence variants
    temperature=0.1,       # Low temp = high confidence, low diversity
)

# Examine the designed sequences
for i, seq in enumerate(result.sequences):
    print(f"\nDesign {i+1}:")
    print(f"  Sequence: {seq.sequence[:60]}...")
    print(f"  Score: {seq.score:.3f}")
    print(f"  Recovery: {seq.sequence_recovery:.1%}")

The score is ProteinMPNN's negative log-likelihood – lower scores indicate higher confidence that the sequence will fold into the target structure. The sequence_recovery is the fraction of positions matching the native sequence. At temperature 0.1, expect recovery around 50–55% for natural proteins and scores around -1.5 to -2.5.
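
Sequence recovery is simple enough to recompute by hand, which is useful for checking a design against a reference other than the one the API used. The sketch below assumes the two sequences are the same length and already aligned position by position; the function names are illustrative and not part of the SDK.

```python
def sequence_recovery(designed, native):
    """Fraction of positions where the designed residue equals the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for d, n in zip(designed, native) if d == n)
    return matches / len(native)

def list_mutations(designed, native):
    """Substitutions in the usual notation, e.g. 'K1M' (1-based sequence positions)."""
    return [
        f"{n}{i}{d}"
        for i, (d, n) in enumerate(zip(designed, native), start=1)
        if d != n
    ]

# Toy five-residue example
print(sequence_recovery("MVFGR", "KVFGR"))
print(list_mutations("MVFGR", "KVFGR"))
```

The positions reported by list_mutations count from the start of the sequence string, so they match PDB residue numbers only when the structure is numbered from 1 without gaps.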

Advanced: Fixed Residues and Chain Selection

In real engineering projects, you rarely want to redesign every residue. Catalytic residues, disulfide bonds, and known binding contacts should be held fixed while the surrounding sequence is optimized. ProteinMPNN supports this through the fixed_positions parameter.

python
# Lysozyme active site residues: Glu35 and Asp52 are catalytic
# Fix these positions plus surrounding contacts

result = client.design.proteinmpnn(
    pdb_string=pdb_string,
    num_sequences=16,
    temperature=0.3,          # Moderate diversity
    fixed_positions={
        "A": [35, 52, 53, 57, 62, 63, 101, 108]  # Chain A, catalytic + binding residues
    },
)

# Check that fixed positions were preserved
native_seq = "KVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTDGSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAWVAWRNRCKGTDVQAWIRGCRL"

for i, seq in enumerate(result.sequences):
    # Verify catalytic residues are unchanged
    assert seq.sequence[34] == native_seq[34], "Glu35 was modified!"
    assert seq.sequence[51] == native_seq[51], "Asp52 was modified!"
    print(f"Design {i+1}: recovery={seq.sequence_recovery:.1%}, score={seq.score:.3f}")
    # Show which positions changed
    changes = sum(1 for a, b in zip(seq.sequence, native_seq) if a != b)
    print(f"  {changes} positions redesigned out of {len(native_seq)}")

For multi-chain complexes, you can select which chains to redesign. This is essential for interface design – for example, redesigning a binder protein while holding the target protein fixed.

python
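# For a two-chain complex (e.g., antibody-antigen)
# Redesign the antibody (chain H) while keeping the antigen (chain A) fixed
# "complex.pdb" is a placeholder path; point it at your own two-chain structure
with open("complex.pdb", "r") as f:
    complex_pdb_string = f.read()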

result = client.design.proteinmpnn(
    pdb_string=complex_pdb_string,
    num_sequences=8,
    temperature=0.2,
    design_chains=["H"],         # Only redesign chain H
    fixed_positions={
        "H": [95, 96, 97, 98, 99, 100, 101]  # Fix CDR3 tip residues
    },
)

Temperature: Controlling Sequence Diversity

The temperature parameter is arguably the most important setting in ProteinMPNN. It controls the sampling distribution over amino acids at each position.

  • Temperature 0.1 – Near-greedy sampling. Produces sequences that stay close to the model's most confident choice at each position. Highest recovery, lowest diversity. Use when you want the most stable possible design.
  • Temperature 0.2–0.3 – Low diversity. Sequences are similar but with occasional substitutions. Good default for engineering projects.
  • Temperature 0.5 – Moderate diversity. Useful for generating a library of variants to screen experimentally.
  • Temperature 0.8–1.0 – High diversity. Sequences diverge significantly from each other and from the native. Use when you specifically need sequence exploration, such as finding orthogonal designs or searching for improved function.
python
# Generate designs at multiple temperatures for comparison
temperatures = [0.1, 0.3, 0.5, 1.0]

for temp in temperatures:
    result = client.design.proteinmpnn(
        pdb_string=pdb_string,
        num_sequences=4,
        temperature=temp,
    )
    avg_recovery = sum(s.sequence_recovery for s in result.sequences) / len(result.sequences)
    avg_score = sum(s.score for s in result.sequences) / len(result.sequences)
    print(f"T={temp:.1f}: avg recovery={avg_recovery:.1%}, avg score={avg_score:.3f}")

A typical result pattern: at T=0.1 you see ~52% recovery and scores around -2.0; at T=1.0, recovery drops to ~25% and scores rise to -1.0. The lower-temperature designs are more likely to fold correctly, but the higher-temperature designs explore more of sequence space and may discover sequences with improved properties (better solubility, thermostability, or function).
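
To see why low temperatures concentrate sampling on one residue, it helps to look at temperature-scaled softmax in isolation. This is a generic illustration rather than ProteinMPNN's own code, and the logits below are invented for a single position.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for four candidate residues at one position
logits = {"L": 2.0, "I": 1.5, "V": 1.0, "A": 0.0}
for t in (0.1, 0.5, 1.0):
    probs = softmax_with_temperature(list(logits.values()), t)
    summary = ", ".join(f"{aa}={p:.2f}" for aa, p in zip(logits, probs))
    print(f"T={t}: {summary}")
```

At T=0.1 nearly all of the probability mass lands on the top-scoring residue, while at T=1.0 the alternatives keep a substantial share, which is where the extra sequence diversity comes from.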

Step 3: Interpreting ProteinMPNN Output

Understanding ProteinMPNN's output requires looking beyond just the sequence strings. Three metrics matter most for evaluating designs.

Score (Negative Log-Likelihood)

The score represents how well the designed sequence fits the backbone, according to ProteinMPNN's learned model. Lower (more negative) is better. Scores below -2.0 generally indicate high-confidence designs. Scores above -1.0 suggest the model struggled with the backbone – possibly because it contains unusual structural features or strained geometries. Compare scores across designs to rank them, but do not treat the absolute value as a folding probability.

Sequence Recovery

When redesigning a natural protein, sequence recovery tells you how much the design diverged from the original. Recovery of 50–60% at temperature 0.1 is typical for well-folded proteins. Very high recovery (>70%) suggests the backbone strongly constrains the sequence – often seen in beta-barrels and tightly packed cores. Low recovery (<30%) at low temperature may indicate a problematic backbone with strained or unusual geometry.

Importantly, positions where ProteinMPNN deviates from the native sequence are not necessarily wrong. The model may have found alternative amino acids that are equally or even more compatible with the fold. Many successful engineered proteins have sequences that differ from any natural protein by 40% or more.

Per-Position Confidence

ProteinMPNN outputs a probability distribution over 20 amino acids at each position. Positions where the model assigns >90% probability to a single amino acid are highly constrained by the backbone geometry – these are typically buried core positions or structurally critical glycine/proline residues. Positions with more uniform distributions are surface-exposed and tolerant of substitution. This per-position information is invaluable for identifying which mutations are safe and which are risky.
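
One way to turn per-position distributions into a quick report is to flag positions by their top probability and entropy. The sketch below assumes the probabilities are already available as one dictionary per position (amino acid letter to probability). That input format is an assumption for illustration; this tutorial does not show how the SDK exposes these values.

```python
import math

def position_entropy(probabilities):
    """Shannon entropy in bits of one position's amino acid distribution."""
    return -sum(p * math.log2(p) for p in probabilities.values() if p > 0)

def classify_positions(per_position_probs, confident=0.9):
    """Label a position 'constrained' if one residue exceeds `confident`, else 'tolerant'."""
    report = []
    for index, probs in enumerate(per_position_probs, start=1):
        best_aa, best_p = max(probs.items(), key=lambda item: item[1])
        label = "constrained" if best_p > confident else "tolerant"
        report.append((index, best_aa, round(position_entropy(probs), 2), label))
    return report

# Hypothetical distributions for two positions (remaining residues omitted)
example = [
    {"G": 0.95, "A": 0.05},
    {"K": 0.40, "E": 0.30, "Q": 0.30},
]
for row in classify_positions(example):
    print(row)
```

Low-entropy positions are the ones to leave alone when introducing manual mutations; high-entropy positions are better candidates for substitution.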

Step 4: Validation with Round-Trip ESMFold Folding

The most important computational validation for any ProteinMPNN design is round-trip folding: take the designed sequence, predict its structure with ESMFold (or AlphaFold2), and measure how closely the predicted structure matches the original target backbone. If the round-trip RMSD is below 2 Angstroms, the design is very likely to fold correctly in the lab.

python
import scirouter
import time

client = scirouter.SciRouter(api_key="sk-sci-YOUR_KEY")

# Step 1: Design sequences for lysozyme (4LYZ)
with open("4lyz.pdb", "r") as f:
    pdb_string = f.read()

designs = client.design.proteinmpnn(
    pdb_string=pdb_string,
    num_sequences=8,
    temperature=0.2,
)

# Step 2: Fold each designed sequence with ESMFold
validated = []
for i, seq in enumerate(designs.sequences):
    print(f"\nValidating design {i+1}/{len(designs.sequences)}...")

    fold_result = client.proteins.fold(
        sequence=seq.sequence,
        model="esmfold",
    )

    # Poll until complete
    job = fold_result
    while job.status in ("pending", "running"):
        time.sleep(3)
        job = client.proteins.get_job(fold_result.job_id)

    if job.status == "completed":
        validated.append({
            "design_index": i + 1,
            "sequence": seq.sequence,
            "mpnn_score": seq.score,
            "recovery": seq.sequence_recovery,
            "plddt": job.mean_plddt,
            "pdb": job.pdb_string,
        })
        print(f"  pLDDT: {job.mean_plddt:.1f}")
        print(f"  MPNN score: {seq.score:.3f}")
        print(f"  Recovery: {seq.sequence_recovery:.1%}")
    else:
        print(f"  Folding failed: {job.status}")

# Step 3: Rank by pLDDT (proxy for fold quality)
validated.sort(key=lambda x: x["plddt"], reverse=True)

print("\n=== Top Designs (ranked by pLDDT) ===")
for v in validated[:3]:
    print(f"  Design {v['design_index']}: pLDDT={v['plddt']:.1f}, "
          f"score={v['mpnn_score']:.3f}, recovery={v['recovery']:.1%}")
Note: Round-trip validation is the single most important quality check for ProteinMPNN designs. Designs with ESMFold pLDDT above 80 and backbone RMSD below 2 Angstroms are strong candidates for experimental testing. Designs with pLDDT below 60 should be discarded.

The round-trip validation serves as a computational filter that dramatically reduces the number of designs you need to test experimentally. A typical workflow generates 50–100 ProteinMPNN designs, validates all of them with ESMFold in under an hour, and selects the top 5–10 for synthesis and experimental characterization. This is orders of magnitude cheaper than synthesizing all candidates blindly.
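
The Step 4 script ranks designs by pLDDT but does not compute the backbone RMSD itself. The following dependency-free sketch shows one way to do that on CA atoms, using Horn's quaternion formulation with a simple power iteration to find the optimal superposition. It assumes both structures contain the same residues in the same order and use fixed-column PDB records; the function names are illustrative, and a library such as Biopython would be a reasonable substitute in practice.

```python
import math

def ca_coords(pdb_text, chain_id="A"):
    """Collect CA coordinates for one chain from fixed-column PDB text."""
    coords = []
    for line in pdb_text.splitlines():
        if (line.startswith("ATOM") and len(line) >= 54
                and line[12:16].strip() == "CA"
                and line[16] in (" ", "A")      # first alternate location only
                and line[21] == chain_id):
            coords.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return coords

def superposed_rmsd(a, b, iterations=1000):
    """RMSD between two equal-length point lists after optimal rigid superposition."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    n = len(a)

    def centered(points):
        centroid = [sum(p[k] for p in points) / n for k in range(3)]
        return [[p[k] - centroid[k] for k in range(3)] for p in points]

    a, b = centered(a), centered(b)
    ga = sum(x * x for p in a for x in p)
    gb = sum(x * x for p in b for x in p)
    shift = math.sqrt(ga * gb)
    if shift == 0.0:
        return math.sqrt((ga + gb) / n)  # one set collapses to a single point

    # 3x3 correlation matrix between the two centered sets
    s = [[sum(p[i] * q[j] for p, q in zip(a, b)) for j in range(3)] for i in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    # Horn's symmetric 4x4 matrix; its largest eigenvalue is the best overlap
    nmat = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    # Power iteration on (N + shift*I); the shift keeps all eigenvalues non-negative
    v = [0.9, 0.3, 0.2, 0.1]  # deliberately asymmetric start vector
    start_norm = math.sqrt(sum(x * x for x in v))
    v = [x / start_norm for x in v]
    for _ in range(iterations):
        w = [sum(nmat[i][j] * v[j] for j in range(4)) + shift * v[i] for i in range(4)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    lam = sum(v[i] * nmat[i][j] * v[j] for i in range(4) for j in range(4))
    return math.sqrt(max(0.0, (ga + gb - 2.0 * lam) / n))
```

With the variables from the Step 4 script, a call would look like superposed_rmsd(ca_coords(pdb_string), ca_coords(v["pdb"])), assuming the predicted structure is also written as chain A. Experimental structures with unresolved residues need to be matched residue by residue first, otherwise the two lists will differ in length.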

Step 5: Stability and Solubility Prediction

A sequence that folds correctly is necessary but not sufficient. For practical applications, the designed protein must also be stable (resistant to thermal denaturation) and soluble (expressed in a soluble form rather than forming inclusion bodies). SciRouter provides stability and solubility prediction endpoints that complement the ProteinMPNN + ESMFold workflow.

python
# Predict stability and solubility for top designs
for v in validated[:5]:
    # Stability prediction (ThermoMPNN-based)
    stability = client.design.stability(
        sequence=v["sequence"],
        pdb_string=v["pdb"],
    )

    # Solubility prediction (SoluProt-based)
    solubility = client.design.solubility(
        sequence=v["sequence"],
    )

    print(f"\nDesign {v['design_index']}:")
    print(f"  Stability (ddG): {stability.predicted_ddg:.2f} kcal/mol")
    print(f"  Solubility score: {solubility.solubility_score:.2f}")
    print(f"  Soluble: {solubility.is_soluble}")
    print(f"  pLDDT: {v['plddt']:.1f}")

Stability is predicted as a delta-delta-G (ddG) value – negative values indicate stabilizing mutations relative to the wild type. Solubility is reported as a probability score between 0 and 1, with values above 0.5 predicting soluble expression. The ideal design has high pLDDT (above 80), negative ddG (more stable than wild type), and solubility score above 0.6.
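
The three criteria for an ideal design can be expressed as a single filter. This is a small sketch with made-up numbers; the dictionary keys and the function name are assumptions for illustration, and the thresholds are the ones quoted in the paragraph above.

```python
def passes_triage(design, min_plddt=80.0, max_ddg=0.0, min_solubility=0.6):
    """True if a design meets all three thresholds: pLDDT, ddG, and solubility."""
    return (
        design["plddt"] > min_plddt
        and design["ddg"] < max_ddg
        and design["solubility"] > min_solubility
    )

# Made-up values for illustration only
candidates = [
    {"name": "design_1", "plddt": 86.2, "ddg": -0.8, "solubility": 0.71},
    {"name": "design_2", "plddt": 91.0, "ddg": 0.4, "solubility": 0.80},
    {"name": "design_3", "plddt": 74.5, "ddg": -1.1, "solubility": 0.65},
]
shortlist = [c["name"] for c in candidates if passes_triage(c)]
print(shortlist)
```

Here design_2 is rejected for its positive ddG and design_3 for its pLDDT, even though each looks strong on the other two metrics.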

Complete Engineering Workflow: From PDB to Validated Designs

Here is the complete end-to-end workflow that combines all the steps above into a single Python script. This is the pattern used in SciRouter's Protein Engineering Lab.

python
import scirouter
import time
import json

client = scirouter.SciRouter(api_key="sk-sci-YOUR_KEY")

# ──────────────────────────────────────
# Configuration
# ──────────────────────────────────────
PDB_ID = "4LYZ"                       # Lysozyme
DESIGN_CHAINS = ["A"]                  # Redesign chain A
FIXED_RESIDUES = {"A": [35, 52]}       # Keep catalytic Glu35 and Asp52
NUM_DESIGNS = 16                       # Generate 16 variants
TEMPERATURE = 0.2                      # Low-moderate diversity
PLDDT_THRESHOLD = 75.0                 # Minimum pLDDT for validation
SOLUBILITY_THRESHOLD = 0.5             # Minimum solubility score

# ──────────────────────────────────────
# Step 1: Get backbone structure
# ──────────────────────────────────────
import requests
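pdb_url = f"https://files.rcsb.org/download/{PDB_ID}.pdb"
response = requests.get(pdb_url)
response.raise_for_status()  # fail early instead of passing an error page to ProteinMPNN
pdb_string = response.text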
print(f"[1/5] Downloaded {PDB_ID} backbone")

# ──────────────────────────────────────
# Step 2: Design sequences with ProteinMPNN
# ──────────────────────────────────────
designs = client.design.proteinmpnn(
    pdb_string=pdb_string,
    num_sequences=NUM_DESIGNS,
    temperature=TEMPERATURE,
    design_chains=DESIGN_CHAINS,
    fixed_positions=FIXED_RESIDUES,
)
print(f"[2/5] ProteinMPNN generated {len(designs.sequences)} sequences")

# ──────────────────────────────────────
# Step 3: Validate with round-trip ESMFold
# ──────────────────────────────────────
validated = []
for i, seq in enumerate(designs.sequences):
    fold = client.proteins.fold(sequence=seq.sequence, model="esmfold")
    job = fold
    while job.status in ("pending", "running"):
        time.sleep(3)
        job = client.proteins.get_job(fold.job_id)

    if job.status == "completed" and job.mean_plddt >= PLDDT_THRESHOLD:
        validated.append({
            "index": i + 1,
            "sequence": seq.sequence,
            "mpnn_score": seq.score,
            "recovery": seq.sequence_recovery,
            "plddt": job.mean_plddt,
            "pdb": job.pdb_string,
        })

print(f"[3/5] {len(validated)}/{NUM_DESIGNS} passed pLDDT threshold ({PLDDT_THRESHOLD})")

# ──────────────────────────────────────
# Step 4: Predict stability and solubility
# ──────────────────────────────────────
final_candidates = []
for v in validated:
    stability = client.design.stability(sequence=v["sequence"], pdb_string=v["pdb"])
    solubility = client.design.solubility(sequence=v["sequence"])

    if solubility.solubility_score >= SOLUBILITY_THRESHOLD:
        v["ddg"] = stability.predicted_ddg
        v["solubility"] = solubility.solubility_score
        final_candidates.append(v)

print(f"[4/5] {len(final_candidates)}/{len(validated)} passed solubility threshold ({SOLUBILITY_THRESHOLD})")

# ──────────────────────────────────────
# Step 5: Rank and report
# ──────────────────────────────────────
# Composite score: high pLDDT, low ddG (stabilizing), high solubility
for c in final_candidates:
    c["composite"] = c["plddt"] / 100 + (-c["ddg"]) + c["solubility"]

final_candidates.sort(key=lambda x: x["composite"], reverse=True)

print(f"\n[5/5] Final ranked candidates:")
print(f"{'Rank':<6} {'pLDDT':<8} {'ddG':<10} {'Solub':<8} {'Recovery':<10} {'Sequence (first 50)'}")
print("-" * 100)
for rank, c in enumerate(final_candidates[:5], 1):
    print(f"{rank:<6} {c['plddt']:<8.1f} {c['ddg']:<10.2f} {c['solubility']:<8.2f} "
          f"{c['recovery']:<10.1%} {c['sequence'][:50]}...")

# Save results
with open(f"{PDB_ID}_designs.json", "w") as f:
    json.dump(final_candidates, f, indent=2, default=str)
print(f"\nResults saved to {PDB_ID}_designs.json")

This pipeline typically takes 5–15 minutes for 16 designs, depending on ESMFold queue times. The output is a ranked list of experimentally testable sequences, each validated for foldability, stability, and solubility. For a 129-residue protein like lysozyme, you can expect 8–12 of the 16 designs to pass all filters – a far better starting point than random mutagenesis.
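
Because run time depends on queue times, the bare polling loops above can wait indefinitely if a job stalls. A minimal sketch of a polling helper with a timeout follows. It assumes only what the tutorial code already relies on, namely a job object with a status attribute that reads "pending" or "running" while work is in progress; the helper name and its parameters are hypothetical.

```python
import time

def wait_for_job(fetch_job, timeout_s=600.0, interval_s=3.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Call fetch_job() until its status leaves pending/running or the timeout expires."""
    deadline = clock() + timeout_s
    while True:
        job = fetch_job()
        if job.status not in ("pending", "running"):
            return job  # finished, whether completed or failed
        if clock() >= deadline:
            raise TimeoutError(f"job still {job.status} after {timeout_s} s")
        sleep(interval_s)
```

In the pipeline above, the inner while loop could then be replaced with job = wait_for_job(lambda: client.proteins.get_job(fold.job_id)). The sleep and clock arguments exist so the helper can be exercised in tests without real waiting.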

Applications: What You Can Build with ProteinMPNN

Enzyme Redesign for Industrial Applications

Industrial enzymes often need to be stabilized for non-physiological conditions: high temperature (laundry detergents), extreme pH (biofuel production), or organic solvents (chemical synthesis). ProteinMPNN can redesign the surface and core of an enzyme for improved thermostability while preserving the active site geometry through fixed-position constraints. A common strategy is to run ProteinMPNN at temperature 0.1 with the catalytic residues fixed, then screen the top designs with stability prediction (predicted ddG) before committing to experimental characterization.

Binder Design

Designing proteins that bind specific targets – for diagnostics, therapeutics, or biosensors – is one of the most impactful applications of inverse folding. The workflow starts with a target protein structure (from the PDB or ESMFold) and a computationally generated binder backbone (from RFDiffusion with hotspot conditioning). ProteinMPNN then designs the binder sequence to form favorable contacts at the interface. Multi-chain mode is essential here: hold the target chain fixed and redesign only the binder chain.

Protein Switches and Sensors

Conformational protein switches – proteins that change shape in response to a signal molecule – are valuable for biosensing and synthetic biology. ProteinMPNN can design sequences that favor one conformation while still being accessible to a second conformation. The approach involves running ProteinMPNN on both conformational states and selecting sequences that score well for both, indicating they are compatible with the conformational change.
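
The selection step for two-state designs can be sketched as ranking each sequence by its worse score across the two conformations. The example assumes you already have one score per sequence for each state, with lower scores being better as elsewhere in this tutorial. How those scores are obtained is left open, and the function name and identifiers are illustrative.

```python
def rank_two_state(scores_state_a, scores_state_b, top_n=3):
    """Rank sequences by their worse (higher) score across two states; lower is better."""
    shared = scores_state_a.keys() & scores_state_b.keys()
    ranked = sorted(
        shared,
        # Sort by the worse of the two scores, then by ID for a stable order on ties
        key=lambda seq_id: (max(scores_state_a[seq_id], scores_state_b[seq_id]), seq_id),
    )
    return ranked[:top_n]

# Made-up scores for three candidate sequences in two conformational states
state_a = {"seq1": -2.1, "seq2": -2.4, "seq3": -1.2}
state_b = {"seq1": -2.0, "seq2": -1.1, "seq3": -1.9}
print(rank_two_state(state_a, state_b))
```

Ranking by the worse score penalizes sequences such as seq2 that fit one state very well but the other poorly, which is the opposite of what a switch needs.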

Solubility and Expression Optimization

Many otherwise useful proteins are difficult to express in recombinant systems because they aggregate or form inclusion bodies. ProteinMPNN can redesign surface residues to improve solubility while leaving the core and functional sites untouched. By fixing all core residues (those with <20% solvent accessibility) and redesigning only surface positions, you maintain the fold while optimizing for expression. Combining this with SciRouter's solubility prediction endpoint provides a computational screen before any wet-lab work.
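
Turning a solvent-accessibility cutoff into a fixed-position list is a short step. The sketch below assumes per-residue relative solvent accessibility values (fraction exposed, 0 to 1) have already been computed with an external tool such as DSSP or FreeSASA; the values shown are invented and the function name is illustrative.

```python
def core_positions(relative_accessibility, threshold=0.20):
    """Residue numbers whose relative solvent accessibility is below the threshold."""
    return sorted(pos for pos, rsa in relative_accessibility.items() if rsa < threshold)

# Hypothetical per-residue values for chain A (residue number -> fraction exposed)
rsa_chain_a = {1: 0.62, 2: 0.05, 3: 0.18, 4: 0.45, 5: 0.20}

# Same shape as the fixed_positions argument used earlier in this tutorial
fixed_positions = {"A": core_positions(rsa_chain_a)}
print(fixed_positions)
```

Functional residues that happen to be surface-exposed, such as catalytic or binding-site positions, still need to be added to the list by hand.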

Tips for Getting Better ProteinMPNN Designs

  • Clean your input PDB. Remove water molecules, ligands, and alternate conformations before submitting to ProteinMPNN. The model expects one set of backbone coordinates per residue, so leftover HETATM records and duplicate alternate-location atoms can lead to parsing problems or ambiguous input.
  • Use multiple temperatures. Generate designs at T=0.1, T=0.3, and T=0.5. The best designs often come from moderate temperatures where the model balances confidence with exploration.
  • Fix known important residues. Always fix catalytic residues, disulfide cysteines, and experimentally validated binding contacts. Let ProteinMPNN redesign everything else.
  • Generate more designs than you need. Generate 50–100 designs and filter computationally. It costs minutes of compute time but saves months of failed experiments.
  • Validate every design with round-trip folding. This is not optional. A design that does not round-trip with ESMFold (pLDDT < 70) is very unlikely to fold correctly in the lab.
  • Check for problematic mutations. After ProteinMPNN outputs a sequence, scan for prolines introduced into alpha-helices (helix-breaking), glycines replaced in tight turns (strain), and hydrophobic-to-charged mutations in the core (destabilizing).
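
The first tip, cleaning the input PDB, can be sketched in a few lines of plain Python. This is a minimal example that assumes a single-model, fixed-column PDB file. It keeps ATOM records for the first alternate location and drops everything else, so modified residues stored as HETATM (selenomethionine, for example) are removed too. The function name is illustrative.

```python
def clean_pdb(pdb_text, keep_altloc="A"):
    """Keep ATOM records only, dropping waters, ligands, and extra alternate locations."""
    kept = []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        if record == "ATOM":
            altloc = line[16:17]
            if altloc not in (" ", "", keep_altloc):
                continue  # skip secondary alternate conformations
            # Blank the altLoc column so downstream parsers see a single conformer
            kept.append(line[:16] + " " + line[17:])
        elif record in ("TER", "END"):
            kept.append(line)
    return "\n".join(kept) + "\n"
```

For the lysozyme example, the cleaned text can be produced with clean_pdb(open("4lyz.pdb").read()) and passed on as pdb_string.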

Try It Now: Design Your First Protein

You can run ProteinMPNN right now through SciRouter's Protein Engineering Lab. Upload a PDB structure, set your parameters, and get designed sequences in seconds – with automatic round-trip ESMFold validation and stability/solubility scoring.

For programmatic access, install the Python SDK with pip install scirouter and use the workflow patterns shown in this tutorial. The free tier includes 5,000 API calls per month, which is enough for dozens of design-validate cycles. The API documentation has a complete reference for all parameters.

Protein design has shifted from a specialized computational biology technique to an accessible engineering discipline. With ProteinMPNN for sequence design, ESMFold for validation, and stability/solubility prediction for triage, you can go from a backbone structure to experimentally testable candidates in a single afternoon. The proteins that nature did not make are now within reach.

Frequently Asked Questions

What is the difference between ProteinMPNN and RFDiffusion?

ProteinMPNN and RFDiffusion solve different problems. ProteinMPNN takes a fixed backbone structure and designs amino acid sequences that will fold into it (inverse folding). RFDiffusion generates entirely new backbone structures from scratch using diffusion models. In practice, they are often used together: RFDiffusion creates a novel backbone, then ProteinMPNN designs a sequence for it. ProteinMPNN is also used standalone when you want to redesign an existing protein while preserving its fold.

What temperature should I use for ProteinMPNN?

Temperature controls sequence diversity. At temperature 0.1, ProteinMPNN produces very similar sequences with high confidence — ideal for maximizing sequence recovery and stability. At temperature 1.0, you get much more diverse sequences, useful for exploring sequence space and finding novel solutions. The default of 0.1 works well for most applications. Use 0.3-0.5 when you want moderate diversity, and 0.8-1.0 when you specifically need sequence variety for experimental screening.

Can ProteinMPNN design multi-chain complexes?

Yes. ProteinMPNN handles multi-chain inputs by reading all chains from the PDB structure and designing sequences that are compatible with the inter-chain interfaces. You can specify which chains to redesign and which to hold fixed. This is particularly useful for designing one half of a protein-protein interface while keeping the binding partner constant.

How do I validate ProteinMPNN designs experimentally?

The standard validation pipeline is: (1) computationally validate with round-trip folding using ESMFold or AlphaFold — if the predicted structure matches the target backbone with RMSD under 2 Angstroms, the design is likely foldable; (2) express the designed sequence in E. coli or cell-free systems; (3) characterize folding by circular dichroism and size-exclusion chromatography; (4) solve the crystal structure or cryo-EM structure to confirm the atomic-level fold matches the design.

What is sequence recovery and why does it matter?

Sequence recovery is the percentage of positions where ProteinMPNN's designed sequence matches the original native sequence. For natural proteins, ProteinMPNN achieves about 50-55% sequence recovery, compared to roughly 30% for Rosetta fixed-backbone design. High sequence recovery indicates the model has learned genuine sequence-structure relationships. However, low recovery at specific positions is not necessarily bad — it may indicate alternative amino acids that are equally compatible with the fold.

Does ProteinMPNN work for membrane proteins?

ProteinMPNN was primarily trained on soluble protein structures from the PDB, so its performance on membrane proteins is less well-validated. It can design sequences for membrane protein backbones, but it may not properly account for the lipid bilayer environment — for example, it might place polar residues in transmembrane helices. For membrane proteins, consider using specialized tools or applying manual constraints to fix lipid-facing residues as hydrophobic amino acids.
