The Protein Function Prediction Problem
Knowing a protein's sequence or structure is only half the story. The real question is: what does this protein do? Protein function prediction assigns Gene Ontology (GO) terms to proteins, covering three dimensions: molecular function (what it does biochemically), biological process (what pathway it participates in), and cellular component (where in the cell it acts). With over 50,000 GO terms in the ontology, this is one of the most complex classification problems in biology.
Three tools represent fundamentally different approaches to this problem: BioReason-Pro uses multimodal deep learning with reasoning, ProteInfer uses convolutional neural networks, and InterPro uses curated signature databases. Each has distinct strengths, and the right choice depends on your protein, your data, and whether you need explanations alongside predictions.
BioReason-Pro: Multimodal Reasoning for Function Prediction
BioReason-Pro is a next-generation protein function prediction model that combines three information sources: ESM-3 protein language model embeddings (capturing sequence and structural features), Gene Ontology graph structure (capturing relationships between functional terms), and a Qwen3-4B language model backbone that reasons over both to produce predictions with natural-language explanations.
The released wanglab/bioreason-pro-rl weights were trained with reinforcement learning from biological feedback, which aligns the model's predictions with experimental evidence from UniProt and the GO Annotation Database. As a result, BioReason-Pro does not just classify: it also reasons about why a protein likely has a particular function.
How BioReason-Pro Works
- Input: Protein amino acid sequence or DNA coding sequence (BioReason natively processes DNA via Nucleotide Transformer)
- Embedding: ESM-3 generates per-residue and pooled sequence embeddings capturing evolutionary and structural information
- GO Graph Integration: The model incorporates the hierarchical structure of the Gene Ontology, understanding that predicting a child term implies its parent terms
- Reasoning: Qwen3-4B generates a chain-of-thought reasoning trace linking sequence features to predicted GO terms
- Output: Ranked GO terms with confidence scores across all three categories, plus a natural-language explanation
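The GO graph integration step rests on the rule that a child term implies all of its ancestors. The sketch below shows that rule on a toy, simplified fragment of the molecular-function hierarchy. It is not BioReason-Pro's internal code; the `PARENTS` table, the function names, and the scores are assumptions made for illustration.

```python
# Toy, simplified slice of the GO molecular-function hierarchy (child -> parents).
# The real ontology is a much larger DAG with intermediate terms omitted here.
PARENTS = {
    "GO:0004672": ["GO:0016301"],  # protein kinase activity -> kinase activity
    "GO:0016301": ["GO:0016740"],  # kinase activity -> transferase activity
    "GO:0016740": ["GO:0003824"],  # transferase activity -> catalytic activity
    "GO:0003824": ["GO:0003674"],  # catalytic activity -> molecular_function
    "GO:0003674": [],              # root term
}


def ancestors(term, parents):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def propagate(scores, parents):
    """Make scores hierarchy-consistent: an ancestor scores at least as high as any descendant."""
    consistent = dict(scores)
    for term, score in scores.items():
        for ancestor in ancestors(term, parents):
            consistent[ancestor] = max(consistent.get(ancestor, 0.0), score)
    return consistent


# Hypothetical raw scores: the child term is confident, the parent is not.
raw = {"GO:0004672": 0.91, "GO:0003824": 0.40}
consistent = propagate(raw, PARENTS)
for go_id in sorted(consistent):
    print(go_id, consistent[go_id])
```

After propagation every ancestor of GO:0004672 carries a score of at least 0.91, so the ranked output never lists a specific term above the general terms it implies.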
ProteInfer: CNN-Based GO Term Classification
ProteInfer, developed by Google Research, takes a pure deep learning approach to protein function prediction. It uses dilated convolutional neural networks (CNNs) to process raw amino acid sequences and predict GO terms as a multi-label classification task. The model was trained on millions of sequences from UniProt with experimentally validated GO annotations.
How ProteInfer Works
- Input: Amino acid sequence only
- Architecture: Dilated CNNs with increasing receptive fields capture local and long-range sequence patterns
- Training: Multi-label classification on UniProt sequences with GO annotations, using label propagation through the GO hierarchy
- Output: Per-GO-term confidence scores for molecular function, biological process, and cellular component
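To see why dilation lets a CNN capture long-range patterns, it helps to compute the receptive field of a stack of 1D convolutions. The sketch below uses the standard receptive-field formula for stride-1 layers. The kernel size and dilation schedule are illustrative assumptions, not ProteInfer's published hyperparameters.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in residues) of stacked stride-1 1D convolutions.

    Each layer with dilation d widens the field by (kernel_size - 1) * d.
    """
    field = 1
    for dilation in dilations:
        field += (kernel_size - 1) * dilation
    return field


# Assumed toy configuration: five layers, kernel size 3.
doubling = [2 ** i for i in range(5)]   # dilations 1, 2, 4, 8, 16
plain = [1] * 5                         # same depth, no dilation

print(receptive_field(3, doubling))  # 63 residues
print(receptive_field(3, plain))     # 11 residues
```

With the same depth and parameter count, doubling the dilation at each layer lets one output position see 63 residues instead of 11, which is how early layers capture local motifs while deeper layers capture longer-range context.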
ProteInfer is fast and requires no database search, making it practical for large-scale annotation. However, it provides no explanation for its predictions — you receive confidence scores but no insight into which sequence features drove the classification. It also has a fixed vocabulary of GO terms from its training set, meaning it cannot predict newly added ontology terms without retraining.
InterPro: Signature-Based Annotation Database
InterPro is not a machine learning model — it is a comprehensive database that integrates protein signatures from 13 member databases including Pfam, PROSITE, PRINTS, CDD, SMART, and others. Each signature represents a known protein domain, family, or functional motif. InterPro maps these signatures to GO terms based on curated experimental evidence.
How InterPro Works
- Input: Protein sequence submitted to InterProScan
- Matching: The sequence is searched against all member database signatures using profile HMMs, regular expressions, and fingerprints
- Annotation Transfer: Matched signatures carry pre-assigned GO terms that are transferred to the query protein
- Coverage: InterPro covers over 240,000 protein entries across 47,000+ signatures, with annotations for the majority of UniProt
InterPro's strength is precision: if a protein matches a well-characterized domain signature, the transferred GO terms are backed by decades of experimental evidence. Its weakness is coverage of novel proteins — if no signature matches, InterPro returns nothing. This is a fundamental limitation for metagenomic data, orphan proteins, and computationally designed sequences.
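The match-then-transfer logic can be sketched with a single regular-expression signature. This is a toy stand-in, not InterProScan: real runs also use profile HMMs and fingerprints, and the `SIGNATURES` table and its GO mapping are assumptions chosen for illustration. The pattern is the Walker A (P-loop) motif written in PROSITE pattern syntax, and the converter handles only the subset of that syntax used here (no `<` or `>` anchors).

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern (subset of the syntax) into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the residue specification from an optional repeat such as (4) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = match.group(1), match.group(2)
        if core == "x":
            core = "."                      # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # any residue except those listed
        parts.append(core + ("{" + repeat + "}" if repeat else ""))
    return "".join(parts)


# Hypothetical signature table; the GO mapping is illustrative only.
SIGNATURES = {
    "P-loop motif": {
        "pattern": "[AG]-x(4)-G-K-[ST]",
        "go_terms": ["GO:0005524"],  # ATP binding
    },
}


def annotate(sequence, signatures):
    """Return one hit per matching signature, carrying its pre-assigned GO terms."""
    hits = []
    for name, signature in signatures.items():
        found = re.search(prosite_to_regex(signature["pattern"]), sequence)
        if found:
            hits.append({"signature": name, "span": found.span(),
                         "go_terms": signature["go_terms"]})
    return hits


ras_like = "MTEYKLVVVGAGGVGKSALTIQLIQ"   # short Ras-like N-terminal fragment
print(annotate(ras_like, SIGNATURES))
print(annotate("MKLLPPWWDDEE", SIGNATURES))  # no signature matches, so nothing is returned
```

The second call shows the failure mode described above: with no matching signature, the result is an empty list rather than a low-confidence guess.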
Head-to-Head Comparison
Prediction Approach
- BioReason-Pro: Hybrid ML — protein language model embeddings + GO graph structure + language model reasoning. Learns from both sequence patterns and ontology relationships.
- ProteInfer: Pure ML — dilated CNNs trained end-to-end on sequence-to-GO classification. No external knowledge beyond training data.
- InterPro: Database lookup — signature matching against curated profiles. No learning at inference time; all knowledge is pre-computed.
GO Term Coverage
- BioReason-Pro: Predicts across the full GO hierarchy for all three categories. Can assign terms not seen during training by leveraging GO graph structure.
- ProteInfer: Limited to GO terms present in training data. Strong on common terms, weaker on rare or recently added terms.
- InterPro: Covers GO terms mapped to its 47,000+ signatures. Very strong for characterized protein families but zero coverage for novel sequences with no signature matches.
Speed and Scalability
- BioReason-Pro: 10 to 30 seconds per protein on GPU (A24 class). The reasoning trace adds overhead but provides interpretability. Practical for thousands of proteins via API.
- ProteInfer: Sub-second inference on GPU. Highly scalable for proteome-wide annotation. Lightweight model suitable for batch processing.
- InterPro: InterProScan takes 1 to 10 minutes per protein depending on sequence length and database configuration. Bottlenecked by HMM searches against 13 databases. Batch processing requires significant compute.
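A quick back-of-envelope calculation turns these per-protein figures into wall-clock time for a full proteome. The sketch below assumes sequential processing with the ranges quoted above; the 0.1 second lower bound for "sub-second" and the 20,000-protein proteome size are assumptions, and batching in practice will shift the numbers.

```python
# Per-protein latency ranges in seconds, taken from the comparison above.
PER_PROTEIN_SECONDS = {
    "BioReason-Pro": (10, 30),
    "ProteInfer": (0.1, 1.0),    # "sub-second"; the 0.1 s lower bound is an assumption
    "InterProScan": (60, 600),   # 1 to 10 minutes
}


def wall_clock_hours(n_proteins, seconds_range, workers=1):
    """Return (best, worst) hours to process n_proteins spread evenly across workers."""
    low, high = seconds_range
    return (n_proteins * low / 3600 / workers,
            n_proteins * high / 3600 / workers)


N_PROTEINS = 20_000  # assumed proteome size, roughly human scale
for tool, seconds_range in PER_PROTEIN_SECONDS.items():
    best, worst = wall_clock_hours(N_PROTEINS, seconds_range)
    print(f"{tool}: {best:.1f} to {worst:.1f} hours")
```

Under these assumptions a single worker needs roughly 56 to 167 hours for BioReason-Pro and 333 to 3,333 hours for InterProScan, while ProteInfer finishes in under 6 hours. Passing `workers=8` divides each estimate by eight.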
Interpretability
- BioReason-Pro: Full reasoning traces. Each prediction includes a natural-language explanation linking sequence features to functional annotations. Suitable for publications and regulatory submissions.
- ProteInfer: Confidence scores only. No explanation of which sequence features drove the prediction. Requires post-hoc analysis (e.g., saliency maps or class activation mapping over the convolutional layers) for interpretability.
- InterPro: Implicit interpretability through domain matches. You know which signature matched, but the biological reasoning connecting domain to function is external to the tool.
Novel Protein Handling
- BioReason-Pro: Strong. ESM-3 embeddings capture general protein properties even for sequences with no known homologs. The model generalizes to orphan and designed proteins.
- ProteInfer: Moderate. CNNs can identify local sequence patterns in novel proteins, but prediction quality degrades when the protein is far from the training distribution.
- InterPro: Weak for truly novel proteins. If no signature matches, no annotation is returned. This is the primary failure mode for metagenomic and synthetic biology applications.
When to Use Each Tool
Choose BioReason-Pro When
- You need to understand why a function was predicted, not just what was predicted
- You are working with novel, orphan, or computationally designed proteins
- You need predictions across all three GO categories with reasoning traces
- Results will appear in publications, patents, or regulatory filings where interpretability matters
- You are building AI agent workflows that need structured reasoning about protein function
Choose ProteInfer When
- You need sub-second inference for proteome-scale annotation
- Speed matters more than interpretability
- You are doing initial high-throughput screening and will validate hits with other methods
- Your proteins are within the distribution of well-characterized UniProt sequences
Choose InterPro When
- Your proteins belong to well-characterized families with known domain architectures
- You need annotations backed by curated experimental evidence
- You are annotating a genome where most genes have homologs in reference databases
- Regulatory or compliance requirements demand annotations traceable to specific database entries
Using BioReason-Pro Through the SciRouter API
SciRouter provides BioReason-Pro as a hosted GPU endpoint at /v1/proteins/function. Submit a protein or DNA sequence, and receive GO term predictions with confidence scores and a reasoning trace explaining each prediction. Here is a working example:
import requests
import time

API_KEY = "sk-sci-your-api-key"
BASE = "https://api.scirouter.ai/v1"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Submit a function prediction job
response = requests.post(
    f"{BASE}/proteins/function",
    headers=headers,
    json={
        "sequence": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH",
        "include_reasoning": True,
        "organism": "human",
    },
    timeout=30,
)
response.raise_for_status()  # Stop early on HTTP errors such as a bad API key
job = response.json()
job_id = job["data"]["job_id"]
print(f"Job submitted: {job_id}")

# Poll every 3 seconds until the job completes or fails
while True:
    result = requests.get(
        f"{BASE}/proteins/function/{job_id}",
        headers=headers,
        timeout=30,
    ).json()
    status = result["data"]["status"]
    if status == "completed":
        predictions = result["data"]["predictions"]
        reasoning = result["data"]["reasoning"]
        print("\nTop GO terms predicted:")
        for pred in predictions[:5]:
            print(f"  {pred['go_id']} - {pred['name']} "
                  f"({pred['category']}, confidence: {pred['confidence']:.2f})")
        print(f"\nReasoning trace:\n{reasoning}")
        break
    elif status == "failed":
        print(f"Job failed: {result['data']['error']}")
        break
    time.sleep(3)

Using the Python SDK
from scirouter import SciRouter

client = SciRouter(api_key="sk-sci-your-api-key")

# Predict function with reasoning
result = client.proteins.function(
    sequence="MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH",
    include_reasoning=True,
    organism="human",
)

# Access predictions
for pred in result.predictions[:5]:
    print(f"{pred.go_id}: {pred.name} ({pred.category}) - {pred.confidence:.2f}")

# Access the reasoning trace
print(f"\nReasoning: {result.reasoning}")

BioReason-Pro in SciRouter Lab Pipelines
BioReason-Pro is integrated into several of SciRouter's end-to-end lab pipelines, where it provides the function annotation step:
- Drug Discovery Lab (/v1/labs/discover/evaluate): BioReason-Pro annotates the target protein's function to contextualize docking and ADMET results
- Protein Engineering Lab (/v1/labs/engineer/optimize): Function prediction before and after sequence design verifies that engineered variants retain desired activity
- Antibody Design Lab (/v1/labs/antibody/discover): Antigen function annotation guides CDR design toward therapeutically relevant epitopes
- Molecular Design Lab (/v1/labs/moldesign/generate): Target function annotation ensures generated molecules are relevant to the biological mechanism
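The before-and-after check in the Protein Engineering Lab can be approximated by comparing the confident GO terms of the original and engineered sequences. The sketch below is one plausible way to do that, not the pipeline's actual logic. It assumes prediction records with the `go_id` and `confidence` fields shown in the API example; the 0.5 threshold and the sample scores are made up.

```python
def confident_terms(predictions, threshold=0.5):
    """Collect GO IDs whose confidence meets the threshold."""
    return {p["go_id"] for p in predictions if p["confidence"] >= threshold}


def compare_function(before, after, threshold=0.5):
    """Summarize which confident GO terms were retained, lost, or gained."""
    a = confident_terms(before, threshold)
    b = confident_terms(after, threshold)
    union = a | b
    return {
        "retained": sorted(a & b),
        "lost": sorted(a - b),
        "gained": sorted(b - a),
        # Jaccard similarity of the two term sets; 1.0 when both are empty.
        "jaccard": len(a & b) / len(union) if union else 1.0,
    }


# Hypothetical predictions for a wild-type protein and a designed variant.
wild_type = [
    {"go_id": "GO:0005524", "confidence": 0.93},
    {"go_id": "GO:0004672", "confidence": 0.88},
    {"go_id": "GO:0005737", "confidence": 0.35},
]
variant = [
    {"go_id": "GO:0005524", "confidence": 0.90},
    {"go_id": "GO:0004672", "confidence": 0.41},
    {"go_id": "GO:0005634", "confidence": 0.62},
]

report = compare_function(wild_type, variant)
print(report)
```

A non-empty `lost` list, as in this example where the kinase activity term drops below the threshold, would flag the variant for closer inspection before it moves further down the pipeline.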
Practical Recommendations
Protein function prediction is not a solved problem — no single tool covers every case perfectly. Here is a practical decision framework:
- For well-studied organisms (human, mouse, E. coli): Start with InterPro for high-confidence domain annotations, then use BioReason-Pro for proteins with incomplete or missing InterPro matches
- For metagenomic or environmental sequences: Use BioReason-Pro as the primary tool — many of these proteins have no matches in signature databases
- For high-throughput screening: Use ProteInfer for fast initial annotation, then validate top hits with BioReason-Pro reasoning traces
- For publication or regulatory work: Use BioReason-Pro reasoning traces alongside InterPro domain evidence for the strongest annotation support
- For de novo designed proteins: BioReason-Pro is the only option among these three that can meaningfully predict function for synthetic sequences with no natural homologs
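The first recommendation, signature lookup first with a model fallback, can be written as a small tiered function. The sketch below takes both tools as injected callables so it stays independent of any client library. The two stand-in functions are hypothetical placeholders, not real InterProScan or BioReason-Pro calls.

```python
def annotate_tiered(sequence, signature_lookup, model_predict):
    """Prefer curated signature hits; fall back to a model when none are found."""
    go_terms = signature_lookup(sequence)
    if go_terms:
        return {"source": "signature", "go_terms": go_terms}
    return {"source": "model", "go_terms": model_predict(sequence)}


# Hypothetical stand-ins so the sketch runs without network access.
def fake_signature_lookup(sequence):
    # Pretends a domain signature matches only when this toy motif is present.
    return ["GO:0005524"] if "GKS" in sequence else []


def fake_model_predict(sequence):
    # Pretends the model always returns a single general term.
    return ["GO:0003824"]


for seq in ("MTEYKLVVVGAGGVGKSALTIQLIQ", "MKLLPPWWDDEE"):
    print(seq, annotate_tiered(seq, fake_signature_lookup, fake_model_predict))
```

Recording the `source` of each annotation keeps curated evidence distinguishable from model predictions when results are reviewed later.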
Try ESMFold for structure prediction alongside function annotation, or explore our ProteinMPNN tutorial to design sequences for predicted structures. For an end-to-end workflow combining structure, function, and design, see our AI antibody design guide.
Ready to predict protein function with reasoning? Sign up for a free SciRouter API key to access BioReason-Pro at /v1/proteins/function with 500 free credits per month — enough for 100 function predictions with full reasoning traces.