What is protein function prediction?

Protein function prediction is the task of determining what a protein does in a cell — its molecular function, the biological processes it participates in, and where in the cell it operates. This is formalized using Gene Ontology (GO) terms, a standardized vocabulary of thousands of functional labels organized into three categories: molecular function, biological process, and cellular component.

What is BioReason-Pro?

BioReason-Pro is a multimodal protein function prediction model that combines ESM-3 protein language model embeddings with Gene Ontology graph structure using a Qwen3-4B reasoning backbone. Unlike black-box classifiers, it produces natural-language reasoning traces that explain why each GO term was predicted, making its predictions interpretable and auditable.

How does ProteInfer predict function?

ProteInfer is a deep learning model from Google Research that uses dilated convolutional neural networks (CNNs) to predict GO terms directly from amino acid sequences. It was trained on millions of UniProt sequences and predicts function as a multi-label classification problem, outputting confidence scores for each GO term.

Is InterPro a prediction tool or a database?

InterPro is primarily a database and annotation platform. It integrates protein signatures (domains, families, motifs) from 13 member databases including Pfam, PROSITE, and CDD. It annotates proteins by matching sequences against known signatures using InterProScan, then transfers GO terms associated with those signatures. It does not use machine learning for prediction.

Can BioReason-Pro predict function for novel proteins?

Yes. Because BioReason-Pro uses learned representations from ESM-3 rather than database lookups, it can predict function for proteins with no known homologs or domain matches. This makes it especially valuable for metagenomic sequences, orphan proteins, and de novo designed proteins where database methods fail.

Which tool is most accurate for GO term prediction?

Accuracy depends on the protein. InterPro is highly precise for well-characterized protein families because it transfers experimentally validated annotations. BioReason-Pro achieves strong performance across all three GO categories and excels on novel or poorly characterized proteins. ProteInfer provides good coverage but may lack precision on specialized functions. For critical annotations, combining multiple tools is recommended.

The Protein Function Prediction Problem

Knowing a protein's sequence or structure is only half the story. The real question is: what does this protein do? Protein function prediction assigns Gene Ontology (GO) terms to proteins, covering three dimensions: molecular function (what it does biochemically), biological process (what pathway it participates in), and cellular component (where in the cell it acts). With tens of thousands of GO terms in the ontology, and most proteins carrying several of them at once, this is one of the most complex classification problems in biology.

Three tools represent fundamentally different approaches to this problem: BioReason-Pro uses multimodal deep learning with reasoning, ProteInfer uses convolutional neural networks, and InterPro uses curated signature databases. Each has distinct strengths, and the right choice depends on your protein, your data, and whether you need explanations alongside predictions.

BioReason-Pro: Multimodal Reasoning for Function Prediction

BioReason-Pro is a next-generation protein function prediction model that combines three information sources: ESM-3 protein language model embeddings (capturing sequence and structural features), Gene Ontology graph structure (capturing relationships between functional terms), and a Qwen3-4B language model backbone that reasons over both to produce predictions with natural-language explanations.

The wanglab/bioreason-pro-rl weights were trained using reinforcement learning from biological feedback, which aligns the model's predictions with experimental evidence from UniProt and the GO Annotation Database. As a result, BioReason-Pro does not just classify: it reasons about why a protein likely has a particular function.

How BioReason-Pro Works

Input: Protein amino acid sequence or DNA coding sequence (BioReason natively processes DNA via Nucleotide Transformer)
Embedding: ESM-3 generates per-residue and pooled sequence embeddings capturing evolutionary and structural information
GO Graph Integration: The model incorporates the hierarchical structure of the Gene Ontology, understanding that predicting a child term implies its parent terms
Reasoning: Qwen3-4B generates a chain-of-thought reasoning trace linking sequence features to predicted GO terms
Output: Ranked GO terms with confidence scores across all three categories, plus a natural-language explanation
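
The GO graph integration step relies on the ontology's rule that annotating a protein with a child term implies every ancestor of that term. A minimal sketch of that propagation follows. It assumes a hand-written three-term fragment of the molecular function hierarchy and a hypothetical helper named propagate_to_ancestors; it illustrates the rule and is not BioReason-Pro's internal code.

```python
from collections import deque

# Toy fragment of the GO molecular function hierarchy (child -> parents).
# The IDs are real GO terms, but this edge list is written by hand for
# illustration and is not parsed from the ontology.
PARENTS = {
    "GO:0016491": ["GO:0003824"],  # oxidoreductase activity -> catalytic activity
    "GO:0003824": ["GO:0003674"],  # catalytic activity -> molecular_function
    "GO:0003674": [],              # root of the molecular function branch
}


def propagate_to_ancestors(scores, parents):
    """Give every ancestor at least the score of its best-scoring descendant."""
    propagated = dict(scores)
    for term, score in scores.items():
        queue = deque(parents.get(term, []))
        seen = set()
        while queue:
            ancestor = queue.popleft()
            if ancestor in seen:
                continue
            seen.add(ancestor)
            propagated[ancestor] = max(propagated.get(ancestor, 0.0), score)
            queue.extend(parents.get(ancestor, []))
    return propagated


predicted = {"GO:0016491": 0.9}  # made-up confidence for illustration
result = propagate_to_ancestors(predicted, PARENTS)
for go_id, score in result.items():
    print(go_id, score)
```

Taking the maximum over descendants keeps the scores consistent with the hierarchy: a parent term is never rated less likely than one of its children.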

Tip

The reasoning trace is what sets BioReason-Pro apart. Instead of a black-box confidence score, you get an explanation like "The sequence contains a conserved Rossmann fold motif at positions 45-120 and a NAD-binding signature, consistent with oxidoreductase activity (GO:0016491)." This makes predictions auditable and publishable.

ProteInfer: CNN-Based GO Term Classification

ProteInfer, developed by Google Research, takes a pure deep learning approach to protein function prediction. It uses dilated convolutional neural networks (CNNs) to process raw amino acid sequences and predict GO terms as a multi-label classification task. The model was trained on millions of sequences from UniProt with experimentally validated GO annotations.

How ProteInfer Works

Input: Amino acid sequence only
Architecture: Dilated CNNs with increasing receptive fields capture local and long-range sequence patterns
Training: Multi-label classification on UniProt sequences with GO annotations, using label propagation through the GO hierarchy
Output: Per-GO-term confidence scores for molecular function, biological process, and cellular component
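
To see why dilation helps a CNN capture long-range sequence patterns, it is useful to compute the receptive field of a stack of 1D convolutions. The sketch below uses a kernel size of 3 and a doubling dilation schedule; both are illustrative assumptions and not ProteInfer's published hyperparameters.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field, in residues, of stacked stride-1 1D convolutions."""
    field = 1
    for dilation in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation residues.
        field += (kernel_size - 1) * dilation
    return field


plain = [1, 1, 1, 1, 1]      # five ordinary convolution layers
dilated = [1, 2, 4, 8, 16]   # five layers with doubling dilation

print("plain:", receptive_field(3, plain))
print("dilated:", receptive_field(3, dilated))
```

With the same number of layers and parameters, the dilated stack sees a window several times wider than the plain one, which is how such networks relate residues that are far apart in the sequence.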

ProteInfer is fast and requires no database search, making it practical for large-scale annotation. However, it provides no explanation for its predictions — you receive confidence scores but no insight into which sequence features drove the classification. It also has a fixed vocabulary of GO terms from its training set, meaning it cannot predict newly added ontology terms without retraining.

InterPro: Signature-Based Annotation Database

InterPro is not a machine learning model — it is a comprehensive database that integrates protein signatures from 13 member databases including Pfam, PROSITE, PRINTS, CDD, SMART, and others. Each signature represents a known protein domain, family, or functional motif. InterPro maps these signatures to GO terms based on curated experimental evidence.

How InterPro Works

Input: Protein sequence submitted to InterProScan
Matching: The sequence is searched against all member database signatures using profile HMMs, regular expressions, and fingerprints
Annotation Transfer: Matched signatures carry pre-assigned GO terms that are transferred to the query protein
Coverage: InterPro covers over 240,000 protein entries across 47,000+ signatures, with annotations for the majority of UniProt

InterPro's strength is precision: if a protein matches a well-characterized domain signature, the transferred GO terms are backed by decades of experimental evidence. Its weakness is coverage of novel proteins — if no signature matches, InterPro returns nothing. This is a fundamental limitation for metagenomic data, orphan proteins, and computationally designed sequences.
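
The matching and annotation transfer steps can be illustrated with the simplest signature type, a regular-expression pattern. The sketch below converts a PROSITE-style pattern for a C2H2-type zinc finger into a Python regex and transfers a GO term on a match. The signature ID, the signature-to-GO mapping, and both function names are hypothetical; InterProScan itself also runs profile HMMs and fingerprints that are not modeled here.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        element = element.strip("<>")
        # Separate the residue specification from an optional repeat, e.g. x(2,4).
        body, repeat = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element).groups()
        if body == "x":              # x means any residue
            body = "."
        elif body.startswith("{"):   # {AB} means any residue except A or B
            body = "[^" + body[1:-1] + "]"
        if repeat:
            body += "{" + repeat + "}"
        parts.append(prefix + body + suffix)
    return "".join(parts)


# Hypothetical signature table: the ID and GO mapping are illustrative only.
SIGNATURES = {
    "TOY_ZNF_C2H2": {
        "pattern": "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H",
        "go_terms": ["GO:0008270"],  # zinc ion binding
    },
}


def annotate(sequence, signatures):
    """Return one hit, with transferred GO terms, per matching signature."""
    hits = []
    for sig_id, sig in signatures.items():
        match = re.search(prosite_to_regex(sig["pattern"]), sequence)
        if match:
            hits.append({
                "signature": sig_id,
                "start": match.start() + 1,  # 1-based residue positions
                "end": match.end(),
                "go_terms": sig["go_terms"],
            })
    return hits


print(annotate("YKCPECGKSFSQKSNLQKHQRTH", SIGNATURES))    # zinc-finger-like sequence
print(annotate("MVLSPADKTNVKAAWGKVGAHAGEYG", SIGNATURES))  # no matching signature
```

The second call returns an empty list, which is the failure mode described above: without a matching signature there is nothing to transfer.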

Head-to-Head Comparison

Prediction Approach

BioReason-Pro: Hybrid ML — protein language model embeddings + GO graph structure + language model reasoning. Learns from both sequence patterns and ontology relationships.
ProteInfer: Pure ML — dilated CNNs trained end-to-end on sequence-to-GO classification. No external knowledge beyond training data.
InterPro: Database lookup — signature matching against curated profiles. No learning at inference time; all knowledge is pre-computed.

GO Term Coverage

BioReason-Pro: Predicts across the full GO hierarchy for all three categories. Can assign terms not seen during training by leveraging GO graph structure.
ProteInfer: Limited to GO terms present in training data. Strong on common terms, weaker on rare or recently added terms.
InterPro: Covers GO terms mapped to its 47,000+ signatures. Very strong for characterized protein families but zero coverage for novel sequences with no signature matches.

Speed and Scalability

BioReason-Pro: 10 to 30 seconds per protein on GPU (A24 class). The reasoning trace adds overhead but provides interpretability. Practical for thousands of proteins via API.
ProteInfer: Sub-second inference on GPU. Highly scalable for proteome-wide annotation. Lightweight model suitable for batch processing.
InterPro: InterProScan takes 1 to 10 minutes per protein depending on sequence length and database configuration. Bottlenecked by HMM searches against 13 databases. Batch processing requires significant compute.
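
The per-protein timings above translate into very different wall-clock budgets at proteome scale. The rough estimate below assumes 10,000 proteins processed one at a time, with no queueing or batching effects; the helper name and the 1-second upper bound used for "sub-second" ProteInfer inference are assumptions.

```python
def batch_hours(n_proteins, seconds_per_protein, workers=1):
    """Wall-clock hours to annotate n_proteins, split evenly across workers."""
    return n_proteins * seconds_per_protein / workers / 3600


# Per-protein latency ranges quoted in the comparison, in seconds (low, high).
LATENCY = {
    "BioReason-Pro": (10, 30),
    "ProteInfer": (0, 1),       # "sub-second", so 1 s serves as an upper bound
    "InterProScan": (60, 600),
}

for tool, (low, high) in LATENCY.items():
    print(f"{tool}: {batch_hours(10_000, low):.1f} to "
          f"{batch_hours(10_000, high):.1f} hours")
```

Passing a larger workers value shows how much parallelism is needed to bring a slower tool into the same time budget as a faster one.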

Interpretability

BioReason-Pro: Full reasoning traces. Each prediction includes a natural-language explanation linking sequence features to functional annotations. Suitable for publications and regulatory submissions.
ProteInfer: Confidence scores only. No explanation of which sequence features drove the prediction. Requires post-hoc analysis (e.g., saliency maps or class activation mapping) for interpretability.
InterPro: Implicit interpretability through domain matches. You know which signature matched, but the biological reasoning connecting domain to function is external to the tool.

Novel Protein Handling

BioReason-Pro: Strong. ESM-3 embeddings capture general protein properties even for sequences with no known homologs. The model generalizes to orphan and designed proteins.
ProteInfer: Moderate. CNNs can identify local sequence patterns in novel proteins, but prediction quality degrades when the protein is far from the training distribution.
InterPro: Weak for truly novel proteins. If no signature matches, no annotation is returned. This is the primary failure mode for metagenomic and synthetic biology applications.

When to Use Each Tool

Choose BioReason-Pro When

You need to understand why a function was predicted, not just what was predicted
You are working with novel, orphan, or computationally designed proteins
You need predictions across all three GO categories with reasoning traces
Results will appear in publications, patents, or regulatory filings where interpretability matters
You are building AI agent workflows that need structured reasoning about protein function

Choose ProteInfer When

You need sub-second inference for proteome-scale annotation
Speed matters more than interpretability
You are doing initial high-throughput screening and will validate hits with other methods
Your proteins are within the distribution of well-characterized UniProt sequences

Choose InterPro When

Your proteins belong to well-characterized families with known domain architectures
You need annotations backed by curated experimental evidence
You are annotating a genome where most genes have homologs in reference databases
Regulatory or compliance requirements demand annotations traceable to specific database entries

Note

For the highest confidence annotations, use multiple tools. Run BioReason-Pro for reasoning-backed predictions, cross-reference with InterPro for domain-based evidence, and use ProteInfer as a fast tiebreaker. SciRouter's MCP integration makes it straightforward to chain these analyses in an AI agent workflow.
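
One simple way to combine the three tools is to keep only the GO terms that at least two of them agree on. The sketch below assumes each tool's output has already been reduced to a set of GO IDs. The function name, the voting threshold, and the example predictions are illustrative and are not real tool output.

```python
from collections import defaultdict


def consensus(predictions_by_tool, min_votes=2):
    """Map each GO term to the tools that predicted it, keeping agreed terms."""
    votes = defaultdict(set)
    for tool, terms in predictions_by_tool.items():
        for term in terms:
            votes[term].add(tool)
    return {
        term: sorted(tools)
        for term, tools in votes.items()
        if len(tools) >= min_votes
    }


# Made-up predictions for a single protein.
example = {
    "bioreason_pro": {"GO:0016491", "GO:0051287"},  # oxidoreductase activity, NAD binding
    "interpro": {"GO:0016491"},
    "proteinfer": {"GO:0016491", "GO:0005737"},      # oxidoreductase activity, cytoplasm
}

agreed = consensus(example)
print(agreed)
```

Terms supported by a single tool are not necessarily wrong; they are the ones worth checking against the reasoning trace or the matched domain before use.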

Using BioReason-Pro Through the SciRouter API

SciRouter provides BioReason-Pro as a hosted GPU endpoint at /v1/proteins/function. Submit a protein or DNA sequence, and receive GO term predictions with confidence scores and a reasoning trace explaining each prediction. Here is a working example:

Predict protein function with BioReason-Pro via SciRouter

import time

import requests

API_KEY = "sk-sci-your-api-key"  # replace with your own key
BASE = "https://api.scirouter.ai/v1"
headers = {"Authorization": f"Bearer {API_KEY}"}

# Submit a function prediction job
response = requests.post(
    f"{BASE}/proteins/function",
    headers=headers,
    json={
        "sequence": "MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH",
        "include_reasoning": True,
        "organism": "human",
    },
    timeout=30,
)
response.raise_for_status()  # stop early on HTTP errors such as a bad API key
job_id = response.json()["data"]["job_id"]
print(f"Job submitted: {job_id}")

# Poll for results, giving up after about five minutes
for _ in range(100):
    poll = requests.get(
        f"{BASE}/proteins/function/{job_id}",
        headers=headers,
        timeout=30,
    )
    poll.raise_for_status()
    data = poll.json()["data"]
    if data["status"] == "completed":
        print("\nTop GO terms predicted:")
        for pred in data["predictions"][:5]:
            print(f"  {pred['go_id']} - {pred['name']} "
                  f"({pred['category']}, confidence: {pred['confidence']:.2f})")

        print(f"\nReasoning trace:\n{data['reasoning']}")
        break
    if data["status"] == "failed":
        print(f"Job failed: {data['error']}")
        break
    time.sleep(3)
else:
    print("Timed out waiting for the job to finish")

Using the Python SDK

BioReason-Pro via the SciRouter SDK

from scirouter import SciRouter

client = SciRouter(api_key="sk-sci-your-api-key")

# Predict function with reasoning
result = client.proteins.function(
    sequence="MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSH",
    include_reasoning=True,
    organism="human"
)

# Access predictions
for pred in result.predictions[:5]:
    print(f"{pred.go_id}: {pred.name} ({pred.category}) — {pred.confidence:.2f}")

# Access the reasoning trace
print(f"\nReasoning: {result.reasoning}")
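
Once predictions are returned, it is often convenient to group them by GO category and drop low-confidence terms. The sketch below works on plain dictionaries with the go_id, name, category, and confidence fields used in the examples above. The category strings, the 0.5 threshold, the helper name, and the sample values are assumptions made for illustration and are not real model output.

```python
from collections import defaultdict


def top_terms_by_category(predictions, min_confidence=0.5):
    """Group predictions by GO category, highest confidence first."""
    grouped = defaultdict(list)
    for pred in predictions:
        if pred["confidence"] >= min_confidence:
            grouped[pred["category"]].append(pred)
    for preds in grouped.values():
        preds.sort(key=lambda p: p["confidence"], reverse=True)
    return dict(grouped)


# Made-up predictions shaped like the response fields shown above.
sample = [
    {"go_id": "GO:0020037", "name": "heme binding",
     "category": "molecular_function", "confidence": 0.81},
    {"go_id": "GO:0005344", "name": "oxygen carrier activity",
     "category": "molecular_function", "confidence": 0.93},
    {"go_id": "GO:0015671", "name": "oxygen transport",
     "category": "biological_process", "confidence": 0.88},
    {"go_id": "GO:0005833", "name": "hemoglobin complex",
     "category": "cellular_component", "confidence": 0.42},
]

grouped = top_terms_by_category(sample)
for category, preds in grouped.items():
    print(category, [p["go_id"] for p in preds])
```

With SDK result objects, the same logic would apply using attribute access (for example pred.confidence) in place of dictionary keys.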

BioReason-Pro in SciRouter Lab Pipelines

BioReason-Pro is integrated into several of SciRouter's end-to-end lab pipelines, where it provides the function annotation step:

Drug Discovery Lab (/v1/labs/discover/evaluate) — BioReason-Pro annotates the target protein's function to contextualize docking and ADMET results
Protein Engineering Lab (/v1/labs/engineer/optimize) — Function prediction before and after sequence design verifies that engineered variants retain desired activity
Antibody Design Lab (/v1/labs/antibody/discover) — Antigen function annotation guides CDR design toward therapeutically relevant epitopes
Molecular Design Lab (/v1/labs/moldesign/generate) — Target function annotation ensures generated molecules are relevant to the biological mechanism
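
For the protein engineering case, checking that a variant retains the desired activity comes down to comparing two sets of predicted GO terms. A minimal sketch of that comparison is shown below. The function name and the example term sets are hypothetical, and nothing here describes how the SciRouter pipeline itself performs the check.

```python
def compare_annotations(wild_type_terms, variant_terms):
    """Summarize which predicted GO terms a variant retained, lost, or gained."""
    wild_type, variant = set(wild_type_terms), set(variant_terms)
    union = wild_type | variant
    return {
        "retained": sorted(wild_type & variant),
        "lost": sorted(wild_type - variant),
        "gained": sorted(variant - wild_type),
        # Jaccard similarity: 1.0 means identical predicted annotations.
        "jaccard": len(wild_type & variant) / len(union) if union else 1.0,
    }


# Made-up term sets for a wild-type enzyme and one engineered variant.
report = compare_annotations(
    wild_type_terms={"GO:0016491", "GO:0051287"},  # oxidoreductase activity, NAD binding
    variant_terms={"GO:0016491"},
)
print(report)
```

A non-empty "lost" list flags variants whose predicted function has drifted and that deserve a closer look before experimental testing.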

Practical Recommendations

Protein function prediction is not a solved problem — no single tool covers every case perfectly. Here is a practical decision framework:

For well-studied organisms (human, mouse, E. coli): Start with InterPro for high-confidence domain annotations, then use BioReason-Pro for proteins with incomplete or missing InterPro matches
For metagenomic or environmental sequences: Use BioReason-Pro as the primary tool — many of these proteins have no matches in signature databases
For high-throughput screening: Use ProteInfer for fast initial annotation, then validate top hits with BioReason-Pro reasoning traces
For publication or regulatory work: Use BioReason-Pro reasoning traces alongside InterPro domain evidence for the strongest annotation support
For de novo designed proteins: BioReason-Pro is the only option among these three that can meaningfully predict function for synthetic sequences with no natural homologs

Try ESMFold for structure prediction alongside function annotation, or explore our ProteinMPNN tutorial to design sequences for predicted structures. For an end-to-end workflow combining structure, function, and design, see our AI antibody design guide.

Ready to predict protein function with reasoning? Sign up for a free SciRouter API key to access BioReason-Pro at /v1/proteins/function with 500 free credits per month — enough for 100 function predictions with full reasoning traces.

BioReason-Pro vs ProteInfer vs InterPro: Protein Function Prediction Compared