Variant effect prediction — predicting whether a DNA change disrupts function or is benign — is one of the cleanest benchmarks for DNA foundation models. Three models define the current landscape: Evo 2 from the Arc Institute, AlphaGenome from DeepMind, and DNABERT, the older baseline that anchored early work on DNA transformers. This article lines them up side by side so you can pick the right one for your project.
The short version: Evo 2 leads on raw zero-shot capability thanks to its scale and context length, AlphaGenome is strong on regulatory regions and slightly more stable, and DNABERT is mostly a historical baseline now. All three make sense in different contexts, and sometimes combining them produces the best results.
Quick summary of the three contenders
Evo 2
- Author: Arc Institute.
- Training corpus: about 9 trillion base pairs across bacteria, archaea, phages, and eukaryotes.
- Context window: close to one million nucleotide tokens.
- Architecture: autoregressive model built on a StripedHyena-style hybrid of convolutional and attention operators, designed for very long contexts.
- Parameter scale: up to roughly 40 billion, with smaller released variants.
AlphaGenome
- Author: Google DeepMind.
- Training corpus: large aggregated genomics data with strong emphasis on regulatory annotation tracks.
- Context window: tens to hundreds of thousands of base pairs.
- Architecture: transformer-based, with supervision from functional genomics tracks as well as sequence alone.
- Parameter scale: large, comparable to other DeepMind biology models.
DNABERT
- Author: academic research group (Ji et al.), published in 2021.
- Training corpus: human genome at k-mer resolution.
- Context window: a few thousand base pairs.
- Architecture: BERT-style masked language model on k-mer tokens.
- Parameter scale: roughly a hundred million (BERT-base scale).
Head-to-head: scale and context
Scale is the most visible difference between these three models, and it matters more than most other architectural choices. Evo 2 sits in a different league on both corpus size and parameter count. AlphaGenome is a serious large model but trained with a different emphasis. DNABERT is a 2021-era baseline that was never designed to compete on scale.
Context length follows a similar pattern. Evo 2's million-token context means it can see an entire gene plus its regulatory landscape in a single forward pass. AlphaGenome handles long contexts well, though not at Evo 2's extreme. DNABERT is stuck at a few thousand base pairs, which structurally prevents it from capturing long-range regulatory effects.
Zero-shot variant effect prediction
Across published and community benchmarks, Evo 2 generally leads on zero-shot variant scoring. Its likelihood-ratio approach is simple, well-calibrated, and benefits directly from the scale of the training data. For a random variant pulled from ClinVar or a deep mutational scan, Evo 2's scores correlate more strongly with functional ground truth than either of the other two.
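The scoring rule itself is easy to state: mutate the sequence, re-score it, and take the difference in log-likelihood. Here is a minimal sketch, with a toy unigram model standing in for the real sequence log-likelihood — in practice that number would come from Evo 2's per-token log-probabilities, not from anything this simple:

```python
import math
from collections import Counter

def toy_log_likelihood(seq: str) -> float:
    """Stand-in for a model's sequence log-likelihood: sum of log base
    frequencies estimated from the sequence's own composition."""
    counts = Counter(seq)
    total = len(seq)
    return sum(math.log(counts[b] / total) for b in seq)

def variant_llr(reference: str, pos: int, alt: str,
                log_likelihood=toy_log_likelihood) -> float:
    """log P(variant) - log P(reference). Negative values mean the
    model finds the mutated sequence less plausible than the reference,
    i.e. the variant looks deleterious."""
    variant = reference[:pos] + alt + reference[pos + 1:]
    return log_likelihood(variant) - log_likelihood(reference)
```

Swapping `toy_log_likelihood` for a real model's scorer is the only change needed; the ratio itself is what gets compared against ClinVar labels or deep-mutational-scan measurements.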
AlphaGenome is close behind in most benchmarks and sometimes pulls ahead on regulatory variants specifically. The supervision from functional genomics tracks during training gives it a sharper signal for enhancer and promoter variants in particular.
DNABERT trails on zero-shot variant scoring for essentially all tasks. It was not trained to produce calibrated likelihood ratios, and its small context makes long-range effects invisible. It remains useful as a cheap baseline or when you only care about short k-mer features.
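To make the k-mer framing concrete, here is a sketch of the overlapping k-mer tokenization DNABERT trains on (k=6 in the original paper):

```python
def kmer_tokenize(seq: str, k: int = 6):
    """Split a DNA sequence into overlapping k-mers, sliding the
    window one base at a time: an n-base input yields n - k + 1 tokens."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]
```

Because tokens map almost one-to-one onto base pairs, the model's token budget is spent entirely on local sequence, which is exactly why long-range regulatory effects are out of reach.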
Regulatory elements and functional annotation
For annotating regulatory elements — promoters, enhancers, insulators — AlphaGenome has an edge because it was explicitly trained to predict functional tracks like chromatin accessibility and histone marks. Evo 2 can do this implicitly through its per-position likelihoods, and it does it well, but AlphaGenome tends to produce crisper calls when the task is strictly regulatory annotation.
Generative sequence design
Evo 2 wins on generative design because it is autoregressive. Sampling new sequences conditioned on genomic context is a natural operation for an autoregressive model. AlphaGenome is less commonly used for generation. DNABERT is not a generative model in any practical sense.
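The operation that makes this natural is next-token sampling: extend the sequence one base at a time, each draw conditioned on everything generated so far. The uniform stand-in distribution below is purely illustrative; a real run would take next-base probabilities from Evo 2 itself:

```python
import random

BASES = "ACGT"

def next_base_probs(context: str):
    """Stand-in for a model's next-token distribution (uniform here).
    A real model would condition on the full context string."""
    return {b: 0.25 for b in BASES}

def sample_sequence(prompt: str, n_new: int, seed: int = 0) -> str:
    """Autoregressive sampling loop: append n_new bases to the prompt."""
    rng = random.Random(seed)
    seq = prompt
    for _ in range(n_new):
        probs = next_base_probs(seq)
        seq += rng.choices(list(probs), weights=list(probs.values()))[0]
    return seq
```

Masked models like DNABERT have no equivalent of this loop, which is the structural reason they are not practical generators.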
Compute cost and accessibility
Compute cost is a real consideration, especially for large population studies where you might want to score millions of variants:
- DNABERT is cheap. You can run it on a single GPU and score millions of short variants quickly.
- AlphaGenome is expensive but tractable on a handful of GPUs for typical study sizes.
- Evo 2 is the most expensive, especially at full context length. Hosted APIs make this manageable, but at-scale variant scoring still requires planning.
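A back-of-envelope calculation helps when planning a run. The latencies below are placeholders, not benchmarks — measure your own setup — but the arithmetic is what matters when sizing a population study:

```python
def gpu_hours(n_variants: int, sec_per_variant: float) -> float:
    """Total GPU-hours to score a variant set, assuming one scoring
    call (a ref/alt pair) per variant."""
    return n_variants * sec_per_variant / 3600.0

# Placeholder latencies for illustration only:
cheap = gpu_hours(1_000_000, 0.01)  # a small model, ~2.8 GPU-hours
heavy = gpu_hours(1_000_000, 1.0)   # a large model, ~278 GPU-hours
```

The gap between those two numbers is the practical argument for keeping a cheap model in the loop as a pre-filter before spending large-model compute.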
Practical decision guide
Pick Evo 2 when…
- You need zero-shot variant scores anywhere in the genome.
- Long-range regulatory context matters for your variants.
- You want generative sequence design as part of the workflow.
- You are already using a hosted API and the model cost is abstracted away.
Pick AlphaGenome when…
- Your focus is specifically on regulatory elements.
- You want a model that was trained with functional annotation tracks as supervision.
- Stability and calibration matter as much as raw ranking performance.
Pick DNABERT when…
- You need a cheap baseline to compare against.
- Your task is narrow enough that a k-mer-level model is sufficient.
- You lack GPU resources for the larger models.
Running Evo 2 through SciRouter
The most practical of the three to access is Evo 2, which is available through the DNA Lab workspace. You can POST a reference sequence and a list of variants and receive log-likelihood ratios in a single response. The Evo 2 tool page documents the request and response schemas. AlphaGenome and DNABERT are not currently hosted through SciRouter — run those directly if you need them, or keep an eye on the tools index for updates.
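A sketch of what such a call might look like, using only the standard library. The endpoint URL, field names, and response key here are placeholders, not SciRouter's documented schema; the Evo 2 tool page has the real contract:

```python
import json
import urllib.request

# Placeholder endpoint -- substitute the real URL from the tool page.
EVO2_URL = "https://api.scirouter.example/dna-lab/evo2/score"

def build_payload(reference: str, variants):
    """Assemble a request body: one reference sequence plus a list of
    (position, ref_base, alt_base) variants to score in one call."""
    return {
        "sequence": reference,
        "variants": [
            {"pos": pos, "ref": ref, "alt": alt}
            for pos, ref, alt in variants
        ],
    }

def score_variants(reference, variants, url=EVO2_URL):
    """POST the payload and return per-variant log-likelihood ratios
    (assumed response key: 'log_likelihood_ratios')."""
    body = json.dumps(build_payload(reference, variants)).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["log_likelihood_ratios"]
```

Batching many variants against one reference sequence in a single request is the pattern to aim for, since it amortizes the cost of the long-context forward pass.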
Bottom line
Evo 2 is the current state of the art for zero-shot variant scoring across the genome. AlphaGenome is the strongest regulatory-focused option. DNABERT remains a useful cheap baseline. Most real projects benefit from using more than one, and the best workflows combine a DNA-level model with a protein-level predictor for coding variants.