Variant effect prediction — predicting whether a DNA change disrupts function or is benign — is one of the cleanest benchmarks for DNA foundation models. Three models define the current landscape: Evo 2 from the Arc Institute, AlphaGenome from DeepMind, and DNABERT, the older baseline that anchored early work on DNA transformers. This article lines them up side by side so you can pick the right one for your project.
The short version: Evo 2 leads on raw zero-shot capability thanks to its scale and context length, AlphaGenome is strong on regulatory regions and slightly more stable, and DNABERT is mostly a historical baseline now. All three make sense in different contexts, and sometimes combining them produces the best results.
Quick summary of the three contenders
Evo 2
- Author: Arc Institute.
- Training corpus: about 9 trillion base pairs across bacteria, archaea, phages, and eukaryotes.
- Context window: close to one million nucleotide tokens.
- Architecture: autoregressive model built on a StripedHyena-style hybrid of convolutional and attention operators, designed for very long contexts.
- Parameter scale: up to roughly 40 billion, with smaller released variants.
AlphaGenome
- Author: Google DeepMind.
- Training corpus: large aggregated genomics data with strong emphasis on regulatory annotation tracks.
- Context window: tens to hundreds of thousands of base pairs.
- Architecture: transformer-based, with supervision from functional genomics tracks as well as sequence alone.
- Parameter scale: large, comparable to other DeepMind biology models.
DNABERT
- Author: academic research group (Ji et al.), published in 2021.
- Training corpus: human genome at k-mer resolution.
- Context window: a few thousand base pairs.
- Architecture: BERT-style masked language model on k-mer tokens.
- Parameter scale: roughly a hundred million (BERT-base scale).
Head-to-head: scale and context
Scale is the most visible difference between these three models, and it matters more than most other architectural choices. Evo 2 sits in a different league on both corpus size and parameter count. AlphaGenome is a serious large model but trained with a different emphasis. DNABERT is a 2021-era baseline that was never designed to compete on scale.
Context length follows a similar pattern. Evo 2's million-token context means it can see an entire gene plus its regulatory landscape in a single forward pass. AlphaGenome handles long contexts well, though not at Evo 2's extreme. DNABERT is stuck at a few thousand base pairs, which structurally prevents it from capturing long-range regulatory effects.
Zero-shot variant effect prediction
Across published and community benchmarks, Evo 2 generally leads on zero-shot variant scoring. Its likelihood-ratio approach is simple, well-calibrated, and benefits directly from the scale of the training data. For a random variant pulled from ClinVar or a deep mutational scan, Evo 2's scores correlate more strongly with functional ground truth than either of the other two.
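The scoring rule itself is easy to state: mutate the sequence, re-score it, and take the difference in log-likelihood. Here is a minimal sketch, with a toy unigram model standing in for the real sequence log-likelihood — in practice that number would come from Evo 2's per-token log-probabilities, not from anything this simple:

```python
import math
from collections import Counter

def toy_log_likelihood(seq: str) -> float:
    """Stand-in for a model's sequence log-likelihood: sum of log base
    frequencies estimated from the sequence's own composition."""
    counts = Counter(seq)
    total = len(seq)
    return sum(math.log(counts[b] / total) for b in seq)

def variant_llr(reference: str, pos: int, alt: str,
                log_likelihood=toy_log_likelihood) -> float:
    """log P(variant) - log P(reference). Negative values mean the
    model finds the mutated sequence less plausible than the reference,
    i.e. the variant looks deleterious."""
    variant = reference[:pos] + alt + reference[pos + 1:]
    return log_likelihood(variant) - log_likelihood(reference)
```

Swapping `toy_log_likelihood` for a real model's scorer is the only change needed; the ratio itself is what gets compared against ClinVar labels or deep-mutational-scan measurements.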
AlphaGenome is close behind in most benchmarks and sometimes pulls ahead on regulatory variants specifically. The supervision from functional genomics tracks during training gives it a sharper signal for enhancer and promoter variants in particular.
DNABERT trails on zero-shot variant scoring for essentially all tasks. It was not trained to produce calibrated likelihood ratios, and its small context makes long-range effects invisible. It remains useful as a cheap baseline or when you only care about short k-mer features.
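To make the k-mer framing concrete, here is a sketch of the overlapping k-mer tokenization DNABERT trains on (k=6 in the original paper):

```python
def kmer_tokenize(seq: str, k: int = 6):
    """Split a DNA sequence into overlapping k-mers, sliding the
    window one base at a time: an n-base input yields n - k + 1 tokens."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]
```

Because tokens map almost one-to-one onto base pairs, the model's token budget is spent entirely on local sequence, which is exactly why long-range regulatory effects are out of reach.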
Regulatory elements and functional annotation
For annotating regulatory elements — promoters, enhancers, insulators — AlphaGenome has an edge because it was explicitly trained to predict functional tracks like chromatin accessibility and histone marks. Evo 2 can do this implicitly through its per-position likelihoods, and it does it well, but AlphaGenome tends to produce crisper calls when the task is strictly regulatory annotation.
Generative sequence design
Evo 2 wins on generative design because it is autoregressive. Sampling new sequences conditioned on genomic context is a natural operation for an autoregressive model. AlphaGenome is less commonly used for generation. DNABERT is not a generative model in any practical sense.
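The operation that makes this natural is next-token sampling: extend the sequence one base at a time, each draw conditioned on everything generated so far. The uniform stand-in distribution below is purely illustrative; a real run would take next-base probabilities from Evo 2 itself:

```python
import random

BASES = "ACGT"

def next_base_probs(context: str):
    """Stand-in for a model's next-token distribution (uniform here).
    A real model would condition on the full context string."""
    return {b: 0.25 for b in BASES}

def sample_sequence(prompt: str, n_new: int, seed: int = 0) -> str:
    """Autoregressive sampling loop: append n_new bases to the prompt."""
    rng = random.Random(seed)
    seq = prompt
    for _ in range(n_new):
        probs = next_base_probs(seq)
        seq += rng.choices(list(probs), weights=list(probs.values()))[0]
    return seq
```

Masked models like DNABERT have no equivalent of this loop, which is the structural reason they are not practical generators.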
Compute cost and accessibility
Compute cost is a real consideration, especially for large population studies where you might want to score millions of variants:
- DNABERT is cheap. You can run it on a single GPU and score millions of short variants quickly.
- AlphaGenome is expensive but tractable on a handful of GPUs for typical study sizes.
- Evo 2 is the most expensive, especially at full context length. Hosted APIs make this manageable, but at-scale variant scoring still requires planning.
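A back-of-envelope calculation helps when planning a run. The latencies below are placeholders, not benchmarks — measure your own setup — but the arithmetic is what matters when sizing a population study:

```python
def gpu_hours(n_variants: int, sec_per_variant: float) -> float:
    """Total GPU-hours to score a variant set, assuming one scoring
    call (a ref/alt pair) per variant."""
    return n_variants * sec_per_variant / 3600.0

# Placeholder latencies for illustration only:
cheap = gpu_hours(1_000_000, 0.01)  # a small model, ~2.8 GPU-hours
heavy = gpu_hours(1_000_000, 1.0)   # a large model, ~278 GPU-hours
```

The gap between those two numbers is the practical argument for keeping a cheap model in the loop as a pre-filter before spending large-model compute.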
Practical decision guide
Pick Evo 2 when…
- You need zero-shot variant scores anywhere in the genome.
- Long-range regulatory context matters for your variants.
- You want generative sequence design as part of the workflow.
- You are already using a hosted API and the model cost is abstracted away.
Pick AlphaGenome when…
- Your focus is specifically on regulatory elements.
- You want a model that was trained with functional annotation tracks as supervision.
- Stability and calibration matter as much as raw ranking performance.
Pick DNABERT when…
- You need a cheap baseline to compare against.
- Your task is narrow enough that a k-mer-level model is sufficient.
- You lack GPU resources for the larger models.
Running Evo 2 through SciRouter
The most practical of the three to access is Evo 2, which is available through the DNA Lab workspace. You can POST a reference sequence and a list of variants and receive log-likelihood ratios in a single response. The Evo 2 tool page documents the request and response schemas. AlphaGenome and DNABERT are not currently hosted through SciRouter — run those directly if you need them, or keep an eye on the tools index for updates.
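A sketch of what such a call might look like, using only the standard library. The endpoint URL, field names, and response key here are placeholders, not SciRouter's documented schema; the Evo 2 tool page has the real contract:

```python
import json
import urllib.request

# Placeholder endpoint -- substitute the real URL from the tool page.
EVO2_URL = "https://api.scirouter.example/dna-lab/evo2/score"

def build_payload(reference: str, variants):
    """Assemble a request body: one reference sequence plus a list of
    (position, ref_base, alt_base) variants to score in one call."""
    return {
        "sequence": reference,
        "variants": [
            {"pos": pos, "ref": ref, "alt": alt}
            for pos, ref, alt in variants
        ],
    }

def score_variants(reference, variants, url=EVO2_URL):
    """POST the payload and return per-variant log-likelihood ratios
    (assumed response key: 'log_likelihood_ratios')."""
    body = json.dumps(build_payload(reference, variants)).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["log_likelihood_ratios"]
```

Batching many variants against one reference sequence in a single request is the pattern to aim for, since it amortizes the cost of the long-context forward pass.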
Bottom line
Evo 2 is the current state of the art for zero-shot variant scoring across the genome. AlphaGenome is the strongest regulatory-focused option. DNABERT remains a useful cheap baseline. Most real projects benefit from using more than one, and the best workflows combine a DNA-level model with a protein-level predictor for coding variants.