Single-cell biology finally has a foundation model. Geneformer, published by Theodoris and colleagues in 2023 and refined across several updates since, is a transformer pretrained on roughly 30 million human single cells spanning dozens of tissues and developmental stages. It has become one of the reference models for anyone doing computational single-cell work, and it fits cleanly into the same mental slot that ESM-2 and AlphaFold occupy on the protein side.
This guide explains what Geneformer actually is, how its unusual gene-rank representation works, what you can do with it zero-shot, when fine-tuning pays off, and how to get started through the SciRouter Cell Atlas without spinning up your own GPUs.
The core idea: genes as tokens, cells as documents
Traditional single-cell pipelines treat each cell as a vector of raw or normalized counts over roughly 20,000 genes. That is a very high-dimensional and very sparse representation — most genes are zero in any given cell, and the non-zero values are heavily skewed by technical noise. Classical methods spend a lot of effort on normalization, batch correction, and feature selection just to get a clean matrix before any downstream analysis can begin.
Geneformer side-steps that pipeline with a clever reframing. For each cell, it ranks the genes by their expression level, normalized against a reference distribution of median expression across the pretraining corpus. The model then receives only the top-ranked genes as an ordered sequence of tokens. No explicit counts, no batch scaling, just “which genes dominate this cell, and in what order?”
This representation does several things at once. It naturally normalizes away library size and sequencing depth. It focuses the model's attention on the informative tail of the expression distribution instead of the long flat floor. And it reframes a cell as an ordered token sequence — exactly the data shape that transformers were designed for.
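As a concrete illustration, here is a minimal numpy sketch of rank-value encoding. The helper name and the normalization details are simplified assumptions for illustration, not the official Geneformer tokenizer, which also handles library-size scaling and a fixed gene vocabulary.

```python
import numpy as np

def rank_encode(counts, gene_ids, median_expr, max_len=2048):
    # Hypothetical helper sketching Geneformer-style rank-value encoding.
    nonzero = counts > 0
    scores = np.zeros_like(counts, dtype=float)
    # Dividing by each gene's corpus-wide median keeps ubiquitously
    # high genes from dominating every cell's ranking.
    scores[nonzero] = counts[nonzero] / median_expr[nonzero]
    order = np.argsort(-scores[nonzero])
    ranked = [gene_ids[i] for i in np.flatnonzero(nonzero)[order]]
    return ranked[:max_len]  # the ordered token sequence fed to the model

cell = np.array([0, 5, 2, 0, 9])               # raw counts for one cell
medians = np.array([1.0, 10.0, 1.0, 1.0, 3.0])  # reference medians per gene
genes = ["G0", "G1", "G2", "G3", "G4"]
print(rank_encode(cell, genes, medians))  # ['G4', 'G2', 'G1']
```

Note that dividing every count in a cell by the cell's total count would not change the ordering, which is exactly why this representation is robust to sequencing depth.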
The training corpus: Genecorpus-30M
Geneformer's pretraining corpus, called Genecorpus-30M, is assembled from hundreds of public scRNA-seq datasets. The curators filtered for human cells, removed low-quality barcodes, and normalized gene identifiers to a single reference. The result is a corpus that spans embryonic development, immune cells, brain, liver, kidney, heart, and many tumor contexts.
Scale matters here for the same reason it matters in language models. The more cells and tissues a foundation model sees during pretraining, the more contextual biology it can encode in its hidden states. A cell embedded by Geneformer carries implicit knowledge about how its transcriptional program relates to thousands of other cells the model has seen before — not just the cells in your experiment.
What Geneformer actually learns
Pretraining uses a masked language modeling objective adapted for biology. Some genes in the ranked sequence are masked, and the model has to predict them from context. To succeed, it has to learn the co-expression structure of the human transcriptome — which genes travel together, which gene modules define particular cell states, and which regulators sit at the top of those modules.
Three kinds of structure fall out of this:
- Cell-type structure. Cells of the same type end up close together in the model's latent space, even across donors, labs, and chemistry versions.
- Gene-network structure. Genes that the model learns to predict from each other form soft co-expression clusters that map onto known pathways.
- Regulator importance. The attention weights the model assigns to genes — especially transcription factors — give a sensible ranking of regulatory influence in a given cell context.
Zero-shot use cases
The most immediately useful thing about Geneformer is how much you can do without any fine-tuning. Four zero-shot tasks are worth highlighting:
Cell embedding and nearest-neighbor annotation
Push a cell through the model and pool its token-level hidden states into a single cell embedding (newer Geneformer releases also expose a CLS token for this). The resulting vector lives in the same latent space as every cell in the pretraining corpus. Nearest-neighbor lookup against a labeled reference atlas produces a zero-shot cell-type call, often with accuracy comparable to dataset-specific classifiers.
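The lookup step is ordinary nearest-neighbor classification. A minimal sketch, using toy 2-D vectors as stand-ins for real Geneformer embeddings:

```python
import numpy as np

def annotate(query_emb, ref_embs, ref_labels, k=5):
    # Majority vote over the k nearest reference cells by cosine
    # similarity; the embeddings stand in for Geneformer cell vectors.
    q = query_emb / np.linalg.norm(query_emb)
    r = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    top = np.argsort(-(r @ q))[:k]
    votes = [ref_labels[i] for i in top]
    return max(set(votes), key=votes.count)

# Toy "embeddings" for a labeled reference atlas.
ref = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 1.0]])
labels = ["T cell", "T cell", "B cell", "B cell", "B cell"]
print(annotate(np.array([0.95, 0.05]), ref, labels, k=3))  # T cell
```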
In-silico gene deletion
Remove a gene from the ranked input and re-embed the cell. The difference between the original and perturbed embeddings is a proxy for the functional importance of that gene in that cell context. This is the basis of “in silico perturbation” workflows used to nominate disease genes or drug targets.
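The perturbation loop reduces to three steps: embed, delete, re-embed. A sketch with a toy embedder standing in for the Geneformer forward pass (the real workflow runs the transformer on the edited token sequence):

```python
import numpy as np

def deletion_impact(tokens, gene, embed_fn):
    # Drop one gene from the ranked input, re-embed, and measure the
    # cosine shift; embed_fn stands in for a Geneformer forward pass.
    base = embed_fn(tokens)
    pert = embed_fn([t for t in tokens if t != gene])
    cos = base @ pert / (np.linalg.norm(base) * np.linalg.norm(pert))
    return 1.0 - cos  # larger shift = more important gene in this context

# Toy embedder: each gene contributes a fixed vector to the cell state.
vecs = {"A": np.array([1.0, 0.0]),
        "B": np.array([0.0, 1.0]),
        "C": np.array([0.1, 0.1])}
toy_embed = lambda toks: sum(vecs[t] for t in toks)

# Deleting "A" moves the embedding much more than deleting "C".
print(deletion_impact(["A", "B", "C"], "A", toy_embed) >
      deletion_impact(["A", "B", "C"], "C", toy_embed))  # True
```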
Transcription factor ranking
Attention-based importance scores over the input sequence give a principled ranking of which genes the model is leaning on to describe a given cell. Restricting this ranking to known transcription factors produces a candidate list of master regulators.
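One simple way to turn attention maps into such a ranking is to score each gene by the attention it receives, averaged over heads and query positions. This aggregation scheme is an illustrative choice, not the exact procedure from the Geneformer paper:

```python
import numpy as np

def rank_regulators(tokens, attn, tf_set):
    # attn[h, i, j] = attention from query position i to key position j.
    # Score each input gene by mean attention received, then keep only
    # known transcription factors.
    received = attn.mean(axis=(0, 1))
    scores = {g: s for g, s in zip(tokens, received) if g in tf_set}
    return sorted(scores, key=scores.get, reverse=True)

tokens = ["TF1", "G1", "TF2"]
attn = np.array([[[0.1, 0.1, 0.8],   # a single toy attention head
                  [0.2, 0.2, 0.6],
                  [0.3, 0.3, 0.4]]])
print(rank_regulators(tokens, attn, {"TF1", "TF2"}))  # ['TF2', 'TF1']
```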
Cross-dataset batch integration
Because the rank-based input strips away library size and chemistry effects, Geneformer embeddings are remarkably robust across datasets. You can embed two independently collected atlases and mix them in the same latent space without a heavy batch-correction step.
When fine-tuning pays off
Zero-shot performance is strong, but fine-tuning still helps when you care about a narrow prediction task. Two patterns work well:
- Classification heads. Freeze the Geneformer backbone and train a small linear or MLP head on your labeled cells. This is the standard transfer-learning recipe and it is cheap.
- LoRA adapters. For more aggressive adaptation — say, predicting disease state from a rare tissue not well-represented in the pretraining corpus — LoRA adapters let you update a small fraction of parameters while keeping most of the pretrained weights fixed.
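The frozen-backbone recipe from the first bullet can be sketched end to end with plain numpy, treating precomputed Geneformer embeddings as fixed features; in practice you would use a torch linear layer, but the logic is the same:

```python
import numpy as np

def train_head(embs, labels, n_classes, lr=0.5, steps=200, seed=0):
    # Fit a small softmax head on frozen embeddings by gradient descent;
    # a numpy stand-in for the usual linear layer + cross-entropy loss.
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(embs.shape[1], n_classes))
    y = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = embs @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * embs.T @ (p - y) / len(embs)  # cross-entropy gradient
    return W

def predict(embs, W):
    return (embs @ W).argmax(axis=1)

# Toy "embeddings" for four labeled cells, two per class.
embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
W = train_head(embs, labels, n_classes=2)
print(predict(embs, W))  # [0 0 1 1]
```

Because the backbone never receives gradients, this trains in seconds even on a laptop, which is what makes the recipe so cheap.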
Limits and honest caveats
Geneformer is a strong model, but it is not magic. A few caveats are worth knowing before you deploy it in a real project:
- Human-only. The pretraining corpus is human. Mouse and other model organisms work through ortholog mapping, but the latent space is tuned to human biology.
- Rank input is lossy. The rank-based representation throws away absolute expression levels. If your downstream task genuinely depends on magnitude — for example, quantitative drug-dose response modeling — you may want to pair Geneformer with a count-based model.
- Attention is not causation. Attention-based gene importance scores are informative but they are a correlation-based signal, not a causal claim. Follow up candidates with wet-lab validation or an independent causal method.
How to run Geneformer on SciRouter
The fastest way to try Geneformer without setting up GPU containers is through the Cell Atlas workspace. You can upload a sparse expression matrix, pick Geneformer as the backbone, and get embeddings, cell-type calls, and marker genes back in a single call. The same endpoint is available over REST for programmatic use, and there is an entry in the Geneformer tool page that documents the request and response schemas.
If you want to compare Geneformer against scGPT or other single-cell foundation models, both live behind the same API. Point your code at a different model name and rerun the same pipeline — that is the whole point of a gateway.
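As a sketch of what a programmatic call might look like, here is one way to assemble a request body. The field names and values are assumptions for illustration only; check the Geneformer tool page for the documented request and response schemas before relying on them.

```python
import json

def build_request(matrix_path, model="geneformer", outputs=None):
    # Hypothetical request body; the field names are illustrative,
    # not the documented Cell Atlas schema.
    return json.dumps({
        "model": model,  # swap to another backbone to rerun the pipeline
        "input": {"matrix": matrix_path, "format": "mtx"},
        "outputs": outputs or ["embeddings", "cell_types", "marker_genes"],
    })

body = build_request("my_experiment.mtx")
```

Swapping `model` is the only change needed to benchmark one backbone against another over the same endpoint.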
Bottom line
Geneformer is the clearest demonstration so far that the foundation model playbook transfers cleanly from language to single-cell biology. Its gene-rank representation is a clever piece of engineering, its zero-shot capabilities are genuinely useful, and fine-tuning costs are modest. If you work with scRNA-seq data, it belongs in your toolbox.