Single-cell biology finally has a foundation model. Geneformer, published by Theodoris and colleagues in 2023 and refined across several updates since, is a transformer pretrained on roughly 30 million human single cells spanning dozens of tissues and developmental stages. It has become one of the reference models for anyone doing computational single-cell work, and it fits cleanly into the same mental slot that ESM-2 and AlphaFold occupy on the protein side.
This guide explains what Geneformer actually is, how its unusual gene-rank representation works, what you can do with it zero-shot, when fine-tuning pays off, and how to get started through the SciRouter Cell Atlas without spinning up your own GPUs.
The core idea: genes as tokens, cells as documents
Traditional single-cell pipelines treat each cell as a vector of raw or normalized counts over roughly 20,000 genes. That is a very high-dimensional and very sparse representation — most genes are zero in any given cell, and the non-zero values are heavily skewed by technical noise. Classical methods spend a lot of effort on normalization, batch correction, and feature selection just to get a clean matrix before any downstream analysis can begin.
Geneformer side-steps that pipeline with a clever reframing. For each cell, it ranks the genes by their expression level, normalized against a reference distribution of median expression across the pretraining corpus. The model then receives only the top-ranked genes as an ordered sequence of tokens. No explicit counts, no batch scaling, just “which genes dominate this cell, and in what order?”
This representation does several things at once. It naturally normalizes away library size and sequencing depth. It focuses the model's attention on the informative tail of the expression distribution instead of the long flat floor. And it reframes a cell as an ordered token sequence — exactly the data shape that transformers were designed for.
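As a concrete illustration, here is a minimal numpy sketch of rank-value encoding. The helper name and the normalization details are simplified assumptions for illustration, not the official Geneformer tokenizer, which also handles library-size scaling and a fixed gene vocabulary.

```python
import numpy as np

def rank_encode(counts, gene_ids, median_expr, max_len=2048):
    # Hypothetical helper sketching Geneformer-style rank-value encoding.
    nonzero = counts > 0
    scores = np.zeros_like(counts, dtype=float)
    # Dividing by each gene's corpus-wide median keeps ubiquitously
    # high genes from dominating every cell's ranking.
    scores[nonzero] = counts[nonzero] / median_expr[nonzero]
    order = np.argsort(-scores[nonzero])
    ranked = [gene_ids[i] for i in np.flatnonzero(nonzero)[order]]
    return ranked[:max_len]  # the ordered token sequence fed to the model

cell = np.array([0, 5, 2, 0, 9])               # raw counts for one cell
medians = np.array([1.0, 10.0, 1.0, 1.0, 3.0])  # reference medians per gene
genes = ["G0", "G1", "G2", "G3", "G4"]
print(rank_encode(cell, genes, medians))  # ['G4', 'G2', 'G1']
```

Note that dividing every count in a cell by the cell's total count would not change the ordering, which is exactly why this representation is robust to sequencing depth.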
The training corpus: Genecorpus-30M
Geneformer's pretraining corpus, called Genecorpus-30M, is assembled from hundreds of public scRNA-seq datasets. The curators filtered for human cells, removed low-quality barcodes, and normalized gene identifiers to a single reference. The result is a corpus that spans embryonic development, immune cells, brain, liver, kidney, heart, and many tumor contexts.
Scale matters here for the same reason it matters in language models. The more cells and tissues a foundation model sees during pretraining, the more contextual biology it can encode in its hidden states. A cell embedded by Geneformer carries implicit knowledge about how its transcriptional program relates to thousands of other cells the model has seen before — not just the cells in your experiment.
What Geneformer actually learns
Pretraining uses a masked language modeling objective adapted for biology. Some genes in the ranked sequence are masked, and the model has to predict them from context. To succeed, it has to learn the co-expression structure of the human transcriptome — which genes travel together, which gene modules define particular cell states, and which regulators sit at the top of those modules.
Three kinds of structure fall out of this:
- Cell-type structure. Cells of the same type end up close together in the model's latent space, even across donors, labs, and chemistry versions.
- Gene-network structure. Genes that the model learns to predict from each other form soft co-expression clusters that map onto known pathways.
- Regulator importance. The attention weights the model assigns to genes — especially transcription factors — give a sensible ranking of regulatory influence in a given cell context.
Zero-shot use cases
The most immediately useful thing about Geneformer is how much you can do without any fine-tuning. Four zero-shot tasks are worth highlighting:
Cell embedding and nearest-neighbor annotation
Push a cell through the model and pool its token-level hidden states into a single cell embedding (newer Geneformer releases also expose a CLS token for this). The resulting vector lives in the same latent space as every cell in the pretraining corpus. Nearest-neighbor lookup against a labeled reference atlas produces a zero-shot cell-type call, often with accuracy comparable to dataset-specific classifiers.
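The lookup step is ordinary nearest-neighbor classification. A minimal sketch, using toy 2-D vectors as stand-ins for real Geneformer embeddings:

```python
import numpy as np

def annotate(query_emb, ref_embs, ref_labels, k=5):
    # Majority vote over the k nearest reference cells by cosine
    # similarity; the embeddings stand in for Geneformer cell vectors.
    q = query_emb / np.linalg.norm(query_emb)
    r = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    top = np.argsort(-(r @ q))[:k]
    votes = [ref_labels[i] for i in top]
    return max(set(votes), key=votes.count)

# Toy "embeddings" for a labeled reference atlas.
ref = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.2, 1.0]])
labels = ["T cell", "T cell", "B cell", "B cell", "B cell"]
print(annotate(np.array([0.95, 0.05]), ref, labels, k=3))  # T cell
```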
In-silico gene deletion
Remove a gene from the ranked input and re-embed the cell. The difference between the original and perturbed embeddings is a proxy for the functional importance of that gene in that cell context. This is the basis of “in silico perturbation” workflows used to nominate disease genes or drug targets.
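The perturbation loop reduces to three steps: embed, delete, re-embed. A sketch with a toy embedder standing in for the Geneformer forward pass (the real workflow runs the transformer on the edited token sequence):

```python
import numpy as np

def deletion_impact(tokens, gene, embed_fn):
    # Drop one gene from the ranked input, re-embed, and measure the
    # cosine shift; embed_fn stands in for a Geneformer forward pass.
    base = embed_fn(tokens)
    pert = embed_fn([t for t in tokens if t != gene])
    cos = base @ pert / (np.linalg.norm(base) * np.linalg.norm(pert))
    return 1.0 - cos  # larger shift = more important gene in this context

# Toy embedder: each gene contributes a fixed vector to the cell state.
vecs = {"A": np.array([1.0, 0.0]),
        "B": np.array([0.0, 1.0]),
        "C": np.array([0.1, 0.1])}
toy_embed = lambda toks: sum(vecs[t] for t in toks)

# Deleting "A" moves the embedding much more than deleting "C".
print(deletion_impact(["A", "B", "C"], "A", toy_embed) >
      deletion_impact(["A", "B", "C"], "C", toy_embed))  # True
```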
Transcription factor ranking
Attention-based importance scores over the input sequence give a principled ranking of which genes the model is leaning on to describe a given cell. Restricting this ranking to known transcription factors produces a candidate list of master regulators.
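One simple way to turn attention maps into such a ranking is to score each gene by the attention it receives, averaged over heads and query positions. This aggregation scheme is an illustrative choice, not the exact procedure from the Geneformer paper:

```python
import numpy as np

def rank_regulators(tokens, attn, tf_set):
    # attn[h, i, j] = attention from query position i to key position j.
    # Score each input gene by mean attention received, then keep only
    # known transcription factors.
    received = attn.mean(axis=(0, 1))
    scores = {g: s for g, s in zip(tokens, received) if g in tf_set}
    return sorted(scores, key=scores.get, reverse=True)

tokens = ["TF1", "G1", "TF2"]
attn = np.array([[[0.1, 0.1, 0.8],   # a single toy attention head
                  [0.2, 0.2, 0.6],
                  [0.3, 0.3, 0.4]]])
print(rank_regulators(tokens, attn, {"TF1", "TF2"}))  # ['TF2', 'TF1']
```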
Cross-dataset batch integration
Because the rank-based input strips away library size and chemistry effects, Geneformer embeddings are remarkably robust across datasets. You can embed two independently collected atlases and mix them in the same latent space without a heavy batch-correction step.
When fine-tuning pays off
Zero-shot performance is strong, but fine-tuning still helps when you care about a narrow prediction task. Two patterns work well:
- Classification heads. Freeze the Geneformer backbone and train a small linear or MLP head on your labeled cells. This is the standard transfer-learning recipe and it is cheap.
- LoRA adapters. For more aggressive adaptation — say, predicting disease state from a rare tissue not well-represented in the pretraining corpus — LoRA adapters let you update a small fraction of parameters while keeping most of the pretrained weights fixed.
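The frozen-backbone recipe from the first bullet can be sketched end to end with plain numpy, treating precomputed Geneformer embeddings as fixed features; in practice you would use a torch linear layer, but the logic is the same:

```python
import numpy as np

def train_head(embs, labels, n_classes, lr=0.5, steps=200, seed=0):
    # Fit a small softmax head on frozen embeddings by gradient descent;
    # a numpy stand-in for the usual linear layer + cross-entropy loss.
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(embs.shape[1], n_classes))
    y = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = embs @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * embs.T @ (p - y) / len(embs)  # cross-entropy gradient
    return W

def predict(embs, W):
    return (embs @ W).argmax(axis=1)

# Toy "embeddings" for four labeled cells, two per class.
embs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
W = train_head(embs, labels, n_classes=2)
print(predict(embs, W))  # [0 0 1 1]
```

Because the backbone never receives gradients, this trains in seconds even on a laptop, which is what makes the recipe so cheap.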
Limits and honest caveats
Geneformer is a strong model, but it is not magic. A few caveats are worth knowing before you deploy it in a real project:
- Human-only. The pretraining corpus is human. Mouse and other model organisms work through ortholog mapping, but the latent space is tuned to human biology.
- Rank input is lossy. The rank-based representation throws away absolute expression levels. If your downstream task genuinely depends on magnitude — for example, quantitative drug-dose response modeling — you may want to pair Geneformer with a count-based model.
- Attention is not causation. Attention-based gene importance scores are informative but they are a correlation-based signal, not a causal claim. Follow up candidates with wet-lab validation or an independent causal method.
How to run Geneformer on SciRouter
The fastest way to try Geneformer without setting up GPU containers is through the Cell Atlas workspace. You can upload a sparse expression matrix, pick Geneformer as the backbone, and get embeddings, cell-type calls, and marker genes back in a single call. The same endpoint is available over REST for programmatic use, and there is an entry in the Geneformer tool page that documents the request and response schemas.
If you want to compare Geneformer against scGPT or other single-cell foundation models, both live behind the same API. Point your code at a different model name and rerun the same pipeline — that is the whole point of a gateway.
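As a sketch of what a programmatic call might look like, here is one way to assemble a request body. The field names and values are assumptions for illustration only; check the Geneformer tool page for the documented request and response schemas before relying on them.

```python
import json

def build_request(matrix_path, model="geneformer", outputs=None):
    # Hypothetical request body; the field names are illustrative,
    # not the documented Cell Atlas schema.
    return json.dumps({
        "model": model,  # swap to another backbone to rerun the pipeline
        "input": {"matrix": matrix_path, "format": "mtx"},
        "outputs": outputs or ["embeddings", "cell_types", "marker_genes"],
    })

body = build_request("my_experiment.mtx")
```

Swapping `model` is the only change needed to benchmark one backbone against another over the same endpoint.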
Bottom line
Geneformer is the clearest demonstration so far that the foundation model playbook transfers cleanly from language to single-cell biology. Its gene-rank representation is a clever piece of engineering, its zero-shot capabilities are genuinely useful, and fine-tuning costs are modest. If you work with scRNA-seq data, it belongs in your toolbox.