Cell Type Prediction from Sparse Expression Matrices: A Tutorial

A hands-on tutorial for predicting cell types from sparse gene expression data using Geneformer via API.

SciRouter Team
April 11, 2026

The sparse expression matrix is the lingua franca of single-cell RNA-seq. Every mainstream pipeline — scanpy, Seurat, Bioconductor — ultimately stores data as a compressed sparse row (CSR) or compressed sparse column (CSC) structure. If you want to plug a modern foundation model into your analysis, the first question is how to ship that sparse matrix across a network without exploding it into a dense array. This tutorial walks through exactly that.

By the end you will have a working Python function that takes an AnnData or scipy sparse matrix and returns cell-type calls, embeddings, and marker genes through the SciRouter Cell Atlas API. No GPU on your side, no Celery worker, no dense blow-up.

Note
The scripts here use only standard Python scientific libraries — numpy, scipy, requests, and optionally anndata. Nothing exotic.

A quick refresher on CSR

Scipy's csr_matrix stores a sparse matrix as three dense arrays:

  • data — the non-zero values, concatenated row by row.
  • indices — the column index of each value in data.
  • indptr — a length-(n_rows+1) array where indptr[i] to indptr[i+1] gives the slice of data belonging to row i.

That is it. Three arrays, plus the shape tuple, fully specify the matrix. Serialize them to JSON and you have a portable representation you can send over HTTP without expanding anything.
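
To make the three arrays concrete, here is a minimal sketch on a toy 3-cell by 4-gene matrix. The counts are made up for illustration; only standard numpy and scipy calls are used.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy (cells, genes) count matrix; the values are illustrative only.
dense = np.array([
    [0, 2, 0, 1],
    [0, 0, 0, 0],   # a cell with no counts at all
    [3, 0, 0, 4],
])
X = csr_matrix(dense)

print(X.data.tolist())     # [2, 1, 3, 4]  non-zero values, row by row
print(X.indices.tolist())  # [1, 3, 0, 3]  column index of each value
print(X.indptr.tolist())   # [0, 2, 2, 4]  row boundaries into data/indices

# Row i lives in data[indptr[i]:indptr[i+1]]; an empty row has equal bounds.
row0 = X.data[X.indptr[0]:X.indptr[1]]
row1 = X.data[X.indptr[1]:X.indptr[2]]
```

Note how the empty second row costs nothing in data or indices: it only shows up as a repeated value in indptr.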

Building the API payload

The Cell Atlas annotation endpoint accepts a payload with a matrix field containing exactly those three arrays plus the shape. Here is the serialization step:

python
from scipy.sparse import csr_matrix

def sparse_to_payload(X, genes):
    """Turn a (cells, genes) CSR matrix into an API-ready dict."""
    if not isinstance(X, csr_matrix):
        X = csr_matrix(X)
    return {
        "genes": list(genes),
        "matrix": {
            "indptr": X.indptr.tolist(),
            "indices": X.indices.tolist(),
            "data": X.data.tolist(),
            "shape": list(X.shape),
        },
    }

The gene list must match the columns of the matrix in the same order. Double-check this — mismatched gene order is the most common source of subtle annotation failures.
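
One way to catch a malformed payload before it leaves your machine is to rebuild the matrix from the dict and compare it with the original. The sketch below assumes the payload layout shown above; rebuild_from_payload is a hypothetical helper name, not part of any API, and the checks are a local sanity test rather than whatever validation the endpoint performs.

```python
import numpy as np
from scipy.sparse import csr_matrix

def rebuild_from_payload(payload):
    """Rebuild the CSR matrix from a payload dict, checking basic consistency."""
    m = payload["matrix"]
    n_cells, n_genes = m["shape"]
    indptr = np.asarray(m["indptr"], dtype=np.int64)
    indices = np.asarray(m["indices"], dtype=np.int64)
    data = np.asarray(m["data"])

    if len(payload["genes"]) != n_genes:
        raise ValueError(
            f"{len(payload['genes'])} gene names for {n_genes} matrix columns"
        )
    if len(indptr) != n_cells + 1:
        raise ValueError("indptr must have n_cells + 1 entries")
    if len(indices) != len(data):
        raise ValueError("indices and data must have the same length")
    if indices.size and (indices.min() < 0 or indices.max() >= n_genes):
        raise ValueError("column index out of range")
    return csr_matrix((data, indices, indptr), shape=(n_cells, n_genes))

# Toy round trip: serialize by hand, rebuild, compare.
original = csr_matrix(np.array([[0, 2, 0], [1, 0, 3]]))
payload = {
    "genes": ["CD3E", "CD8A", "GZMK"],
    "matrix": {
        "indptr": original.indptr.tolist(),
        "indices": original.indices.tolist(),
        "data": original.data.tolist(),
        "shape": list(original.shape),
    },
}
rebuilt = rebuild_from_payload(payload)
round_trip_ok = np.array_equal(rebuilt.toarray(), original.toarray())
```

A length check cannot detect genes that are present but in the wrong order, so it complements rather than replaces checking that the gene list was taken from the same object as the matrix.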

Calling the annotation endpoint

Wrap the payload in a POST request and hand off to the API:

python
import requests

API_URL = "https://scirouter-gateway-production.up.railway.app/v1/singlecell/annotate"
API_KEY = "sk-sci-your-api-key-here"

def annotate(X, genes, model="geneformer", atlas="human-core-2026"):
    payload = sparse_to_payload(X, genes)
    payload["model"] = model
    payload["reference_atlas"] = atlas
    headers = {"Authorization": f"Bearer {API_KEY}"}
    r = requests.post(API_URL, json=payload, headers=headers, timeout=300)
    r.raise_for_status()
    return r.json()

The response contains a calls array with one entry per cell. Each entry has label, confidence, alternates, and markers.

Running it from scanpy

Scanpy users will recognize the AnnData object. Here is a scanpy-compatible wrapper:

python
import scanpy as sc

def annotate_adata(adata, model="geneformer"):
    result = annotate(adata.X, adata.var_names, model=model)
    calls = result["calls"]
    adata.obs["cell_type"] = [c["label"] for c in calls]
    adata.obs["cell_type_confidence"] = [c["confidence"] for c in calls]
    return adata

adata = sc.read_h5ad("pbmc_10k.h5ad")
adata = annotate_adata(adata)
print(adata.obs["cell_type"].value_counts())

Four lines of annotation logic, and you have labels plus confidence scores written back to the AnnData object ready for downstream plotting and filtering.

Reading top marker genes from the response

Each call includes the top marker genes that drove the label, which is useful for sanity checking. For example, you expect CD8 T cells to have CD8A, CD8B, GZMK, and friends in their marker list. A confident call where the marker genes look wrong is a red flag.

python
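# result is the raw response dict returned by annotate()
result = annotate(adata.X, adata.var_names)
for i, call in enumerate(result["calls"][:5]):
    print(f"cell {i}: {call['label']} ({call['confidence']:.2f})")
    print("  top markers:", ", ".join(call["markers"][:5]))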

Normalization: what to do and what to skip

Rank-based foundation models handle library size for you, but basic QC is still non-negotiable. Do this before you call the API:

  • Filter empty droplets. Drop cells with fewer than some minimum total count.
  • Filter dying cells. Drop cells with a mitochondrial fraction above 10 to 20 percent depending on tissue.
  • Remove doublets. Scrublet or DoubletFinder both work.
  • Check gene symbols. Make sure your gene identifiers match the reference gene set. Ensembl IDs vs HGNC symbols is a common mismatch.

Warning
What you do NOT need to do beforehand: log1p normalization, z-scoring, highly-variable-gene selection, PCA. The rank transform inside the model takes care of those steps.
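
The first two filters in the QC checklist above (empty droplets and high-mitochondrial cells) can be sketched directly on the sparse matrix. This is a minimal illustration, not a replacement for a full QC pipeline: basic_qc_mask is a hypothetical helper, the min_counts default of 500 is an arbitrary placeholder, and the code assumes HGNC-style symbols where mitochondrial genes start with "MT-". Doublet removal is not covered here.

```python
import numpy as np
from scipy.sparse import csr_matrix

def basic_qc_mask(X, genes, min_counts=500, max_mito_frac=0.20):
    """Return a boolean mask of cells passing count and mito-fraction filters.

    min_counts and max_mito_frac are illustrative defaults; tune per tissue.
    """
    X = csr_matrix(X)
    genes = np.asarray(genes, dtype=str)

    # Total counts per cell.
    total = np.asarray(X.sum(axis=1)).ravel().astype(float)

    # Assumption: mitochondrial genes are named "MT-..." (HGNC symbols).
    is_mito = np.char.startswith(np.char.upper(genes), "MT-")
    mito = X @ is_mito.astype(float)  # per-cell mitochondrial counts

    # Avoid division by zero for cells with no counts.
    mito_frac = np.divide(mito, total, out=np.zeros_like(total), where=total > 0)
    return (total >= min_counts) & (mito_frac <= max_mito_frac)

# Toy example: three cells, three genes.
genes = ["MT-CO1", "CD3E", "CD8A"]
X = csr_matrix(np.array([
    [0, 600, 400],    # healthy: 1000 counts, no mito
    [300, 100, 100],  # 60 percent mitochondrial
    [0, 10, 5],       # nearly empty droplet
]))
mask = basic_qc_mask(X, genes)
X_clean = X[mask]
```

With AnnData, the same mask can be applied as adata[mask].copy() so that obs and var stay aligned with the filtered matrix.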

Handling large atlases

The endpoint accepts matrices up to about 100,000 cells per request. For larger atlases, chunk the matrix and merge the results:

python
from scipy.sparse import csr_matrix

def annotate_large(X, genes, chunk_size=50_000):
    """Annotate X in row chunks and return the calls in the original cell order."""
    X = csr_matrix(X)  # CSR makes row slicing cheap
    all_calls = []
    for start in range(0, X.shape[0], chunk_size):
        stop = min(start + chunk_size, X.shape[0])
        result = annotate(X[start:stop], genes)
        all_calls.extend(result["calls"])
    return all_calls

This keeps each request within API limits and preserves the order of cells in the output.
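
Because the data and indices arrays in the payload grow with the number of non-zero entries rather than the number of cells, fixed-size chunks can produce very uneven request sizes when some cells are much deeper than others. One possible refinement, sketched below, picks chunk boundaries from indptr so that each chunk stays under a non-zero budget as well as a cell cap. The helper name chunk_bounds_by_nnz and both limits are hypothetical; the source only documents the per-request cell limit.

```python
import numpy as np

def chunk_bounds_by_nnz(indptr, max_nnz, max_cells):
    """Split rows into (start, stop) ranges with at most max_nnz non-zeros
    and at most max_cells rows each. A single row that exceeds max_nnz is
    emitted as its own chunk so the loop always advances."""
    indptr = np.asarray(indptr)
    n_rows = len(indptr) - 1
    bounds = []
    start = 0
    while start < n_rows:
        limit = indptr[start] + max_nnz
        # Last row boundary whose cumulative non-zero count is within budget.
        stop = int(np.searchsorted(indptr, limit, side="right")) - 1
        stop = min(stop, start + max_cells, n_rows)
        stop = max(stop, start + 1)
        bounds.append((start, stop))
        start = stop
    return bounds

# Toy indptr for 5 cells with 2, 0, 2, 5, and 1 non-zero entries.
toy_indptr = [0, 2, 2, 4, 9, 10]
bounds = chunk_bounds_by_nnz(toy_indptr, max_nnz=4, max_cells=10)
print(bounds)  # [(0, 3), (3, 4), (4, 5)]
```

Each (start, stop) pair can then be used as X[start:stop] in place of the fixed-size slices, and the cell order of the merged calls is unchanged.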

Troubleshooting

All cells come back as the same type

Almost always a gene-symbol mismatch. Verify that adata.var_names contains the identifier type the API expects (HGNC symbols by default).
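
A quick heuristic for spotting this before sending anything is to test whether the identifiers look like human Ensembl gene IDs (ENSG followed by 11 digits, optionally with a version suffix). The sketch below is a rough check only; guess_identifier_type is a hypothetical helper and the 90 and 10 percent cut-offs are arbitrary.

```python
import re

# Human Ensembl gene IDs: "ENSG" + 11 digits, optional ".version" suffix.
ENSEMBL_HUMAN_GENE = re.compile(r"^ENSG\d{11}(\.\d+)?$")

def guess_identifier_type(genes, sample=1000):
    """Classify a gene list as 'ensembl', 'symbol-like', 'mixed', or 'empty'."""
    names = [str(g) for g in list(genes)[:sample]]
    if not names:
        return "empty"
    frac = sum(bool(ENSEMBL_HUMAN_GENE.match(n)) for n in names) / len(names)
    if frac > 0.9:
        return "ensembl"
    if frac < 0.1:
        return "symbol-like"
    return "mixed"

print(guess_identifier_type(["ENSG00000153563", "ENSG00000172116.23"]))  # ensembl
print(guess_identifier_type(["CD8A", "CD8B", "GZMK"]))                   # symbol-like
```

If the result is "ensembl" while the endpoint expects symbols, map the identifiers first; many AnnData files keep the symbols in a separate adata.var column, though the column name varies by pipeline.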

Low confidence across the entire dataset

Check your QC. Datasets with high doublet rates or ambient-RNA contamination score low across the board.
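
To see whether low confidence is global or concentrated in a few labels, it can help to summarize the scores per predicted label. The sketch below works on the calls list from the response and uses only the label and confidence fields described earlier. It assumes confidence is a number between 0 and 1; the 0.5 threshold and the helper name confidence_by_label are illustrative choices.

```python
from collections import defaultdict
from statistics import median

def confidence_by_label(calls, threshold=0.5):
    """Summarize confidence per predicted label.

    threshold is an illustrative cut-off for 'low confidence'.
    """
    grouped = defaultdict(list)
    for call in calls:
        grouped[call["label"]].append(float(call["confidence"]))
    return {
        label: {
            "n": len(scores),
            "median": median(scores),
            "frac_low": sum(s < threshold for s in scores) / len(scores),
        }
        for label, scores in grouped.items()
    }

# Mock calls for illustration; real ones come from the API response.
mock_calls = [
    {"label": "CD8 T cell", "confidence": 0.9},
    {"label": "CD8 T cell", "confidence": 0.7},
    {"label": "B cell", "confidence": 0.3},
    {"label": "B cell", "confidence": 0.4},
]
summary = confidence_by_label(mock_calls)
for label, stats in summary.items():
    print(label, stats)
```

If every label shows a high frac_low, the problem is likely upstream (QC or identifiers); if only one or two labels do, those populations deserve a manual look.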

Mysterious marker genes

Non-human orthologs mapped through a reference ortholog table can produce noisy marker outputs. If you are working with mouse or other organisms, confirm the orthology table used by the endpoint.

Where to go next

Once annotation is working, try the companion endpoints: embedding extraction for dimensionality reduction and marker-gene export for downstream enrichment analysis. The Geneformer tool page lists all of them, and the Cell Atlas workspace wraps the same pipeline in a browser UI if you want to explore before scripting.

Bottom line

Sending a sparse matrix to a cell-type prediction API is no more complicated than a few scipy and requests calls. Once you have the wrapper function, you can swap foundation models, tweak atlases, and plug the output straight back into scanpy without rewriting anything.

Frequently Asked Questions

Why sparse matrices and not dense arrays?

A typical 10x dataset is 90 to 99 percent zeros. Dense representation wastes memory, bandwidth, and serialization time. CSR and CSC formats store only the non-zero entries and are the standard input format for every serious single-cell tool.
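
The difference is easy to measure. The sketch below builds a synthetic matrix that is 95 percent zeros (random values, not real expression data) and compares the bytes needed for a dense array with the bytes held by the three CSR arrays.

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Synthetic 1,000-cell x 2,000-gene matrix with 5 percent non-zero entries.
X = sparse_random(1000, 2000, density=0.05, format="csr",
                  random_state=0, dtype=np.float32)

dense_bytes = X.shape[0] * X.shape[1] * X.dtype.itemsize
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes

print(f"dense:  {dense_bytes / 1e6:.1f} MB")
print(f"sparse: {sparse_bytes / 1e6:.1f} MB")
print(f"ratio:  {dense_bytes / sparse_bytes:.1f}x")
```

The saving grows with sparsity, since CSR pays for one value and one column index per non-zero entry plus one indptr entry per row.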

Do I need to normalize counts before sending them?

With rank-based foundation models like Geneformer, no heavy normalization is needed. The rank transform handles library size automatically. Standard QC — removing empty droplets, high-mito cells, and doublets — is still required.

What is the maximum matrix size I can send in one request?

The hosted annotation endpoint accepts up to about 100,000 cells per request. For larger atlases, chunk the matrix into batches of 50,000 to 100,000 cells and merge the results client-side.

What cell types does the reference atlas cover?

The default human-core atlas covers major immune, stromal, epithelial, and parenchymal cell types across common tissues. Rare or tissue-specific subtypes may come back with low confidence scores — review those manually.

Can I run this with scanpy?

Yes. A scanpy AnnData object exposes X (expression matrix), var_names (genes), and obs_names (cells), which is exactly what you need to build the API payload. The walkthrough above shows a full scanpy-compatible example.

How long does a typical call take?

Latency is dominated by upload time for the sparse matrix. A 10,000-cell PBMC dataset typically returns labels, embeddings, and marker genes in under 30 seconds on a warm endpoint.
