
Predict Your 23andMe Variants Without an Ancestry Subscription

A privacy-first tutorial for analyzing your own 23andMe file. Client-side parsing, ClinVar matching, and what your variants mean.

SciRouter Team
April 11, 2026
12 min read

Consumer genetic testing services like 23andMe and Ancestry let you download a raw data file, usually from a link buried in the account settings that few people ever use. That file contains the same genotype calls the services draw on to generate their reports, plus many positions the reports never mention. With a bit of knowledge and a few open-source tools, you can do your own variant interpretation without paying a subscription, and without uploading your genome to a third party.

This guide shows how to parse the raw data file client-side, match variants against ClinVar, and score coding variants with ESM-2 through the SciRouter free variant-predictor tool. Everything that needs to stay private stays on your machine.

Warning
This is a research and personal-curiosity workflow. Do not use any score or match as a clinical diagnosis. If something looks concerning, talk to a genetic counselor or physician — do not make health decisions based on this kind of analysis alone.

What is in a raw data file

A 23andMe raw data file is a plain text file with roughly 600,000 to 700,000 rows, one row per genotyped position, preceded by a block of comment lines that start with #. Each row has four tab-separated columns: the rsID (a dbSNP identifier), the chromosome, the position, and your genotype at that position. Ancestry uses a similar layout but splits the genotype into two columns, allele1 and allele2.

Almost every row is a single-nucleotide polymorphism (SNP) the chip was designed to measure; a small number of rows are insertion/deletion probes, reported with I and D in place of bases. Not every interesting variant is on every chip, but the overlap with ClinVar is substantial: you typically get tens of thousands of matches from a single file.

Parsing client-side

The raw file is small enough to parse in the browser. A few hundred thousand lines of tab-separated text loads in a second or two on any modern laptop. The free tool at SciRouter does this parsing entirely in client-side JavaScript, so the file contents never leave your machine.

The output of the parser is a map from rsID to genotype, plus a secondary index from chromosome-position pairs to genotype for variants that do not have an rsID. That is everything you need for the next step.
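The two lookup tables can be sketched as follows. This is written in Python for readability, not the browser JavaScript the tool itself uses, and it assumes the four-column 23andMe layout with # comment lines. The function name parse_raw_data, the sample ids, and the rule that any id not starting with "rs" goes into the position index are assumptions for illustration.

```python
from typing import Dict, Tuple

def parse_raw_data(text: str) -> Tuple[Dict[str, str], Dict[Tuple[str, int], str]]:
    """Parse 23andMe-style raw data into two lookup tables."""
    by_rsid: Dict[str, str] = {}
    by_position: Dict[Tuple[str, int], str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        fields = line.split("\t")
        if len(fields) != 4:
            continue  # skip malformed rows
        marker_id, chrom, pos, genotype = fields
        if not pos.isdigit():
            continue  # skip an uncommented column-name row
        if marker_id.startswith("rs"):
            by_rsid[marker_id] = genotype
        else:
            # Assumption: markers without an rsID are indexed by position only.
            by_position[(chrom, int(pos))] = genotype
    return by_rsid, by_position

# Made-up sample rows, not real genotype data.
sample = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs1001\t1\t1000\tAG\n"
    "i2001\t1\t2000\tCT\n"
)
by_rsid, by_position = parse_raw_data(sample)
```

Plain dictionaries are enough here because the ClinVar step only needs point lookups by rsID or by position.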

Matching against ClinVar

ClinVar is a public database of genetic variants curated with clinical significance annotations. For each variant, it records the reference and alternate alleles, the gene it falls in, a pathogenicity call (benign, likely benign, uncertain, likely pathogenic, pathogenic), and a list of submitting laboratories.

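Matching your raw data against ClinVar is a simple join with one extra check. For every rsID in your file, check whether it appears in the ClinVar index. If it does, compare your genotype with the alternate allele of each ClinVar record for that rsID. A shared rsID only means the position is catalogued, and one rsID can carry several alternate alleles with different classifications. The output is a list of variants where you carry at least one copy of an annotated allele, together with its clinical annotation.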
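A minimal sketch of that join, under stated assumptions: the local ClinVar subset is modeled as a dictionary from rsID to a list of records, the ClinVarEntry and Match field names are hypothetical, genotypes and ClinVar alleles are taken to be on the same strand, and only single-base alternates are handled. The tool's actual data layout may differ.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass(frozen=True)
class ClinVarEntry:
    gene: str
    ref: str
    alt: str
    significance: str

@dataclass(frozen=True)
class Match:
    rsid: str
    entry: ClinVarEntry
    alt_copies: int  # 1 = heterozygous, 2 = homozygous for the alternate allele

def match_clinvar(
    genotypes: Dict[str, str],
    clinvar_index: Dict[str, List[ClinVarEntry]],
) -> List[Match]:
    """Join genotypes to ClinVar by rsID, keeping only carried alternate alleles."""
    matches: List[Match] = []
    for rsid, genotype in genotypes.items():
        for entry in clinvar_index.get(rsid, []):
            if len(entry.alt) != 1:
                continue  # this sketch only handles single-base alternates
            copies = genotype.count(entry.alt)
            if copies > 0:
                matches.append(Match(rsid, entry, copies))
    return matches

# Made-up genotypes and annotations for illustration only.
genotypes = {"rs1001": "AG", "rs1002": "CC", "rs1003": "TT"}
clinvar_index = {
    "rs1001": [ClinVarEntry("GENE1", "A", "G", "Benign")],
    "rs1002": [ClinVarEntry("GENE2", "C", "T", "Pathogenic")],
}
matches = match_clinvar(genotypes, clinvar_index)
```

In this example rs1002 is listed in the index but the genotype is CC, so no alternate allele is carried and it is not reported.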

Note
Most of your matches will be benign or likely benign. That is expected and reassuring. The interesting variants are in the “uncertain” category — those are the ones where additional scoring methods add real value.

Scoring novel and uncertain variants

For variants that are not in ClinVar, or are marked as “uncertain significance,” you can use a protein language model to get a zero-shot prior on functional impact. ESM-2 is a common choice for this. Given a protein sequence, the model assigns a probability to every amino acid at each position; comparing the log-probability of the substituted residue with that of the reference residue gives a log-likelihood ratio that correlates with functional effect.
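To make the log-likelihood ratio concrete, here is a small sketch. It assumes you already have the model's probabilities for each amino acid at the variant position; the position_probs mapping is a hypothetical stand-in for that output, and the numbers are made up.

```python
import math
from typing import Mapping

def log_likelihood_ratio(
    position_probs: Mapping[str, float], ref_aa: str, alt_aa: str
) -> float:
    """log P(alt) - log P(ref) at one position; negative means alt is less likely."""
    return math.log(position_probs[alt_aa]) - math.log(position_probs[ref_aa])

# Made-up probabilities; only three of the twenty residues are shown.
example_probs = {"R": 0.60, "H": 0.03, "K": 0.20}
llr = log_likelihood_ratio(example_probs, "R", "H")
```

Here the model considers histidine twenty times less likely than the reference arginine at this position, so the ratio is the natural log of 0.05, a clearly negative score.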

The workflow is straightforward:

  • Translate the gene transcript around the variant into its protein sequence.
  • Apply the amino-acid change implied by the DNA variant.
  • Send the protein sequence and the substitution to the ESM-2 scoring endpoint.
  • Read the log-likelihood ratio. More negative values suggest higher functional impact.
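The first two steps can be sketched as below. The example assumes you already have the coding sequence of the transcript, in frame from the start codon, and the zero-based index of the changed base within it; mapping a genomic position onto that index is not shown. The function names and the toy sequence are illustrative assumptions.

```python
from typing import Tuple

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(cds: str) -> str:
    """Translate an in-frame coding sequence, ignoring any trailing partial codon."""
    usable = len(cds) - len(cds) % 3
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))

def missense_change(cds: str, cds_index: int, alt_base: str) -> Tuple[str, str]:
    """Return (reference protein, substitution such as 'R2H') for one base change."""
    mutated = cds[:cds_index] + alt_base + cds[cds_index + 1:]
    residue = cds_index // 3  # zero-based residue index of the affected codon
    protein = translate(cds)
    alt_aa = translate(mutated)[residue]
    return protein, f"{protein[residue]}{residue + 1}{alt_aa}"

# Toy coding sequence: ATG CGT GAA encodes M, R, E.
protein, substitution = missense_change("ATGCGTGAA", 4, "A")
```

Changing the middle base of the second codon turns CGT (arginine) into CAT (histidine), so the pair to score is the protein MRE with the substitution R2H.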

This is the same method used in research for variant effect prediction across the proteome. It is not diagnostic on its own, but it is a useful prior that complements ClinVar.

Privacy design

A few design choices keep your raw data private through this workflow:

  • Client-side parsing. The raw file is loaded and parsed in your browser. It never reaches a server.
  • Client-side ClinVar matching. The ClinVar subset is shipped as a static asset and matched against your variants locally. No API calls involving your full genome.
  • Per-variant API calls. Only the protein sequence and single amino-acid substitution for a variant you choose to score are sent to the ESM-2 endpoint. The endpoint never sees your other genotypes, your file, or your identity.

The trade-off is that the ESM-2 scoring step does reveal which variant you are curious about. If even that is too much, you can run ESM-2 locally with the open-source weights, at the cost of supplying your own compute: the smaller checkpoints run on a CPU, while the largest ones realistically need a GPU.
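To make the privacy boundary concrete, the sketch below builds the only data that would leave the machine for one scoring call. The JSON field names sequence and substitution are hypothetical, since the endpoint's real schema is not described here, and no network request is made.

```python
import json

def build_scoring_request(protein_sequence: str, substitution: str) -> str:
    """Serialize one protein and one substitution (e.g. 'R2H'); nothing else."""
    ref_aa = substitution[0]
    position = int(substitution[1:-1])  # 1-based residue number
    if protein_sequence[position - 1] != ref_aa:
        raise ValueError("substitution does not match the protein sequence")
    # Hypothetical field names; no rsID, genotype, or file content is included.
    return json.dumps({"sequence": protein_sequence, "substitution": substitution})

body = build_scoring_request("MRE", "R2H")
```

Checking the reference residue before sending catches off-by-one mistakes locally, so a mismatched request never needs to reach the server.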

What to expect when you run it

A typical 23andMe raw data file will produce tens of thousands of ClinVar matches, the vast majority of which are benign. You will typically see a handful of “likely pathogenic” or “pathogenic” annotations. Most of these are carrier-status findings for recessive conditions, where a single copy usually does not affect your own health.
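A quick way to get this overview is to tally matches by classification and pull out the single-copy pathogenic calls. The tuple layout (rsID, significance, copies of the annotated allele) and the rows themselves are made up for illustration.

```python
from collections import Counter

# Made-up match rows: (rsID, ClinVar significance, copies of the annotated allele).
match_rows = [
    ("rs1001", "Benign", 2),
    ("rs1002", "Benign", 1),
    ("rs1003", "Likely benign", 1),
    ("rs1004", "Pathogenic", 1),
    ("rs1005", "Uncertain significance", 1),
]

# How many matches fall into each classification.
counts = Counter(significance for _, significance, _ in match_rows)

# Single-copy pathogenic calls are the typical carrier-status findings.
single_copy_pathogenic = [
    rsid
    for rsid, significance, copies in match_rows
    if significance in {"Pathogenic", "Likely pathogenic"} and copies == 1
]
```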

The more interesting category is variants of uncertain significance. These are where a zero-shot protein-level score genuinely helps you prioritize what to look at. A variant with a strong negative ESM-2 log ratio in a disease-relevant gene is worth investigating further; a variant with a neutral score probably is not.
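One simple way to prioritize is to sort scored variants so the most negative log ratios come first. The sketch below uses made-up gene names and scores, and deliberately applies no cutoff, since no threshold is given here.

```python
from typing import List, NamedTuple

class ScoredVariant(NamedTuple):
    gene: str
    substitution: str
    llr: float  # log-likelihood ratio; more negative suggests higher impact

def rank_by_impact(variants: List[ScoredVariant]) -> List[ScoredVariant]:
    """Order variants from most negative to least negative score."""
    return sorted(variants, key=lambda v: v.llr)

# Made-up scores for illustration only.
scored = [
    ScoredVariant("GENE_A", "A10V", -0.4),
    ScoredVariant("GENE_B", "R2H", -7.1),
    ScoredVariant("GENE_C", "L55P", -3.2),
]
ranked = rank_by_impact(scored)
```

The ordering only tells you where to look first; whether a gene is disease-relevant still has to come from the ClinVar annotation or the literature.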

Limits of consumer genotyping chips

Keep in mind what consumer chips do and do not cover:

  • Mostly SNPs, with very few insertions or deletions. Many of the high-impact variants in cancer-predisposition genes are indels or larger rearrangements, and consumer chips capture only a small number of them, if any.
  • Designed positions. Chips genotype pre-selected positions. A variant not on the chip is not in your raw data, which is different from being absent from your genome.
  • Error rates. Consumer chips are highly accurate for common variants, but when a variant is very rare, even a small per-position error rate means a sizeable share of positive calls can be false. A single rare-variant call should always be confirmed by an orthogonal method.
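The difference between "not on the chip" and "not in your genome" is easy to lose in code, so it helps to make each case explicit. The sketch below assumes 23andMe-style encodings, where -- marks a no-call and I and D mark insertion/deletion probes; the function name, status labels, and sample ids are hypothetical.

```python
from typing import Dict

def call_status(marker_id: str, genotypes: Dict[str, str]) -> str:
    """Classify what the raw data can actually say about one marker."""
    if marker_id not in genotypes:
        return "not_genotyped"  # absent from the chip, not evidence of absence
    genotype = genotypes[marker_id]
    if not genotype or "-" in genotype:
        return "no_call"  # probe exists but produced no usable genotype
    if set(genotype) <= {"I", "D"}:
        return "indel_probe"  # insertion/deletion call rather than bases
    return "called"

# Made-up genotypes for illustration only.
sample_genotypes = {"rs1001": "AG", "rs1002": "--", "rs1003": "DI"}
statuses = {m: call_status(m, sample_genotypes)
            for m in ["rs1001", "rs1002", "rs1003", "rs9999"]}
```

Only the "called" status supports a statement about your genotype; the other three mean the file is silent or ambiguous at that marker.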

When to involve a professional

If this workflow surfaces a concerning variant — a pathogenic call in a cancer-risk gene, for example — the next step is not more computation. It is a consultation with a genetic counselor and, if appropriate, a clinical-grade confirmatory test. Research scores, including ESM-2 log ratios, are not a substitute for clinical confirmation.

Running the workflow

The SciRouter variant predictor wraps the whole pipeline into a single browser app. Open your raw data file in the app, see matching ClinVar variants, and score uncertain ones with ESM-2. All of it is free, and everything runs client-side except the per-variant protein scoring call.

Bottom line

You do not need an ongoing ancestry subscription to get real value from your consumer genotype file. A client-side parser, a ClinVar matcher, and a protein-level variant scorer are all the ingredients you need. Keep the scores in research framing, talk to a professional about anything concerning, and enjoy the rare pleasure of owning your own data.


Frequently Asked Questions

Is this a replacement for a clinical genetic test?

No. This workflow is for research and personal curiosity. Clinical genetic interpretation requires a certified laboratory, a trained professional, and orthogonal evidence. Nothing in this tutorial is a diagnosis.

Does my raw data leave my computer?

The parsing and ClinVar matching run entirely in the browser. Only the protein sequence and the single amino-acid substitution for a specific variant you choose to score are sent to the ESM-2 scoring endpoint, never the whole file.

What file formats work?

23andMe raw data TXT, Ancestry raw data TSV, and generic VCF files all work. The parser detects the format from the header line and extracts the relevant columns.
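Header-based detection can be sketched as follows. It relies on the column names 23andMe and Ancestry exports use (a single genotype column versus allele1 and allele2) and on the ##fileformat line that opens a VCF file; the function name and return labels are assumptions, and the tool's own detection logic may differ.

```python
def detect_format(text: str) -> str:
    """Guess the raw-data format from the first recognizable header line."""
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("##fileformat=VCF"):
            return "vcf"
        # 23andMe comments out its column row with '#'; Ancestry does not.
        columns = stripped.lstrip("#").strip().lower().split("\t")
        if columns[:3] == ["rsid", "chromosome", "position"]:
            if columns[3:] == ["genotype"]:
                return "23andme"
            if columns[3:] == ["allele1", "allele2"]:
                return "ancestry"
        if not stripped.startswith("#"):
            break  # reached data rows without a recognizable header
    return "unknown"

# Made-up header snippets for illustration.
fmt_23andme = detect_format("# some comment\n# rsid\tchromosome\tposition\tgenotype\n")
fmt_ancestry = detect_format("#comment\nrsid\tchromosome\tposition\tallele1\tallele2\n")
fmt_vcf = detect_format("##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n")
```

Stopping at the first data row keeps detection fast, since only the top of a file with several hundred thousand lines is ever inspected.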

Why ESM-2 instead of Evo 2 for this?

For coding variants specifically, ESM-2 on the translated protein is often faster, cheaper, and comparably accurate. For non-coding variants Evo 2 is the better choice. A real workflow uses both.

Can I trust a low ClinVar match count?

Only as a statement about the chip. No consumer genotyping chip covers every position in ClinVar, so a variant missing from your raw data is not necessarily absent from your genome; it may simply not have been genotyped. Finding zero rare variants is often a coverage artifact, not a biological result.

Is any of this free?

Yes. The free tool at SciRouter parses your raw data, matches it against ClinVar, and lets you score selected coding variants with ESM-2 at no cost. See the predict-your-variant tool for the hosted version.
