Your Raw Data File Is Just a Text File
When you download your raw data from 23andMe, you get a zipped text file containing hundreds of thousands of lines. It looks intimidating at first glance, but the format is straightforward once you understand the four columns. This guide explains exactly what each part means and how to work with the data.
The File Format
The raw data file is a tab-separated values (TSV) file. Lines starting with # are comments — they contain metadata like the chip version and download date. The data lines have four columns separated by tabs:
# This data file generated by 23andMe at: Wed Jan 15 2026 08:42:33
# Below is a text version of your data.
# rsid chromosome position genotype
rs3094315 1 752566 AG
rs12562034 1 768448 GG
rs3934834 1 1005806 CT
rs9442372 1 1018704 AG
rs3737728 1 1021415 GG
rs11260588 1 1021695 --
Each non-comment line represents one SNP (single nucleotide polymorphism), a position in your genome where people commonly differ. A typical file has 600,000 to 730,000 of these entries.
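As a minimal sketch of the layout described above, the snippet below turns one line of the file into a small record. The names SnpRecord and parse_line are illustrative choices, not part of any 23andMe tooling, and the sample text is built from the example rows shown earlier.

```python
from typing import NamedTuple, Optional


class SnpRecord(NamedTuple):
    """One data line of the raw file (illustrative type, not an official schema)."""
    rsid: str
    chromosome: str
    position: int
    genotype: str


def parse_line(line: str) -> Optional[SnpRecord]:
    """Return a SnpRecord for a data line, or None for comments, blanks, and malformed lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    parts = line.split("\t")
    if len(parts) != 4:
        return None
    rsid, chromosome, position, genotype = parts
    if not position.isdigit():
        return None
    return SnpRecord(rsid, chromosome, int(position), genotype)


# Sample text built from the example rows above, with real tab separators.
sample = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs3094315\t1\t752566\tAG\n"
    "rs11260588\t1\t1021695\t--\n"
)
records = [r for r in map(parse_line, sample.splitlines()) if r is not None]
for record in records:
    print(record)
```

Keeping the chromosome as a string rather than an integer is deliberate, since X, Y, and MT are not numbers.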
What the Columns Mean
rsid — The SNP Identifier
The first column is the reference SNP identifier, a unique label assigned by the dbSNP database. It usually starts with “rs” followed by a number (like rs3094315); 23andMe files also contain some internal identifiers that start with “i” and have no dbSNP entry. The rsid is the universal name for that variant position: researchers, clinicians, and databases all use the same rsid to refer to the same genomic position. You can paste any rsid into the SNP Lookup tool to find out what it does.
chromosome — Where in the Genome
The second column tells you which chromosome the SNP is on. Values are 1 to 22 for the autosomes, plus X, Y, and MT (mitochondrial DNA). Chromosomes 1–22 are present in two copies (one from each parent), which is why most genotypes have two letters. The X chromosome has two copies in genetic females and one in genetic males. Y chromosome SNPs are inherited from the father and mitochondrial SNPs from the mother, so each comes from one parent only.
position — The Exact Coordinate
The third column is the base-pair position on the chromosome, using coordinates from the GRCh37 (hg19) human reference genome assembly. This number tells bioinformatics tools exactly where in the chromosome the variant falls. Position 752566 on chromosome 1 means the 752,566th nucleotide on chromosome 1 according to the GRCh37 reference. If you use tools that expect GRCh38 coordinates, you will need to convert positions using a liftover tool.
genotype — Your Two Alleles
The fourth column is your genotype: the two nucleotide letters you carry at that position. Since you have two copies of each autosomal chromosome, you get two letters. The possible nucleotides are A (adenine), C (cytosine), G (guanine), and T (thymine).
- Homozygous — both alleles are the same (AA, CC, GG, TT). You inherited the same variant from both parents.
- Heterozygous — the alleles differ (AG, CT, etc.). You inherited one variant from each parent.
- No-call (--) — the chip could not determine your genotype at this position. This is normal and affects 1–3% of SNPs.
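The three categories above can be sketched as a small classifier. The function name classify_genotype and the extra labels "single-copy" and "other" are assumptions added for illustration: the first covers a one-letter genotype where only one chromosome copy exists, the second covers anything that is not made of A, C, G, or T.

```python
def classify_genotype(genotype: str) -> str:
    """Label a genotype string from the fourth column (illustrative helper)."""
    g = genotype.strip().upper()
    if not g or "-" in g:
        return "no-call"
    if any(base not in "ACGT" for base in g):
        # Assumption: any code outside A/C/G/T is simply reported as "other".
        return "other"
    if len(g) == 1:
        # Assumption: a single letter means only one chromosome copy was genotyped.
        return "single-copy"
    if len(g) == 2:
        return "homozygous" if g[0] == g[1] else "heterozygous"
    return "other"


for example in ["GG", "AG", "--", "A"]:
    print(example, "->", classify_genotype(example))
```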
Understanding Effect Alleles
When a study reports that “the A allele at rs1234 is associated with increased risk,” the A is called the effect allele. If your genotype at rs1234 is AG, you carry one copy of the effect allele. If it is AA, you carry two copies. If it is GG, you carry none. The number of effect alleles you carry (0, 1, or 2) is called your dosage. For additive traits, the expected size of the genetic effect scales with the dosage.
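Counting the dosage is a one-line operation once the genotype is known. The sketch below uses a hypothetical helper, effect_allele_dosage, and assumes the study and your file report alleles on the same strand; it returns None for no-calls and other genotypes it cannot interpret.

```python
from typing import Optional


def effect_allele_dosage(genotype: str, effect_allele: str) -> Optional[int]:
    """Count copies (0, 1, or 2) of effect_allele in a two-letter genotype.

    Assumes both are reported on the same strand. Returns None when the
    genotype is a no-call or is not two letters from A, C, G, T.
    """
    g = genotype.strip().upper()
    allele = effect_allele.strip().upper()
    if len(allele) != 1 or allele not in "ACGT":
        raise ValueError(f"effect allele must be one of A, C, G, T: {effect_allele!r}")
    if len(g) != 2 or any(base not in "ACGT" for base in g):
        return None
    return g.count(allele)


# The rs1234 example from the text, with A as the effect allele.
for genotype in ["AG", "AA", "GG", "--"]:
    print(genotype, "->", effect_allele_dosage(genotype, "A"))
```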
Chip Versions and SNP Counts
23andMe has used several different Illumina genotyping arrays over the years. The chip version determines which SNPs are in your file:
- Version 3 (2010–2013) — ~960,000 SNPs. Broadest coverage of any 23andMe chip.
- Version 4 (2013–2017) — ~570,000 SNPs. Smaller but added 23andMe custom content.
- Version 5 (2017–2020) — ~640,000 SNPs. More clinically relevant variants and pharmacogenomics coverage.
- GSA chip (2020–present) — ~730,000 SNPs. Illumina Global Screening Array with custom additions.
Not all SNPs overlap between chip versions. A variant present on version 3 may be absent from version 5, which is why third-party tools sometimes report that certain SNPs are not available in your data.
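To see which SNPs from a list of interest are actually in a given file, one option is to collect the rsids into a set and test membership. This is a rough sketch: load_rsids and split_by_availability are illustrative names, the sample lines reuse the example rows from earlier, and rs0000000 is a made-up placeholder standing in for a SNP that is not on the chip.

```python
from typing import Iterable, List, Set, Tuple


def load_rsids(lines: Iterable[str]) -> Set[str]:
    """Collect the first column of every data line, skipping comments and blanks."""
    rsids = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        rsids.add(line.split("\t", 1)[0].strip())
    return rsids


def split_by_availability(wanted: List[str], available: Set[str]) -> Tuple[List[str], List[str]]:
    """Split the wanted rsids into (present, missing), keeping the input order."""
    present = [rsid for rsid in wanted if rsid in available]
    missing = [rsid for rsid in wanted if rsid not in available]
    return present, missing


sample_lines = [
    "# rsid\tchromosome\tposition\tgenotype\n",
    "rs3094315\t1\t752566\tAG\n",
    "rs12562034\t1\t768448\tGG\n",
]
available = load_rsids(sample_lines)
# "rs0000000" is a hypothetical placeholder, not a real variant of interest.
present, missing = split_by_availability(["rs3094315", "rs0000000"], available)
print("present:", present)
print("missing:", missing)
```

With a real file, the same function accepts an open file handle in place of sample_lines.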
How to Look Up a Specific SNP
The fastest way to understand what a specific SNP does is to search your raw data file for the rsid and then look it up. Open your file in a text editor, use Ctrl+F (or Cmd+F on Mac) to search for the rsid, and note your genotype. Then paste the rsid into SciRouter's free SNP Lookup tool to get a plain-English explanation of the variant, associated traits, and what your genotype means.
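If you would rather not scroll through a file with hundreds of thousands of lines, the same search can be scripted. The sketch below is one possible approach: find_genotype is an illustrative name, and the demo reads from an in-memory copy of the example rows rather than a real download.

```python
import io
from typing import Iterable, Optional


def find_genotype(lines: Iterable[str], rsid: str) -> Optional[str]:
    """Return the genotype for rsid, or None if the rsid is not in the data."""
    for line in lines:
        if line.startswith("#"):
            continue
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 4 and parts[0] == rsid:
            return parts[3]
    return None


# In-memory stand-in for the raw data file, using rows from the example above.
sample_text = (
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs3094315\t1\t752566\tAG\n"
    "rs12562034\t1\t768448\tGG\n"
)
print(find_genotype(io.StringIO(sample_text), "rs12562034"))
```

For a real file, pass an open file handle instead, for example `with open("genome_data.txt") as f: find_genotype(f, "rs12562034")`, where the filename is whatever your unzipped download is called.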
Counting and Parsing SNPs with Python
For a quick summary of your data, here is a Python snippet that parses the file and reports basic statistics:
from collections import Counter

chroms = Counter()   # SNP count per chromosome label
no_calls = 0         # genotypes reported as "--"
total = 0            # all well-formed data lines

with open("genome_data.txt") as f:
    for line in f:
        # Skip comment lines and blank lines
        if line.startswith("#") or not line.strip():
            continue
        parts = line.strip().split("\t")
        if len(parts) == 4:
            rsid, chrom, pos, genotype = parts
            total += 1
            chroms[chrom] += 1
            if genotype == "--":
                no_calls += 1

print(f"Total SNPs: {total:,}")
if total:  # avoid dividing by zero on an empty file
    print(f"No-calls: {no_calls:,} ({100 * no_calls / total:.1f}%)")
print("\nSNPs per chromosome:")
# Numbered chromosomes first in numeric order, then the rest (MT, X, Y) alphabetically
for c in sorted(chroms, key=lambda x: (not x.isdigit(), int(x) if x.isdigit() else 0, x)):
    print(f"  chr{c}: {chroms[c]:,}")
What You Can Learn from Your Data
Once you understand the file format, you can explore several areas:
- Pharmacogenomics — how you metabolize medications. Try the Pharmacogenomics Checker.
- Trait genetics — variants linked to taste, caffeine metabolism, earwax type, and more
- Health-related variants — APOE, MTHFR, BRCA, and other well-studied positions
- Regulatory variant effects — use AlphaGenome to predict how your variants affect gene regulation
- Comprehensive analysis — upload your file for a full personal genomics dashboard
Ready to explore your genome? Start with the free SNP Lookup or sign up for a free SciRouter account to access the full genomics API.