BSMS205 · Genetics
Data Types for
Alleles and Populations
Chapter 24 · Part IV · Population Genetics
Today's central question
How do you turn
a genome
into data?
The universal answer · VCF
- Variant Call Format
- Nearly every variant caller outputs VCF
- Major variant databases store VCF or a sibling
- Most analysis tools accept VCF
The universal language of genetic variation.
Roadmap for today
- From sequencing to variant calling · the pipeline
- VCF file structure · the core columns
- The INFO field · population statistics
- Genotypes · FORMAT and GT
- Multi-sample VCF and joint genotyping
- Scaling · dense vs sparse · the future
§ 1
From Sequencing
to Variant Calling
The pipeline
- Extract DNA · sequence on Illumina → billions of short reads
- Align to reference (GRCh38) → BAM / CRAM file
- Variant call against reference → VCF file
Common tools: BWA · Bowtie2 (alignment) · GATK · DeepVariant · FreeBayes (variant calling)
What the variant caller decides
- Reference at chr1:12345 is A
- Reads at this position consistently show T
- Caller evaluates: read quality · mapping quality · strand bias · coverage
- Decision: is this a real variant, or noise?
§ 2
VCF File Structure
A VCF is a plain text file
- One line per variant
- Each line: where the variant is, what changes, how confident
- Tab-separated columns
- Header lines start with #
Nine core columns
| Column | Meaning |
| CHROM | Chromosome (chr1, chrX, ...) |
| POS | Position (1-based) |
| ID | Identifier (rsXXXX or .) |
| REF | Reference allele |
| ALT | Alternate allele |
| QUAL | Phred-scaled quality score |
| FILTER | PASS or failure reason |
| INFO | Site-level metadata |
| FORMAT | Per-sample field definitions |
One line · read it out
chr1 1234567 rs12345 A T 99.0 PASS AC=3;AN=6;AF=0.5 GT:DP 0/1:32
- chr1 pos 1,234,567 · ref A · alt T
- Quality 99 · passes filters · rsID rs12345
- 3 alternate alleles in 6 total (AF = 50%)
- This sample is heterozygous · 32-read depth
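The read-out above can be sketched in code. This is a minimal illustration, assuming a tab-separated data line with the nine columns from the table; `parse_vcf_line` is a hypothetical helper written for this slide, not part of any VCF library.

```python
# Fixed VCF columns in order; per-sample columns follow FORMAT.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT",
               "QUAL", "FILTER", "INFO", "FORMAT"]

def parse_vcf_line(line):
    """Split one VCF data line into a dict (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, fields[:9]))
    record["POS"] = int(record["POS"])        # 1-based position
    record["QUAL"] = float(record["QUAL"])    # assumes QUAL is not "."
    # INFO is a semicolon-separated list of KEY=VALUE pairs.
    info = {}
    for item in record["INFO"].split(";"):
        key, _, value = item.partition("=")
        info[key] = value
    record["INFO"] = info
    record["SAMPLES"] = fields[9:]            # raw per-sample strings
    return record

line = "chr1\t1234567\trs12345\tA\tT\t99.0\tPASS\tAC=3;AN=6;AF=0.5\tGT:DP\t0/1:32"
rec = parse_vcf_line(line)
print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], rec["INFO"]["AF"])
# chr1 1234567 A T 0.5
```

For real files, an established parser is a safer choice than hand-splitting lines.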
§ 3
The INFO Field ·
Population Statistics
Three fundamental fields
| Field | Meaning |
| AC Allele Count | Number of alternate alleles observed |
| AN Allele Number | Total alleles successfully called · ideally 2 × N |
| AF Allele Frequency | AC ÷ AN |
A concrete example
- Sequence 100 people → 200 alleles total
- 10 alternate alleles observed
→ AC = 10 · AN = 200 · AF = 10 ÷ 200 = 0.05
Why these three numbers matter
- Tell you how common or rare a variant is
- Help decide pathogenic vs benign
- Feed directly into GWAS, burden tests, population genetics
§ 4
Genotypes ·
FORMAT and GT
The FORMAT column
- Defines what information appears for each sample
- Colon-separated list of field names
- Example: GT:DP:AD
- Each sample column follows the same template
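A short sketch of how the FORMAT template pairs up with a sample column. The sample string `0/1:32:16,16` is an illustrative value assembled from the examples in these slides, and `parse_sample` is a hypothetical helper name.

```python
def parse_sample(format_field, sample_field):
    """Pair each FORMAT key with the value at the same position in a sample column."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    return dict(zip(keys, values))

sample = parse_sample("GT:DP:AD", "0/1:32:16,16")
print(sample)
# {'GT': '0/1', 'DP': '32', 'AD': '16,16'}
```

Every sample column in the same row is decoded with the same FORMAT string.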
Genotype codes · the heart of VCF
| Code | Meaning |
| 0/0 | Homozygous reference · no variant |
| 0/1 | Heterozygous · one ref, one alt |
| 1/1 | Homozygous alternate · both copies carry the variant |
| ./. | Missing · genotype not called |
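The genotype table can be written as a small classifier. This is a sketch for diploid GT strings; `classify_genotype` is a hypothetical name, and treating any `.` allele as missing is a simplifying assumption.

```python
import re

def classify_genotype(gt):
    """Map a GT string such as '0/1' to the category from the genotype table."""
    alleles = re.split(r"[/|]", gt)   # accept unphased (/) and phased (|)
    if "." in alleles:
        return "missing"
    if all(a == "0" for a in alleles):
        return "homozygous reference"
    if len(set(alleles)) == 1:
        return "homozygous alternate"
    return "heterozygous"

for gt in ["0/0", "0/1", "1/1", "./."]:
    print(gt, "->", classify_genotype(gt))
```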
Phased vs unphased
Unphased · /
- Know which alleles are present
- Don't know which came from mom vs dad
- The default from variant calling
Phased · |
- Know which allele is on which chromosome
- Essential for haplotype work
- Requires extra inference or family data
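The practical difference between `/` and `|` is whether allele order carries meaning. A minimal sketch, with hypothetical helper names `split_genotype` and `same_genotype`:

```python
def split_genotype(gt):
    """Return (alleles, phased) for a GT string such as '0/1' or '0|1'."""
    phased = "|" in gt
    alleles = tuple(gt.split("|" if phased else "/"))
    return alleles, phased

def same_genotype(gt_a, gt_b):
    """Compare two GT strings, respecting phase when both are phased."""
    a, a_phased = split_genotype(gt_a)
    b, b_phased = split_genotype(gt_b)
    if a_phased and b_phased:
        return a == b               # order matters: each position is a known haplotype
    return sorted(a) == sorted(b)   # order is arbitrary without phase

print(same_genotype("0/1", "1/0"))  # True
print(same_genotype("0|1", "1|0"))  # False
```

Unphased `0/1` and `1/0` describe the same individual, while phased `0|1` and `1|0` place the alternate allele on different chromosome copies.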
Supporting fields · DP and AD
| Field | Meaning | Example |
| DP | Total read depth | DP=32 |
| AD | Allelic depth (ref, alt) | AD=16,16 |
0/1 + AD=16,16 → balanced heterozygote · consistent call
0/1 + AD=28,4 → suspicious · possible sequencing error
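One way to put a number on "balanced" versus "suspicious" is the fraction of reads supporting the alternate allele. The sketch below assumes a biallelic site with AD given as `ref,alt`. The 0.25 to 0.75 window is an arbitrary illustrative threshold, not a recommended filter, and both function names are hypothetical.

```python
def allele_balance(ad):
    """Fraction of reads supporting the alt allele, from an AD string 'ref,alt'."""
    ref_depth, alt_depth = (int(x) for x in ad.split(","))
    total = ref_depth + alt_depth
    return alt_depth / total if total else None

def het_looks_balanced(ad, low=0.25, high=0.75):
    """Check a 0/1 call against an illustrative allele-balance window (assumed thresholds)."""
    ab = allele_balance(ad)
    return ab is not None and low <= ab <= high

print(allele_balance("16,16"), het_looks_balanced("16,16"))  # 0.5 True
print(allele_balance("28,4"), het_looks_balanced("28,4"))    # 0.125 False
```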
§ 5
Multi-Sample VCF
and Joint Genotyping
Multi-sample VCF (msVCF)
- One row per variant site
- One column per sample
- INFO summarises population · each sample column gives individual genotype
- Same file · population and individual views simultaneously
Three samples · one variant
| Person | Genotype | T alleles |
| Alice | 0/1 · heterozygous | 1 |
| Bob | 1/1 · homozygous alt | 2 |
| Carol | 0/0 · homozygous ref | 0 |
→ AC = 3 · AN = 6 · AF = 0.5
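The arithmetic in this table can be reproduced by counting alleles directly from the GT strings. A minimal sketch, where `allele_stats` is a hypothetical helper and missing alleles (`.`) are excluded from AN:

```python
import re

def allele_stats(genotypes):
    """Return (AC, AN, AF) from an iterable of GT strings."""
    ac = an = 0
    for gt in genotypes:
        for allele in re.split(r"[/|]", gt):
            if allele == ".":
                continue          # uncalled alleles do not count toward AN
            an += 1
            if allele != "0":
                ac += 1           # any non-reference allele counts toward AC
    af = ac / an if an else None
    return ac, an, af

genotypes = {"Alice": "0/1", "Bob": "1/1", "Carol": "0/0"}
print(allele_stats(genotypes.values()))
# (3, 6, 0.5)
```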
Variant ≠ Genotype
A variant is a site where an alternate allele exists in the population.
A genotype is an individual's specific allele combination.
One variant · many genotypes.
Variant = population-level · genotype = individual-level.
Joint genotyping
- Call variants across all samples simultaneously
- Even sites where most people are 0/0 get an explicit call
- Prevents missing data from corrupting AC / AN / AF
- Standard for gnomAD, UK Biobank, All of Us
The attendance analogy
Without joint genotyping
- Only write down who answered present
- Don't know who was absent
- Incomplete information
With joint genotyping
- Full roster of who is present or absent
- Complete information
- Correct AF calculations
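A toy illustration of why the full roster matters, reusing the Alice, Bob and Carol genotypes from the three-sample example. It assumes the "without" case records only the carriers. `allele_frequency` is a hypothetical helper.

```python
def allele_frequency(genotypes):
    """AF over called alleles only; returns None when nothing is called."""
    alleles = [a for gt in genotypes
               for a in gt.replace("|", "/").split("/")
               if a != "."]
    alt = sum(1 for a in alleles if a != "0")
    return alt / len(alleles) if alleles else None

carriers_only = ["0/1", "1/1"]          # Carol's 0/0 was never written down
joint_called = ["0/1", "1/1", "0/0"]    # Carol gets an explicit call

print(allele_frequency(carriers_only))  # 0.75
print(allele_frequency(joint_called))   # 0.5
```

Leaving out the explicit 0/0 shrinks AN from 6 to 4 and inflates AF from 0.5 to 0.75.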
§ 6
Scaling ·
Dense vs Sparse
The scaling problem
- UK Biobank · sequenced ~500,000 people
- Dense VCF: one genotype entry per sample per site
- For a rare variant: 499,990 samples are 0/0
- Storing all those 0/0s is wasteful
Sparse VCF
- Store only non-reference genotypes
- 0/0 entries are omitted, implied from absence
- Population stats (AC, AN, AF) still in INFO — no information loss
- For 500k samples · rare variant stores ~10 entries, not 500,000
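The idea can be sketched with a plain dictionary: keep only the non-reference genotypes, keyed by sample index, and rebuild the dense row on demand. This is a toy in-memory model, not the on-disk layout of any particular sparse format. The function names and carrier positions are made up for illustration.

```python
def to_sparse(dense_genotypes):
    """Keep only entries that are not 0/0, keyed by sample index."""
    return {i: gt for i, gt in enumerate(dense_genotypes) if gt != "0/0"}

def to_dense(sparse, n_samples):
    """Rebuild the full row, filling absent samples with the implied 0/0."""
    return [sparse.get(i, "0/0") for i in range(n_samples)]

n_samples = 500_000
dense = ["0/0"] * n_samples
for i in range(10):                  # 10 hypothetical carriers
    dense[i * 50_000] = "0/1"

sparse = to_sparse(dense)
print(len(dense), len(sparse))       # 500000 10
print(to_dense(sparse, n_samples) == dense)  # True
```

In this sketch a missing call (`./.`) is not equal to `0/0`, so it stays in the sparse dictionary and is not silently turned into a reference call.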
Dense vs sparse · a comparison
| Aspect | Dense | Sparse |
| Storage | All genotypes explicit | Only non-ref genotypes |
| File size | Very large, superlinear growth | Linear with carriers |
| Best for | 10s–1000s of samples | Biobank scale (100k+) |
| Computational feasibility | Breaks at biobank scale | Enables modern genomics |
§ 7
The Future of VCF
The fundamental challenge
- Adding new samples means updating every existing row
- File size grows superlinearly with sample count
- gnomAD export in VCF = petabyte range
- Doubling gnomAD → infeasible in standard VCF
GA4GH Future of VCF Working Group
- Global Alliance for Genomics and Health · stewards the VCF standard
- Future of VCF group formed 2019, meets monthly
- Interoperability with htsget · Beacon · CRAM / SAM / BAM
Three approaches to scaling
- Specification tweaks · SAV, SVCR, spVCF · keep the text format, compress smartly
- New binary format · engineered for large scale and targeted queries
- API-based hybrid · query protocol returns data in any VCF form
Why this scaling effort matters
- Enables reuse of large population-genomics datasets
- Supports virtual cohorts — combining studies to gain statistical power
- Especially valuable for rare disease research
§ 8
Why VCF Matters
Before VCF · the wild west
- Every lab used its own format
- Cross-study comparison was a nightmare
- Custom converters for every pair of tools
- Reproducibility was hard
After VCF · the universal standard
- Directly compare variants across labs and technologies
- gnomAD, ClinVar, dbSNP distribute in VCF
- Clinical reporting · research · everything
- The same format carries data from one exome to biobank-scale cohorts (with sparse encodings at the largest scales)
§ 9
Summary
What to take away
- Pipeline: reads → BAM → VCF
- VCF = plain text · nine core columns · one line per variant
- INFO: AC · AN · AF — population-level statistics
- FORMAT + GT: 0/0 · 0/1 · 1/1 · ./.
- Joint genotyping = complete data across all samples
- Sparse VCF = future-proof scaling to biobank size
The bigger picture
VCF is how genomes become data — how biological molecules become numbers that computers can analyse, compare, and share.
- From a single A→T change
- …to biobanks with millions of variants across hundreds of thousands of people
- Encoded as 0/0, 0/1, 1/1
End of Part IV
Allele frequency · population structure ·
linkage · recombination · VCF
Next: Part V · Functional Genetics