BSMS205 · Genetics

Data Types for
Alleles and Populations

Chapter 24 · Part IV · Population Genetics
Today's central question

How do you turn
a genome
into data?

The universal answer · VCF

  • Variant Call Format
  • Virtually every mainstream variant caller outputs VCF
  • Major variant databases store VCF or a sibling format
  • Most downstream analysis tools accept VCF
The universal language of genetic variation.

Roadmap for today

  1. From sequencing to variant calling · the pipeline
  2. VCF file structure · the core columns
  3. The INFO field · population statistics
  4. Genotypes · FORMAT and GT
  5. Multi-sample VCF and joint genotyping
  6. Scaling · dense vs sparse · the future
§ 1

From Sequencing
to Variant Calling

The pipeline

  1. Extract DNA · sequence on Illumina → billions of short reads
  2. Align to reference (GRCh38) → BAM / CRAM file
  3. Variant call against reference → VCF file

Common tools: aligners BWA · Bowtie2 · variant callers GATK · DeepVariant · FreeBayes

What the variant caller decides

  • Reference at chr1:12345 is A
  • Reads at this position consistently show T
  • Caller evaluates: read quality · mapping quality · strand bias · coverage
  • Decision: is this a real variant, or noise?
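The decision above can be sketched as a toy threshold rule. This is only an illustration under stated assumptions: the function name `toy_call`, the pileup layout (a list of `(base, base_quality)` tuples), and every threshold are hypothetical, and mapping quality and strand bias are ignored. Real callers such as GATK or DeepVariant use statistical or learned models, not a fixed cutoff like this.

```python
from collections import Counter


def toy_call(ref_base, pileup, min_base_qual=20, min_depth=8, min_alt_fraction=0.2):
    """Toy decision at one position. pileup is a list of (base, base_quality)."""
    # Keep only bases whose quality passes the (hypothetical) threshold.
    usable = [base for base, qual in pileup if qual >= min_base_qual]
    depth = len(usable)
    if depth < min_depth:
        return "no-call"  # too little evidence either way

    # Count non-reference bases among the usable reads.
    counts = Counter(usable)
    alt_counts = {base: n for base, n in counts.items() if base != ref_base}
    if not alt_counts:
        return "ref"

    # Take the most frequent non-reference base as the candidate alternate.
    alt_base, alt_n = max(alt_counts.items(), key=lambda kv: kv[1])
    if alt_n / depth >= min_alt_fraction:
        return f"variant {ref_base}>{alt_base}"
    return "ref"  # the few alt reads are treated as noise


# Reference is A, thirty good-quality reads all show T.
print(toy_call("A", [("T", 35)] * 30))
```

A single stray T among thirty reads falls below the alt-fraction cutoff and is reported as reference, which is the "noise" branch of the decision.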
§ 2

VCF File Structure

A VCF is a plain text file

  • One line per variant
  • Each line: where the variant is, what changes, how confident
  • Tab-separated columns
  • Header lines start with #

Nine core columns

Column   Meaning
CHROM    Chromosome (chr1, chrX, ...)
POS      Position (1-based)
ID       Identifier (rsXXXX or .)
REF      Reference allele
ALT      Alternate allele
QUAL     Quality score
FILTER   PASS or failure reason
INFO     Site-level metadata
FORMAT   Per-sample field definitions

One line · read it out

chr1  1234567  rs12345  A  T  99.0  PASS  AC=3;AN=6;AF=0.5  GT:DP  0/1:32
  • chr1 pos 1,234,567 · ref A · alt T
  • Quality 99 · passes filters · rsID rs12345
  • 3 alternate alleles in 6 total (AF = 50%)
  • This sample is heterozygous · 32-read depth
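Reading the line out by hand maps directly onto a few string splits. The sketch below is a minimal illustration, not a replacement for a real VCF library: the function name `parse_vcf_line` and the sample name `sample1` are hypothetical, and it assumes a tab-separated data line that includes FORMAT and at least one sample column. Values are left as strings apart from POS and QUAL.

```python
def parse_vcf_line(line, sample_names):
    """Split one tab-separated VCF data line into a nested dict."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, var_id, ref, alt, qual, filt, info, fmt = fields[:9]

    # INFO is a semicolon-separated list of KEY=VALUE pairs or bare flags.
    info_dict = {}
    if info != ".":
        for item in info.split(";"):
            key, sep, value = item.partition("=")
            info_dict[key] = value if sep else True

    # FORMAT names the per-sample fields; each sample column follows that order.
    keys = fmt.split(":")
    samples = {
        name: dict(zip(keys, column.split(":")))
        for name, column in zip(sample_names, fields[9:])
    }

    return {
        "chrom": chrom,
        "pos": int(pos),
        "id": var_id,
        "ref": ref,
        "alt": alt.split(","),  # ALT may list several alleles
        "qual": None if qual == "." else float(qual),
        "filter": filt,
        "info": info_dict,
        "samples": samples,
    }


line = "chr1\t1234567\trs12345\tA\tT\t99.0\tPASS\tAC=3;AN=6;AF=0.5\tGT:DP\t0/1:32"
record = parse_vcf_line(line, ["sample1"])
print(record["chrom"], record["pos"], record["info"], record["samples"])
```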
§ 3

The INFO Field ·
Population Statistics

Three fundamental fields

Field                   Meaning
AC · Allele Count       Number of alternate alleles observed
AN · Allele Number      Total alleles successfully called · ideally 2 × N for N diploid samples
AF · Allele Frequency   AC ÷ AN

A concrete example

  • Sequence 100 people → 200 alleles total
  • 10 alternate alleles observed

AC = 10 · AN = 200 · AF = 10 ÷ 200 = 0.05

Why these three numbers matter

  • Tell you how common or rare a variant is
  • Help decide pathogenic vs benign · a variant that is common in the population is unlikely to cause a rare, severe disease
  • Feed directly into GWAS, burden tests, population genetics
§ 4

Genotypes ·
FORMAT and GT

The FORMAT column

  • Defines what information appears for each sample
  • Colon-separated list of field names
  • Example: GT:DP:AD
  • Each sample column follows the same template

Genotype codes · the heart of VCF

Code   Meaning
0/0    Homozygous reference · no variant
0/1    Heterozygous · one ref, one alt
1/1    Homozygous alternate · both carry variant
./.    Missing · genotype not called

Phased vs unphased

Unphased · /

  • Know which alleles are present
  • Don't know which came from mom vs dad
  • The default from variant calling

Phased · |

  • Know which allele is on which chromosome
  • Essential for haplotype work
  • Requires extra inference or family data
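The genotype codes and the two separators can be decoded with a few lines of string handling. This is a sketch assuming diploid GT strings; the function name `describe_genotype` is hypothetical, and a partly missing genotype such as `./1` is simply reported as missing.

```python
def describe_genotype(gt):
    """Return (zygosity label, is_phased) for a diploid GT string."""
    # '|' marks a phased genotype, '/' an unphased one.
    phased = "|" in gt
    alleles = gt.replace("|", "/").split("/")

    if "." in alleles:
        return ("missing", phased)
    if all(allele == "0" for allele in alleles):
        return ("homozygous reference", phased)
    if len(set(alleles)) == 1:
        return ("homozygous alternate", phased)
    return ("heterozygous", phased)


for gt in ["0/0", "0/1", "1/1", "./.", "1|0"]:
    print(gt, describe_genotype(gt))
```

Note that `0|1` and `1|0` are both heterozygous but describe different haplotypes, which is exactly the information phasing adds.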

Supporting fields · DP and AD

Field   Meaning                     Example
DP      Total read depth            DP=32
AD      Allelic depth (ref, alt)    AD=16,16
  • 0/1 + AD=16,16 → balanced heterozygote · consistent call
  • 0/1 + AD=28,4 → suspicious · possible sequencing error
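The two cases above differ in allele balance, the fraction of reads that support the alternate allele. A minimal sketch of that check follows; the function names and the 0.25 to 0.75 window are illustrative assumptions, not a published filter, and the AD string is assumed to hold exactly two values (ref, alt).

```python
def alt_allele_fraction(ad):
    """Fraction of reads supporting the alt allele, from an 'ref,alt' AD string."""
    ref_n, alt_n = (int(x) for x in ad.split(","))
    total = ref_n + alt_n
    return alt_n / total if total else None


def het_looks_balanced(ad, low=0.25, high=0.75):
    """True if a heterozygous call has roughly even ref/alt support.

    The low/high window is a hypothetical threshold chosen for illustration.
    """
    fraction = alt_allele_fraction(ad)
    return fraction is not None and low <= fraction <= high


print(alt_allele_fraction("16,16"), het_looks_balanced("16,16"))
print(alt_allele_fraction("28,4"), het_looks_balanced("28,4"))
```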
§ 5

Multi-Sample VCF
and Joint Genotyping

Multi-sample VCF (msVCF)

  • One row per variant site
  • One column per sample
  • INFO summarises population · each sample column gives individual genotype
  • Same file · population and individual views simultaneously

Three samples · one variant

Person   Genotype                T alleles
Alice    0/1 · heterozygous      1
Bob      1/1 · homozygous alt    2
Carol    0/0 · homozygous ref    0

AC = 3 · AN = 6 · AF = 0.5
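The three INFO values follow mechanically from the sample genotypes. Here is a sketch under simple assumptions: one ALT allele per site, GT strings as input, and missing alleles (`.`) excluded from AN. The function name `allele_stats` is hypothetical.

```python
def allele_stats(genotypes):
    """Compute (AC, AN, AF) from a list of GT strings at one site."""
    ac = 0  # alternate alleles observed
    an = 0  # alleles successfully called
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue  # missing alleles do not count toward AN
            an += 1
            if allele != "0":
                ac += 1
    af = ac / an if an else None
    return ac, an, af


# Alice, Bob, Carol from the table above.
print(allele_stats(["0/1", "1/1", "0/0"]))
```

If one of the three genotypes were `./.`, AN would drop to 4, which is why AN is only "ideally" twice the number of samples.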

Variant ≠ Genotype

A variant is a site where an alternate allele exists in the population.
A genotype is an individual's specific allele combination.

One variant · many genotypes.
Variant = population-level · genotype = individual-level.

Joint genotyping

  • Call variants across all samples simultaneously
  • At every variant site, each sample gets an explicit call · including the many samples that are 0/0
  • Prevents missing data from corrupting AC / AN / AF
  • Standard for gnomAD, UK Biobank, All of Us

The attendance analogy

Without joint genotyping

  • Only write down who answered present
  • Don't know who was absent
  • Incomplete information

With joint genotyping

  • Full roster of who is present or absent
  • Complete information
  • Correct AF calculations
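The attendance analogy can be made concrete with a toy site. All data here are invented for illustration: three hypothetical samples S1, S2, S3, where S1 carries the variant, S2 is truly 0/0, and S3 had no usable coverage. Single-sample call sets only list sites where a sample differs from the reference, so S2 and S3 look identical (both absent) and either guess about them gives the wrong AF.

```python
def allele_frequency(genotypes):
    """AF from GT strings, skipping missing alleles."""
    called = [a for gt in genotypes for a in gt.split("/") if a != "."]
    if not called:
        return None
    return sum(a != "0" for a in called) / len(called)


site = "chr1:1234567"

# Without joint genotyping: each sample lists only its own variant sites.
single_sample_calls = {"S1": {site: "0/1"}, "S2": {}, "S3": {}}

# With joint genotyping: every sample gets an explicit call at the site.
joint_calls = {"S1": "0/1", "S2": "0/0", "S3": "./."}


def merged(calls, site, absent_as):
    """Naive merge: fill every absent sample with the same assumed genotype."""
    return [per_sample.get(site, absent_as) for per_sample in calls.values()]


af_absent_as_missing = allele_frequency(merged(single_sample_calls, site, "./."))
af_absent_as_ref = allele_frequency(merged(single_sample_calls, site, "0/0"))
af_joint = allele_frequency(list(joint_calls.values()))

print(af_absent_as_missing, af_absent_as_ref, af_joint)
```

Treating absence as missing overestimates AF (1 of 2 alleles), treating it as reference underestimates it (1 of 6), and only the joint call set gives 1 of 4.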
§ 6

Scaling ·
Dense vs Sparse

The scaling problem

  • UK Biobank · sequenced ~500,000 people
  • Dense VCF: one genotype entry per sample per site
  • For a rare variant: 499,990 samples are 0/0
  • Storing all those 0/0s is wasteful

Sparse VCF

  • Store only non-reference genotypes
  • 0/0 entries are omitted, implied from absence
  • Population stats (AC, AN, AF) still in INFO · site-level frequencies are preserved, provided missing calls (./.) are stored explicitly rather than omitted
  • For 500k samples · rare variant stores ~10 entries, not 500,000
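The idea of omitting 0/0 entries can be shown with an in-memory toy. This is not the spVCF, SAV, or SVCR encoding, just a dictionary keyed by sample index; the function names are hypothetical, genotypes are assumed unphased, and the carrier positions are made up. Missing calls (`./.`) are kept explicitly so they are not confused with an implied 0/0.

```python
def to_sparse(dense):
    """Keep only entries that are not homozygous reference."""
    return {i: gt for i, gt in enumerate(dense) if gt != "0/0"}


def to_dense(sparse, n_samples):
    """Rebuild the full genotype row; absent indices are implied 0/0."""
    return [sparse.get(i, "0/0") for i in range(n_samples)]


n_samples = 500_000
dense = ["0/0"] * n_samples
for carrier in range(0, n_samples, 50_000):  # 10 hypothetical carriers
    dense[carrier] = "0/1"

sparse = to_sparse(dense)
print(len(dense), len(sparse))
print(to_dense(sparse, n_samples) == dense)
```

The sparse row holds one entry per carrier, so its size tracks the number of non-reference genotypes rather than the number of samples.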

Dense vs sparse · a comparison

Aspect                      Dense                             Sparse
Storage                     All genotypes explicit            Only non-ref genotypes
File size                   Very large, superlinear growth    Linear with carriers
Best for                    10s–1000s of samples              Biobank scale (100k+)
Computational feasibility   Breaks at biobank scale           Enables modern genomics
§ 7

The Future of VCF

The fundamental challenge

  • Adding new samples means updating every existing row
  • File size grows superlinearly with sample count
  • gnomAD export in VCF = petabyte range
  • Doubling gnomAD → infeasible in standard VCF

GA4GH Future of VCF Working Group

  • Global Alliance for Genomics and Health · stewards the VCF standard
  • Future of VCF group formed 2019, meets monthly
  • Interoperability with htsget · Beacon · CRAM / SAM / BAM

Three approaches to scaling

  1. Specification tweaks · SAV, SVCR, spVCF · keep the text format, compress smartly
  2. New binary format · engineered for large scale and targeted queries
  3. API-based hybrid · query protocol returns data in any VCF form

Why this scaling effort matters

  • Enables reuse of large population-genomics datasets
  • Supports virtual cohorts — combining studies to gain statistical power
  • Especially valuable for rare disease research
§ 8

Why VCF Matters

Before VCF · the wild west

  • Every lab used its own format
  • Cross-study comparison was a nightmare
  • Custom converters for every pair of tools
  • Reproducibility was hard

After VCF · the universal standard

  • Directly compare variants across labs and technologies
  • gnomAD, ClinVar, dbSNP distribute in VCF
  • Clinical reporting · research · everything
  • The same format describes everything from one exome to one million genomes · with sparse encodings needed at the largest scales
§ 9

Summary

What to take away

  • Pipeline: reads → BAM → VCF
  • VCF = plain text · nine core columns · one line per variant
  • INFO: AC · AN · AF — population-level statistics
  • FORMAT + GT: 0/0 · 0/1 · 1/1 · ./.
  • Joint genotyping = complete data across all samples
  • Sparse VCF = store only non-reference genotypes · scales to biobank size

The bigger picture

VCF is how genomes become data — how biological molecules become numbers that computers can analyse, compare, and share.
  • From a single A→T change
  • …to biobanks with millions of variants across hundreds of thousands of people
  • Encoded as 0/0, 0/1, 1/1
End of Part IV

Allele frequency · population structure ·
linkage · recombination · VCF

Next: Part V · Functional Genetics