BSMS205 · Genetics

Introduction to
GWAS

Chapter 18 · Part III · Complex Traits

A question to start with

We know genes shape height.
But which genes?

The needle-in-a-haystack problem

  • The genome has ~3 billion base pairs
  • ~10 million common variants segregate in humans
  • Most have tiny effect on any one trait
  • Pre-GWAS methods could only chase large-effect Mendelian genes
Need: a way to scan the whole genome, unbiased, at once.

The core idea

Compare allele frequencies
in cases vs controls
for millions of SNPs at once.

What we cover today

  1. What GWAS is — and what it tests
  2. Why the threshold is 5 × 10⁻⁸
  3. How to read a Manhattan plot
  4. How to read a QQ plot
  5. Linkage disequilibrium · lead SNP ≠ causal SNP
  6. Biobanks · ancestry · the DCM case study
  7. What it gives us — and what it doesn't
§ 1

What
GWAS Is

SNPs · the alphabet of variation

  • SNP = Single Nucleotide Polymorphism
  • One-letter difference at a fixed genomic position
  • Common: minor allele frequency > 1%
  • Not the rare disease-causing mutations of Mendelian disorders
The everyday genetic differences between people.

Two flavors of GWAS

Case–control

  • People with vs without a disease
  • Test allele frequency difference
  • Effect size = odds ratio (OR)
  • e.g. heart disease, schizophrenia

Quantitative trait

  • Continuous measure across people
  • Test allele effect on the value
  • Effect size = beta (β)
  • e.g. height, BMI, cholesterol
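For the quantitative-trait flavor, β can be read as the slope of the trait on allele count. A minimal sketch, assuming a simple additive model with no covariates; the function name and the toy heights are invented for illustration.

```python
def additive_beta(dosages, trait):
    """Least-squares slope of a trait on allele count (0, 1, or 2 copies)."""
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_y = sum(trait) / n
    cov = sum((g - mean_g) * (y - mean_y) for g, y in zip(dosages, trait))
    var = sum((g - mean_g) ** 2 for g in dosages)
    return cov / var


# Toy data (hypothetical): allele counts and heights in cm for six people
dosages = [0, 0, 1, 1, 2, 2]
heights = [170.0, 171.0, 170.5, 171.5, 171.0, 172.0]
beta = additive_beta(dosages, heights)  # cm of height per extra allele copy
```

Real analyses add covariates such as age, sex, and ancestry components; the slope alone only illustrates what β measures.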

How do we measure SNPs?

SNP arrays

  • ~500,000 to 1 million SNPs probed
  • Cheap: ~$30–50 per sample
  • Imputation infers the rest

Whole-genome sequencing

  • Reads every base
  • ~$200–1,000 per sample
  • Captures rare variants too

Most GWAS still use arrays + imputation, which stays affordable at biobank scale.

The single test, repeated millions of times

  • For every SNP i: test whether allele frequency differs between cases and controls
  • Output: a p-value — between 0 and 1
  • Small p = data unlikely if SNP had no effect
  • That is all a single GWAS test does
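The single test above can be sketched as a chi-square test on a 2×2 table of allele counts. This is one common choice, not the only one (logistic regression with covariates is typical in practice); the function name and the counts are hypothetical.

```python
import math


def allelic_test(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Chi-square test (1 df) on allele counts in cases vs controls.

    Returns (odds_ratio, chi2, p_value).
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    row_sums = [case_alt + case_ref, ctrl_alt + ctrl_ref]
    col_sums = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    total = sum(row_sums)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (case_alt * ctrl_ref) / (case_ref * ctrl_alt)
    return odds_ratio, chi2, p_value


# Toy SNP (hypothetical): 1,000 cases and 1,000 controls, two alleles each
odds_ratio, chi2, p_value = allelic_test(600, 1400, 500, 1500)
```

A GWAS repeats this call once per SNP and keeps the resulting p-values.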
§ 2

Why
5 × 10⁻⁸?

The multiple testing problem

  • Test 1 SNP at p < 0.05: 5% false-positive rate — fine
  • Test 1,000,000 SNPs at p < 0.05: expect 50,000 false positives
  • You'd find "associations" where there are none
  • The naïve threshold is useless at genome scale
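The arithmetic behind these bullets, as a small sketch. It assumes every test is truly null and, for the second function, that tests are independent; both function names are made up for this example.

```python
def expected_false_positives(n_tests, alpha):
    """Expected number of null tests that pass p < alpha."""
    return n_tests * alpha


def prob_any_false_positive(n_tests, alpha):
    """Chance of at least one false positive, assuming independent tests."""
    return 1 - (1 - alpha) ** n_tests


one_snp = expected_false_positives(1, 0.05)
genome = expected_false_positives(1_000_000, 0.05)
twenty_tests = prob_any_false_positive(20, 0.05)
```

Even 20 tests at p < 0.05 give a better-than-even chance of at least one false positive, which is why the naive threshold fails long before genome scale.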

The Bonferroni correction

$\alpha_{\text{genome-wide}} = \alpha / N_{\text{tests}} = 0.05 / 1{,}000{,}000 = 5 \times 10^{-8}$
  • ~1 million independent common-variant tests in the genome
  • Divide α = 0.05 by 1,000,000 → keeps the family-wise error rate (chance of any false positive) ≤ 5%
  • Conservative · but reproducible across studies

The threshold
5 × 10⁻⁸
genome-wide significance · p-value
  • = 0.00000005 — five in one hundred million
  • = −log₁₀(p) ≈ 7.3 on the y-axis
  • The standard convention across modern GWAS
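A quick check of the threshold and its position on the y-axis. This sketch assumes the ~1 million independent tests stated above; the function names are illustrative.

```python
import math


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that holds the family-wise error rate at alpha."""
    return alpha / n_tests


def neg_log10(p):
    """Transform used for plotting p-values."""
    return -math.log10(p)


threshold = bonferroni_threshold(0.05, 1_000_000)  # 5e-8
y_line = neg_log10(threshold)                      # about 7.3
```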

What "significant" actually means

A p < 5 × 10⁻⁸ result rejects the null
— it does not prove causation.
  • The SNP itself may not be the causal variant
  • Could be a nearby variant in linkage disequilibrium
  • Could reflect population stratification
  • Statistical association ≠ biological mechanism

§ 3

The
Manhattan Plot

The skyline of significance

  • x-axis: chromosomal position · chr 1 → chr 22
  • y-axis: −log₁₀(p) for each SNP
  • Each dot = one SNP
  • Dashed line at y ≈ 7.3 = the 5 × 10⁻⁸ threshold

Peaks above the line look like the Manhattan skyline.
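The plot coordinates can be sketched without any plotting library: chromosomes are laid end to end on the x-axis and each p-value becomes −log₁₀(p) on the y-axis. The function name, the integer chromosome keys, and the toy chromosome lengths are assumptions for illustration.

```python
import math


def manhattan_coordinates(snps, chrom_lengths):
    """Turn (chrom, position_bp, p_value) records into (x, y) plot points.

    chrom_lengths maps an integer chromosome number to its length in bp.
    """
    offsets = {}
    running = 0
    for chrom in sorted(chrom_lengths):
        offsets[chrom] = running          # where this chromosome starts on x
        running += chrom_lengths[chrom]
    return [(offsets[c] + pos, -math.log10(p)) for c, pos, p in snps]


# Toy genome (hypothetical lengths) and three SNPs
chrom_lengths = {1: 1000, 2: 800, 3: 600}
snps = [(1, 500, 0.05), (2, 100, 1e-5), (3, 50, 5e-8)]
coords = manhattan_coordinates(snps, chrom_lengths)
significant = [xy for xy in coords if xy[1] >= -math.log10(5e-8)]
```

Passing `coords` to any scatter-plot routine, plus a horizontal line at y ≈ 7.3, gives the skyline.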

Why −log₁₀? · scaling tiny numbers

p-value | −log₁₀(p) | Interpretation
0.05 | 1.3 | Conventional · noise at GWAS scale
10⁻⁵ | 5.0 | Suggestive
5 × 10⁻⁸ | 7.3 | Genome-wide significant
10⁻¹⁰ | 10.0 | Strongly significant
10⁻²⁰ | 20.0 | Massive: large effect or huge N

A real Manhattan plot · DCM

Manhattan plot from a multi-ancestry GWAS of dilated cardiomyopathy
Figure 1. Manhattan plot from a dilated cardiomyopathy (DCM) GWAS. 14,256 cases · 1,199,156 controls · 80 genome-wide significant loci. Red = newly discovered loci, orange = previously reported. Zheng et al. 2024, Nature Genetics.

How to read a peak

  • A peak = many neighboring SNPs all significant
  • The top SNP in a peak is called the lead SNP
  • Lead SNP is reported · but rarely the causal variant itself
  • Width of the peak = extent of LD in that region
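One simple way to turn a peak into a single reported lead SNP is distance-based pruning: take the most significant SNP, then skip any other hit within a fixed window of it. This is a sketch under that assumption; the 500 kb window, the function name, and the SNP labels are hypothetical, and real pipelines usually clump on measured LD rather than distance.

```python
def pick_lead_snps(snps, p_threshold=5e-8, window_bp=500_000):
    """Greedy lead-SNP selection from (snp_id, chrom, pos, p) records."""
    hits = sorted((s for s in snps if s[3] < p_threshold), key=lambda s: s[3])
    leads = []
    for snp in hits:
        _, chrom, pos, _ = snp
        # Skip SNPs that sit inside the window of an already chosen lead
        near_existing = any(
            chrom == lead_chrom and abs(pos - lead_pos) <= window_bp
            for _, lead_chrom, lead_pos, _ in leads
        )
        if not near_existing:
            leads.append(snp)
    return leads


# Toy results (hypothetical IDs): one peak on chr 1, one isolated hit, one miss
results = [
    ("snpA", 1, 1_000_000, 1e-9),
    ("snpB", 1, 1_050_000, 1e-12),
    ("snpC", 1, 1_100_000, 1e-8),
    ("snpD", 1, 5_000_000, 3e-8),
    ("snpE", 2, 1_000_000, 1e-6),
]
lead_ids = [s[0] for s in pick_lead_snps(results)]
```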
§ 4

The QQ Plot —
Are Hits Real?

The idea · expected vs observed

  • If no SNP truly affected the trait → p-values are uniform on [0, 1]
  • Plot expected −log₁₀(p) on x-axis · observed −log₁₀(p) on y-axis
  • All-null world: points fall on the diagonal y = x
  • Real signals: tail rises above diagonal — only at the right edge
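The expected-versus-observed pairing can be sketched as follows. It assumes the i-th smallest of n null p-values is expected at i / (n + 1), which is one common plotting convention among several, and that no p-value is exactly 0.

```python
import math


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, smallest p first."""
    n = len(p_values)
    ordered = sorted(p_values)
    return [
        (-math.log10((i + 1) / (n + 1)), -math.log10(p))
        for i, p in enumerate(ordered)
    ]


# All-null toy data: evenly spread p-values fall on the diagonal
null_points = qq_points([0.2, 0.4, 0.6, 0.8])

# Same data with one strong signal: only the right-edge point lifts off
signal_points = qq_points([1e-9, 0.4, 0.6, 0.8])
```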

What a good QQ plot looks like

Healthy

  • Bulk on diagonal
  • Tail rises sharply at far right
  • Genomic inflation λ ≈ 1

Confounded

  • Whole distribution shifted up
  • Inflation across the board
  • λ ≫ 1 → suspect population stratification or other confounding
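Genomic inflation λ is commonly computed as the median observed 1-df chi-square statistic divided by its null median (about 0.455). A minimal sketch under that definition, converting two-sided p-values back to chi-square values; the function name and the toy p-value lists are illustrative.

```python
from statistics import NormalDist, median


def genomic_inflation(p_values):
    """lambda = median observed chi-square (1 df) / null median (~0.455)."""
    norm = NormalDist()
    # A two-sided p-value p corresponds to z = inv_cdf(p / 2), chi2 = z ** 2
    chi2 = [norm.inv_cdf(p / 2) ** 2 for p in p_values]
    null_median = norm.inv_cdf(0.75) ** 2
    return median(chi2) / null_median


# Evenly spread p-values mimic an all-null scan
uniform = [(i + 0.5) / 1000 for i in range(1000)]
lam_null = genomic_inflation(uniform)

# Squaring shifts every p-value downward, mimicking across-the-board inflation
lam_inflated = genomic_inflation([p ** 2 for p in uniform])
```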

Population stratification · the classic trap

  • Cases and controls drawn from different ancestries
  • Allele frequencies naturally differ between populations
  • Result: fake trait association at every ancestry-divergent SNP
  • Fix: include principal components as covariates, or use linear mixed models
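A toy demonstration of the trap. Two invented populations differ in allele frequency (20% vs 60%) and in case fraction, but within each population the SNP has no effect at all. Pooling them manufactures a genome-wide significant hit. All counts and names are hypothetical.

```python
import math


def allele_chi2_p(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """p-value of a 1-df chi-square test on a 2x2 allele-count table."""
    rows = (case_alt + case_ref, ctrl_alt + ctrl_ref)
    cols = (case_alt + ctrl_alt, case_ref + ctrl_ref)
    total = sum(rows)
    observed = ((case_alt, case_ref), (ctrl_alt, ctrl_ref))
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return math.erfc(math.sqrt(chi2 / 2))


# (case_alt, case_ref, ctrl_alt, ctrl_ref) allele counts per population
pop_a = (80, 320, 320, 1280)   # 20% alt allele, mostly controls
pop_b = (960, 640, 240, 160)   # 60% alt allele, mostly cases
pooled = tuple(a + b for a, b in zip(pop_a, pop_b))

p_within_a = allele_chi2_p(*pop_a)
p_within_b = allele_chi2_p(*pop_b)
p_pooled = allele_chi2_p(*pooled)
```

Analysing each population separately removes the signal here; principal components and mixed models achieve the same adjustment when ancestry is continuous rather than two clean groups.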

Manhattan + QQ · the standard pair

Manhattan tells you where the signals are.
QQ tells you whether they are real.
  • Every GWAS paper shows both
  • If the QQ is bad, the Manhattan is meaningless
  • Read them together, in that order

§ 5

Linkage Disequilibrium
& Lead SNPs

What is linkage disequilibrium?

  • Variants that are physically close on a chromosome
  • ...tend to be inherited together across generations
  • Their alleles are correlated across people
  • Genome partitions into LD blocks — regions of high correlation
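The correlation between alleles is usually summarised as r². A sketch assuming phased haplotypes coded 0/1 at each SNP; the function name and the eight toy haplotypes are made up.

```python
def ld_r2(hap_a, hap_b):
    """Squared allele correlation between two SNPs across haplotypes."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    p_ab = sum(a * b for a, b in zip(hap_a, hap_b)) / n
    d = p_ab - p_a * p_b   # D: excess co-occurrence over independence
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Toy haplotypes (hypothetical): 1 = alternate allele
snp1 = [1, 1, 1, 1, 0, 0, 0, 0]
snp2 = [1, 1, 1, 1, 0, 0, 0, 0]   # always travels with snp1
snp3 = [1, 1, 0, 0, 1, 1, 0, 0]   # independent of snp1
snp4 = [1, 1, 1, 0, 0, 0, 0, 0]   # mostly travels with snp1

r2_full = ld_r2(snp1, snp2)
r2_none = ld_r2(snp1, snp3)
r2_partial = ld_r2(snp1, snp4)
```

A SNP with r² near 1 to the causal variant will show almost the same association signal, which is why whole blocks light up together.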

Why peaks have width

  • If the causal SNP is associated with the trait...
  • ...every SNP in LD with it is also associated
  • The whole LD block lights up — that is the peak
  • Lead SNP is just the top tag, not necessarily the cause
GWAS hits a region, not a variant.

Lead SNP ≠ causal SNP

What you see

  • Lead SNP: smallest p in the peak
  • Reported in the paper
  • Usually non-coding

What it means

  • "Causal variant lives somewhere in this LD block"
  • Block can span 10 kb to 1 Mb
  • Often contains multiple genes

Effect sizes · how big is small?

Variant type | Trait | Effect size
Mendelian (e.g. BRCA1) | Breast cancer | OR ≈ 5–10
Common GWAS hit | Heart disease | OR ≈ 1.05–1.20
Common GWAS hit | Height | β ≈ 0.1–0.5 cm
Top FTO variant | BMI | β ≈ 0.4 kg/m²

Common variants: tiny individually, but thousands of them.

§ 6

Biobanks &
The Case Study

UK Biobank · the workhorse

  • ~500,000 participants in the UK
  • Genotyped on a ~800,000-SNP array · imputed to ~90 million
  • Linked to health records, hospital codes, deaths, imaging
  • Powers thousands of GWAS papers — every year

Why scale matters · power

  • Effect sizes are tiny → need massive N to detect them
  • Larger N → more genome-wide hits (rough scaling, not an exact law)
  • 2008 height GWAS: ~20 hits · 2022 height GWAS: ~12,000 hits
  • The whole field is N-limited
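Why N matters can be sketched with a normal approximation to power. The assumptions are a standardised trait, an additive effect of β standard deviations per allele, and a test statistic that is roughly normal with mean √(N · 2p(1−p)) · β. The function name and the example effect size are hypothetical.

```python
import math
from statistics import NormalDist


def gwas_power(n, maf, beta_sd, alpha=5e-8):
    """Approximate power of a two-sided test at significance level alpha."""
    norm = NormalDist()
    shift = math.sqrt(n * 2 * maf * (1 - maf)) * abs(beta_sd)
    z_crit = -norm.inv_cdf(alpha / 2)
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


# Toy variant (hypothetical): 30% allele frequency, 0.02 SD per allele
power_10k = gwas_power(10_000, 0.3, 0.02)
power_100k = gwas_power(100_000, 0.3, 0.02)
power_500k = gwas_power(500_000, 0.3, 0.02)
```

Under these assumptions the same variant goes from essentially undetectable at 10,000 people to almost certain detection at 500,000.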

The ancestry problem

  • ~80% of GWAS participants are of European ancestry
  • LD patterns & allele frequencies differ across populations
  • European-trained results transfer poorly to other groups
  • Solution: trans-ancestry GWAS — diverse cohorts

Case study · DCM in 1.2 million people

  • Dilated cardiomyopathy (DCM): heart muscle weakens & enlarges
  • 14,256 cases · 1,199,156 controls · multi-ancestry
  • 80 loci reach 5 × 10⁻⁸
  • Genes near hits: MAP3K7, NEDD4L, SSPN
  • Hits cluster in muscle structure, cell adhesion, ECM pathways

Functional convergence · 80 loci, 3 themes

Pathway and cell-type enrichment of DCM-associated genes
Figure 2. Eighty scattered DCM loci converge on a small number of pathways — sarcomere structure, cell adhesion, extracellular matrix — and on specific cell types (cardiomyocytes, fibroblasts). Zheng et al. 2024, Nature Genetics.

From GWAS to risk prediction

DCM polygenic score predicts disease in UK Biobank
Figure 3. A polygenic score built from the 80 DCM loci, applied to UK Biobank: top 10% of PGS has ~2.8× the disease risk of the bottom 10%. PGS modifies penetrance even in carriers of rare pathogenic variants. Zheng et al. 2024, Nature Genetics.
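In its simplest form a polygenic score is a weighted sum of allele counts, with weights taken from GWAS effect sizes. A minimal sketch assuming an additive score; the SNP labels and weights are invented and are not values from the DCM study.

```python
def polygenic_score(dosages, weights):
    """Sum of per-allele effect sizes times allele counts (0, 1, or 2)."""
    return sum(weights[snp] * dosages[snp] for snp in weights)


# Hypothetical per-allele weights (e.g. log odds ratios from a GWAS)
weights = {"snp1": 0.10, "snp2": -0.05, "snp3": 0.20}

# One person's allele counts at the same SNPs
person = {"snp1": 2, "snp2": 1, "snp3": 0}
score = polygenic_score(person, weights)
```

Ranking people by this score and comparing the top and bottom deciles gives the kind of risk contrast described in the figure.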
§ 7

What GWAS Gives —
and What It Doesn't

What GWAS gives

  • An unbiased scan of common variation
  • A list of genomic loci (not exact variants)
  • Effect sizes and direction for each lead SNP
  • Pathways and cell types via downstream analysis
  • Polygenic scores for risk stratification

What GWAS doesn't give

  • The causal variant within an LD block
  • The mechanism by which a variant acts
  • Effects of rare variants (MAF < 1%)
  • Non-additive interactions (epistasis, dominance)
  • Reliable predictions in under-sampled ancestries

§ 8

Summary

What to take away

  • GWAS = scan millions of SNPs for trait association
  • Threshold: 5 × 10⁻⁸ · Bonferroni for ~1M independent tests
  • Manhattan = where signals are · QQ = whether they're real
  • Lead SNP ≠ causal SNP — blame linkage disequilibrium
  • Biobanks (UK Biobank, ~500k) made the modern era possible

Next lecture

GWAS gives us hits.
But what shape
does the architecture take?

Chapter 19 · Genetic Architecture of Complex Traits