BSMS205 · Genetics

Introduction to
GWAS

Chapter 18 · Part III · Complex Traits

A question to start with

We know genes shape height.
But which genes?

The needle-in-a-haystack problem

The genome has ~3 billion base pairs
~10 million common variants segregate in humans
Most have tiny effect on any one trait
Pre-GWAS methods could only chase large-effect Mendelian genes

Need: a way to scan the whole genome, unbiased, at once.

The core idea

Compare allele frequencies
in cases vs controls —
for millions of SNPs at once.

What we cover today

What GWAS is — and what it tests
Why the threshold is 5 × 10⁻⁸
How to read a Manhattan plot
How to read a QQ plot
Linkage disequilibrium · lead SNP ≠ causal SNP
Biobanks · ancestry · the DCM case study
What it gives us — and what it doesn't

§ 1

What
GWAS Is

SNPs · the alphabet of variation

SNP = Single Nucleotide Polymorphism
One-letter difference at a fixed genomic position
Common: minor allele frequency > 1%
Not the rare disease-causing mutations of Mendelian disorders

The everyday genetic differences between people.

Two flavors of GWAS

Case–control

People with vs without a disease
Test allele frequency difference
Effect size = odds ratio (OR)
e.g. heart disease, schizophrenia

Quantitative trait

Continuous measure across people
Test allele effect on the value
Effect size = beta (β)
e.g. height, BMI, cholesterol
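
A minimal sketch of the two effect-size measures, using only invented toy counts and heights. The function names `odds_ratio` and `beta_per_allele` are illustrative, and a real GWAS would fit logistic or linear regression with covariates rather than these bare formulas.

```python
def odds_ratio(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Allelic odds ratio from allele counts in cases and controls."""
    return (case_alt / case_ref) / (ctrl_alt / ctrl_ref)


def beta_per_allele(dosages, values):
    """Least-squares slope of trait value on allele dosage (0, 1 or 2)."""
    n = len(dosages)
    mean_x = sum(dosages) / n
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, values))
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    return sxy / sxx


# Case-control toy data: 1,200 of 4,000 case alleles are the alternate
# allele, versus 1,000 of 4,000 control alleles.
or_example = odds_ratio(1200, 2800, 1000, 3000)

# Quantitative toy data: heights (cm) for six people with 0, 1 or 2 copies.
dosages = [0, 0, 1, 1, 2, 2]
heights = [169.8, 170.2, 170.3, 170.7, 170.8, 171.2]
beta_example = beta_per_allele(dosages, heights)

print(f"OR   = {or_example:.2f}")
print(f"beta = {beta_example:.2f} cm per allele")
```

An OR above 1 means the allele is more common in cases; a positive β means each extra copy raises the measured value.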

How do we measure SNPs?

SNP arrays

~500,000 to 1 million SNPs probed
Cheap: ~$30–50 per sample
Imputation infers the rest

Whole-genome sequencing

Reads every base
~$200–1,000 per sample
Captures rare variants too

Most GWAS still use arrays + imputation: affordable at biobank scale.

The single test, repeated millions of times

For every SNP i: test whether allele frequency differs between cases and controls
Output: a p-value — between 0 and 1
Small p = data this extreme would be unlikely if the SNP had no effect
That is all a single GWAS test does
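
The single test above can be sketched as a chi-square test on a 2×2 table of allele counts. This is one simple choice of test, shown under assumptions: the counts are invented, the function name `allelic_chi2_test` is illustrative, and real pipelines usually use regression models with covariates instead.

```python
import math


def allelic_chi2_test(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-square test (1 df) on a 2x2 table of allele counts."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    row_totals = [sum(row) for row in table]
    col_totals = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    total = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Upper tail of chi-square with 1 df: P(Z^2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Same allele frequency in both groups: no evidence against the null.
chi2_null, p_null = allelic_chi2_test(1000, 3000, 1000, 3000)

# 30% vs 25% alternate-allele frequency in 4,000 alleles per group.
chi2_hit, p_hit = allelic_chi2_test(1200, 2800, 1000, 3000)

print(f"null SNP: chi2 = {chi2_null:.2f}, p = {p_null:.3g}")
print(f"toy hit:  chi2 = {chi2_hit:.2f}, p = {p_hit:.3g}")
```

With these toy counts the second SNP has a very small p-value by everyday standards, yet it still falls short of 5 × 10⁻⁸.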

§ 2

Why
5 × 10⁻⁸?

The multiple testing problem

Test 1 SNP at p < 0.05: 5% false-positive rate — fine
Test 1,000,000 SNPs at p < 0.05: expect 50,000 false positives
You'd find "associations" where there are none
The naïve threshold is useless at genome scale
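
A small simulation of the problem, assuming every SNP is null so that each p-value is an independent uniform draw. The seed and variable names are arbitrary choices for this sketch.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

n_snps = 1_000_000
alpha = 0.05

# Under the null, a p-value is a uniform random number on [0, 1].
false_positives = sum(random.random() < alpha for _ in range(n_snps))
expected = n_snps * alpha

print(f"expected false positives: {expected:,.0f}")
print(f"simulated false positives: {false_positives:,}")
```

The simulated count lands close to 50,000 even though no SNP has any effect.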

The Bonferroni correction

α_genome-wide = α / N_tests = 0.05 / 1,000,000 = 5 × 10⁻⁸

~1 million independent common-variant tests in the genome
Divide α = 0.05 by 1,000,000 → keeps the chance of even one false positive (family-wise error rate) at ≈ 5%
Conservative · but reproducible across studies
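
The arithmetic behind the correction, as a short sketch. It assumes the tests are independent and all null, which is the idealized setting the Bonferroni argument uses.

```python
import math

alpha = 0.05
n_tests = 1_000_000

# Bonferroni: split the error budget evenly across all tests.
threshold = alpha / n_tests

# Chance of at least one false positive across all tests,
# computed as 1 - (1 - per_test_alpha) ** n_tests in a stable way.
fwer_corrected = -math.expm1(n_tests * math.log1p(-threshold))
fwer_naive = -math.expm1(n_tests * math.log1p(-alpha))

print(f"per-test threshold:       {threshold:.1e}")
print(f"family-wise error, 5e-8:  {fwer_corrected:.4f}")
print(f"family-wise error, 0.05:  {fwer_naive:.4f}")
```

At the corrected threshold the family-wise error stays just under 5%; at the naïve threshold a false positive is essentially guaranteed.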

The threshold

5 × 10⁻⁸

genome-wide significance · p-value

= 0.00000005 — five in one hundred million
= −log₁₀(p) ≈ 7.3 on the y-axis
Universal across modern GWAS

What "significant" actually means

A p < 5 × 10⁻⁸ result rejects the null
— it does not prove causation.

The SNP itself may not be the causal variant
Could be a nearby variant in linkage disequilibrium
Could reflect population stratification
Statistical association ≠ biological mechanism

§ 3

The
Manhattan Plot

The skyline of significance

x-axis: chromosomal position · chr 1 → chr 22
y-axis: −log₁₀(p) for each SNP
Each dot = one SNP
Dashed line at y ≈ 7.3 = the 5 × 10⁻⁸ threshold

Peaks above the line look like the Manhattan skyline.
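
A sketch of how the plot coordinates are built, before any drawing. The chromosome lengths and SNP results below are toy values, not real data, and the plotting step itself is left out.

```python
import math

# Hypothetical results: (chromosome, base-pair position, p-value).
results = [
    (1, 1_000_000, 0.20),
    (1, 5_000_000, 3e-9),
    (2, 2_000_000, 0.04),
    (2, 8_000_000, 1e-6),
]

# Toy chromosome lengths (real ones are far longer).
chrom_lengths = {1: 10_000_000, 2: 9_000_000}

# Lay chromosomes end to end: each one starts where the previous ended.
offsets = {}
running = 0
for chrom in sorted(chrom_lengths):
    offsets[chrom] = running
    running += chrom_lengths[chrom]

threshold_y = -math.log10(5e-8)  # height of the dashed line

# One (x, y) point per SNP.
points = [(offsets[c] + pos, -math.log10(p)) for c, pos, p in results]
significant = [(x, y) for x, y in points if y > threshold_y]

for x, y in points:
    flag = "*" if y > threshold_y else " "
    print(f"{flag} x = {x:>10,}  y = {y:5.2f}")
```

Only the SNP with p = 3 × 10⁻⁹ clears the line in this toy set.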

Why −log₁₀? · scaling tiny numbers

p-value	−log₁₀(p)	Interpretation
0.05	1.3	Conventional · noise at GWAS scale
10⁻⁵	5.0	Suggestive
5 × 10⁻⁸	7.3	Genome-wide significant
10⁻¹⁰	10.0	Strongly significant
10⁻²⁰	20.0	Massive — large effect or huge N
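
The middle column of the table can be reproduced in a few lines:

```python
import math

p_values = [0.05, 1e-5, 5e-8, 1e-10, 1e-20]

# Rounded to one decimal place, as in the table.
neg_log10 = [round(-math.log10(p), 1) for p in p_values]

for p, y in zip(p_values, neg_log10):
    print(f"p = {p:<7.0e} -> -log10(p) = {y}")
```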

A real Manhattan plot · DCM

**Figure 1.** Manhattan plot from a multi-ancestry GWAS of dilated cardiomyopathy (DCM). **14,256 cases** · **1,199,156 controls** · **80 genome-wide significant loci**. Red = newly discovered loci, orange = previously reported. *Zheng et al. 2024, Nature Genetics.*

How to read a peak

A peak = many neighboring SNPs all significant
The top SNP in a peak is called the lead SNP
Lead SNP is reported · but rarely the causal variant itself
Width of the peak = extent of LD in that region
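
A simplified sketch of picking one lead SNP per peak. It groups hits by physical distance only; this is a stand-in, since real pipelines usually clump by LD (r²) as well. The SNP IDs, the 500 kb window, and the function name `lead_snps` are all hypothetical.

```python
def lead_snps(hits, window=500_000):
    """Pick the smallest-p SNP in each region.

    hits: list of (snp_id, chromosome, position, p_value).
    A SNP becomes a lead only if no stronger lead sits within `window`
    base pairs on the same chromosome.
    """
    leads = []
    for snp in sorted(hits, key=lambda h: h[3]):  # strongest first
        _, chrom, pos, _ = snp
        far_from_all = all(
            chrom != lead_chrom or abs(pos - lead_pos) > window
            for _, lead_chrom, lead_pos, _ in leads
        )
        if far_from_all:
            leads.append(snp)
    return leads


# Toy genome-wide significant hits: three neighbors on chr 1, one on chr 6.
hits = [
    ("rsA", 1, 1_000_000, 1e-9),
    ("rsB", 1, 1_050_000, 1e-12),
    ("rsC", 1, 1_100_000, 1e-10),
    ("rsD", 6, 30_000_000, 2e-8),
]

leads = lead_snps(hits)
print([snp_id for snp_id, *_ in leads])
```

The three chr 1 hits collapse to a single reported locus led by the SNP with the smallest p.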

§ 4

The QQ Plot —
Are Hits Real?

The idea · expected vs observed

If no SNP truly affected the trait → p-values are uniform on [0, 1]
Plot expected −log₁₀(p) on x-axis · observed −log₁₀(p) on y-axis
All-null world: points fall on the diagonal y = x
Real signals: tail rises above diagonal — only at the right edge
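
A sketch of how the QQ coordinates are computed, using simulated all-null p-values. The `(rank + 0.5) / n` plotting position is one common convention and is an assumption here, as is the function name `qq_points`.

```python
import math
import random


def qq_points(p_values):
    """Return (expected, observed) -log10(p) pairs, most significant first."""
    n = len(p_values)
    ordered = sorted(p_values)  # smallest p-value first
    return [
        (-math.log10((rank + 0.5) / n), -math.log10(p))
        for rank, p in enumerate(ordered)
    ]


random.seed(1)  # fixed seed so the sketch is repeatable
# 1 - random() lies in (0, 1], which avoids taking log10 of zero.
null_p = [1.0 - random.random() for _ in range(10_000)]

points = qq_points(null_p)
mid_expected, mid_observed = points[len(points) // 2]
print(f"middle of the distribution: x = {mid_expected:.3f}, y = {mid_observed:.3f}")
```

In this all-null simulation the bulk of the points sit on the diagonal, which is the baseline a real QQ plot is compared against.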

What a good QQ plot looks like

Healthy

Bulk on diagonal
Tail rises sharply at far right
Genomic inflation λ ≈ 1

Confounded

Whole distribution shifted up
Inflation across the board
λ well above 1 → suspect confounding, e.g. population stratification
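
A sketch of the usual way λ is computed: the median observed chi-square statistic divided by the median expected under the null (about 0.455 for 1 degree of freedom). The input p-values are synthetic, and `genomic_inflation` is an illustrative name.

```python
from statistics import NormalDist, median

CHI2_1DF_MEDIAN = 0.4549  # median of a chi-square distribution with 1 df


def genomic_inflation(p_values):
    """Lambda = median observed chi-square / median expected under the null."""
    norm = NormalDist()
    # A two-sided p-value maps back to chi-square (1 df) as z squared,
    # where z is the normal quantile at p / 2.
    chi2 = [norm.inv_cdf(p / 2) ** 2 for p in p_values]
    return median(chi2) / CHI2_1DF_MEDIAN


n = 1001
# Evenly spaced p-values stand in for a perfectly uniform null distribution.
uniform_p = [(i + 0.5) / n for i in range(n)]
lambda_null = genomic_inflation(uniform_p)

# Squaring makes every p-value smaller: inflation across the board.
shifted_p = [p ** 2 for p in uniform_p]
lambda_inflated = genomic_inflation(shifted_p)

print(f"uniform p-values: lambda = {lambda_null:.2f}")
print(f"shifted p-values: lambda = {lambda_inflated:.2f}")
```

Because λ uses the median, a handful of true hits in the tail barely moves it; a shift of the whole distribution does.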

Population stratification · the classic trap

Cases and controls drawn from different ancestries
Allele frequencies naturally differ between populations
Result: fake trait association at every ancestry-divergent SNP
Fix: include ancestry principal components as covariates, or use linear mixed models
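
A worked toy example of the trap. Two hypothetical populations, A and B, differ in allele frequency, and the allele has no effect on disease inside either one. All numbers are invented for illustration.

```python
def odds(freq):
    return freq / (1 - freq)


# Allele frequency in each population (same for cases and controls within it).
allele_freq = {"A": 0.10, "B": 0.50}

# Ancestry mix of the two study groups: cases mostly B, controls mostly A.
case_mix = {"A": 0.2, "B": 0.8}
control_mix = {"A": 0.8, "B": 0.2}

# Pooled allele frequencies when ancestry is ignored.
case_freq = sum(case_mix[pop] * allele_freq[pop] for pop in allele_freq)
control_freq = sum(control_mix[pop] * allele_freq[pop] for pop in allele_freq)
pooled_or = odds(case_freq) / odds(control_freq)

# Within each population, cases and controls share one frequency, so OR = 1.
within_or = {pop: odds(f) / odds(f) for pop, f in allele_freq.items()}

print(f"case allele frequency:    {case_freq:.2f}")
print(f"control allele frequency: {control_freq:.2f}")
print(f"pooled odds ratio:        {pooled_or:.2f}")
print(f"within-population ORs:    {within_or}")
```

The pooled odds ratio is far from 1 even though the allele does nothing; adjusting for ancestry removes the signal.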

Manhattan + QQ · the standard pair

Manhattan tells you where the signals are.
QQ tells you whether they are real.

Every GWAS paper shows both
If the QQ is bad, the Manhattan is meaningless
Read them together, in that order

§ 5

Linkage Disequilibrium
& Lead SNPs

What is linkage disequilibrium?

Variants that are physically close on a chromosome
...tend to be inherited together across generations
Their alleles are correlated across people
Genome partitions into LD blocks — regions of high correlation
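
The correlation between two variants is commonly summarized as r². A minimal sketch on toy phased haplotypes, where each tuple holds the alleles (0 or 1) carried at two SNPs on one chromosome copy; the function name and data are illustrative.

```python
def r_squared(haplotypes):
    """LD r^2 between two biallelic SNPs from phased haplotypes."""
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # allele 1 frequency at SNP 1
    p_b = sum(b for _, b in haplotypes) / n   # allele 1 frequency at SNP 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # departure from independence
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


# Alleles always travel together: perfect LD.
perfect = [(1, 1)] * 4 + [(0, 0)] * 6

# Every combination equally common: no LD.
independent = [(1, 1), (1, 0), (0, 1), (0, 0)]

r2_perfect = r_squared(perfect)
r2_independent = r_squared(independent)
print(f"perfect LD:  r^2 = {r2_perfect:.2f}")
print(f"independent: r^2 = {r2_independent:.2f}")
```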

Why peaks have width

If the causal SNP is associated with the trait...
...every SNP in LD with it is also associated
The whole LD block lights up — that is the peak
Lead SNP is just the top tag, not necessarily the cause

GWAS hits a region, not a variant.

Lead SNP ≠ causal SNP

What you see

Lead SNP: smallest p in the peak
Reported in the paper
Usually non-coding

What it means

"Causal variant lives somewhere in this LD block"
Block can span 10 kb to 1 Mb
Often contains multiple genes

Effect sizes · how big is small?

Variant type	Trait	Effect size
Mendelian (e.g. BRCA1)	Breast cancer	OR ≈ 5–10
Common GWAS hit	Heart disease	OR ≈ 1.05–1.20
Common GWAS hit	Height	β ≈ 0.1–0.5 cm
Top FTO variant	BMI	β ≈ 0.4 kg/m²

Common variants: tiny individually, but thousands of them.

§ 6

Biobanks &
The Case Study

UK Biobank · the workhorse

~500,000 participants in the UK
Genotyped on a ~800,000-SNP array · imputed to ~90 million
Linked to health records, hospital codes, deaths, imaging
Powers thousands of GWAS papers — every year

Why scale matters · power

Effect sizes are tiny → need massive N to detect them
Doubling N → about 4× more genome-wide hits (rough scaling)
2008 height GWAS: 20 hits · 2022 height GWAS: ~12,000 hits
The whole field is N-limited
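
A rough power sketch for one additive SNP on a standardized quantitative trait. It uses a normal approximation in which the test statistic is shifted by the square root of N · 2p(1−p) · β², with β in standard-deviation units. The allele frequency, effect size, and function name `gwas_power` are assumptions chosen for illustration.

```python
import math
from statistics import NormalDist


def gwas_power(n, maf, beta_sd, alpha=5e-8):
    """Approximate power of a two-sided test at significance level alpha."""
    norm = NormalDist()
    z_crit = -norm.inv_cdf(alpha / 2)  # critical value, about 5.45 for 5e-8
    ncp = n * 2 * maf * (1 - maf) * beta_sd ** 2  # non-centrality parameter
    shift = math.sqrt(ncp)
    return norm.cdf(shift - z_crit) + norm.cdf(-shift - z_crit)


sample_sizes = [50_000, 100_000, 200_000, 500_000]
# Hypothetical SNP: allele frequency 0.30, effect 0.02 SD per allele.
powers = [gwas_power(n, maf=0.30, beta_sd=0.02) for n in sample_sizes]

for n, power in zip(sample_sizes, powers):
    print(f"N = {n:>7,}: power = {power:.3f}")
```

For this toy SNP, detection goes from nearly impossible at N = 50,000 to nearly certain at N = 500,000.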

The ancestry problem

~80% of GWAS participants are of European ancestry
LD patterns & allele frequencies differ across populations
European-trained results transfer poorly to other groups
Solution: trans-ancestry GWAS — diverse cohorts

Case study · DCM in 1.2 million people

Dilated cardiomyopathy (DCM): heart muscle weakens & enlarges
14,256 cases · 1,199,156 controls · multi-ancestry
80 loci reach 5 × 10⁻⁸
Genes near hits: MAP3K7, NEDD4L, SSPN
Hits cluster in muscle structure, cell adhesion, ECM pathways

Functional convergence · 80 loci, 3 themes

**Figure 2.** Pathway and cell-type enrichment of DCM-associated genes. Eighty scattered DCM loci converge on a small number of pathways (sarcomere structure, cell adhesion, extracellular matrix) and on specific cell types (cardiomyocytes, fibroblasts). *Zheng et al. 2024, Nature Genetics.*

From GWAS to risk prediction

**Figure 3.** DCM polygenic score predicts disease in UK Biobank. A polygenic score built from the 80 DCM loci, applied to UK Biobank: top 10% of PGS has **~2.8× the disease risk** of the bottom 10%. PGS modifies penetrance even in carriers of rare pathogenic variants. *Zheng et al. 2024, Nature Genetics.*
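
At its simplest, a polygenic score is a weighted sum of allele dosages, with weights taken from GWAS effect sizes. The sketch below uses hypothetical SNP IDs and weights; it is not the scoring method of the cited study.

```python
def polygenic_score(dosages, weights):
    """Sum of (effect weight x allele dosage) over the scored SNPs.

    dosages: SNP id -> copies of the effect allele carried (0, 1 or 2).
    weights: SNP id -> per-allele effect size (e.g. log odds ratio).
    SNPs missing from `dosages` are treated as 0 copies in this sketch.
    """
    return sum(weight * dosages.get(snp, 0) for snp, weight in weights.items())


# Hypothetical per-allele weights for three SNPs.
weights = {"rs_hypo_1": 0.10, "rs_hypo_2": -0.05, "rs_hypo_3": 0.20}

# One person's genotype at those SNPs.
person = {"rs_hypo_1": 2, "rs_hypo_2": 1, "rs_hypo_3": 0}

score = polygenic_score(person, weights)
print(f"polygenic score = {score:.2f}")
```

Scores only become meaningful relative to a reference population, which is why results are reported as comparisons such as top 10% versus bottom 10%.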

§ 7

What GWAS Gives —
and What It Doesn't

What GWAS gives

An unbiased scan of common variation
A list of genomic loci (not exact variants)
Effect sizes and direction for each lead SNP
Pathways and cell types via downstream analysis
Polygenic scores for risk stratification

What GWAS doesn't give

The causal variant within an LD block
The mechanism by which a variant acts
Effects of rare variants (MAF < 1%)
Non-additive interactions (epistasis, dominance)
Reliable predictions in under-sampled ancestries

§ 8

Summary

What to take away

GWAS = scan millions of SNPs for trait association
Threshold: 5 × 10⁻⁸ · Bonferroni for ~1M independent tests
Manhattan = where signals are · QQ = whether they're real
Lead SNP ≠ causal SNP — blame linkage disequilibrium
Biobanks (UK Biobank, ~500k) made the modern era possible

Five takeaways. One — GWAS scans millions of SNPs across the genome and tests each one for association with a trait, comparing cases versus controls or measuring quantitative effects. Two — the genome-wide significance threshold is five times ten to the minus eight, which comes from the Bonferroni correction for about one million independent common-variant tests. Three — the Manhattan plot tells you where the signals are; the QQ plot tells you whether they are real and not artifacts of population stratification. Four — the lead SNP at the top of a peak is rarely the variant that does the biology, because linkage disequilibrium drags whole regions along. Five — biobanks like UK Biobank, with about half a million participants, are what made today's GWAS scale possible.

Next lecture

GWAS gives us hits.
But what shape
does the architecture take?

Chapter 19 · Genetic Architecture of Complex Traits