BSMS205 · Genetics

Population Structure

Chapter 21 · Part IV · Population Genetics

Today's central question

Why do the same variants
have different frequencies
in different populations?

Variation is not randomly distributed

Same variant · common in one group · rare in another
Patterns are gradients, not sharp boundaries
History has left traces in our genomes
And we can read them

Why this matters

Scientifically

GWAS false positives if ancestry is not controlled
Pathogenic vs benign depends on population baseline

Ethically

Population ≠ "race"
Differences are gradients, not divisions

Roadmap for today

What is population structure?
Race · ethnicity · ancestry · population
How structure arises (five mechanisms)
Detecting it with PCA and UMAP
Concrete selection examples
Quantifying with F_ST
Why it matters in practice

§ 1

What Is
Population Structure?

A precise definition

Population structure = systematic differences in allele frequencies among groups within a species.

Detectable, but subtle
Most variation is within populations, not between
Not about categorising humans — about reading history

The out-of-Africa context

Humans evolved in Africa
Migration out began roughly sixty to one hundred thousand years ago
Geographic separation limited gene flow
Once gene flow is limited, populations diverge

Three forces drive divergence

Mutation — new variants arise independently in each population
Drift — random frequency changes, stronger in smaller populations
Selection — different environments favour different alleles

The result: smooth gradients · never sharp boundaries.

§ 2

Race · Ethnicity ·
Ancestry · Population

Four terms · very different meanings

Concept	What it is	Nature
Race	Socially defined by perceived traits	Social construct
Ethnicity	Cultural identity (language, tradition)	Cultural
Ancestry	Genetic lineage · where ancestors lived	Biological · probabilistic
Population	Group with shared gene flow	Biological · operational

Why "race" is not a genetic category

No set of genes cleanly separates humans into "races"
Skin colour is controlled by a handful of genes
Two people of the same "race" can be more different genetically
than either is from someone of a different "race"

What geneticists use instead

Ancestry

Probabilistic
50% European / 30% East Asian / 20% African
From allele frequency patterns

Population

Operational
Defined by gene flow
Used to correct GWAS bias

§ 3

How Structure
Arises

Geographic isolation and drift

Oceans, deserts, mountains limit gene flow
Once gene flow drops, drift accumulates differences
Longer isolation → more divergence
Even identical starting populations drift apart
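The last point can be checked with a small simulation. The sketch below is a minimal Wright-Fisher model, assuming constant population size, random mating, and no mutation, selection, or migration; the population size, generation count, and seed are illustrative choices, not values from the lecture.

```python
import random

def wright_fisher(p0, pop_size, generations, rng):
    """Return the allele-frequency trajectory of one population under pure drift."""
    n_copies = 2 * pop_size  # diploid population: 2N allele copies
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Each copy in the next generation is drawn at random from the current pool.
        count = sum(1 for _ in range(n_copies) if rng.random() < p)
        p = count / n_copies
        trajectory.append(p)
    return trajectory

rng = random.Random(21)  # fixed seed so the run is repeatable
# Two isolated populations start identical (p = 0.5) and exchange no migrants.
pop_a = wright_fisher(0.5, pop_size=100, generations=200, rng=rng)
pop_b = wright_fisher(0.5, pop_size=100, generations=200, rng=rng)
print(f"after 200 generations: A = {pop_a[-1]:.2f}, B = {pop_b[-1]:.2f}")
```

Rerunning with a larger `pop_size` shows the two trajectories staying closer together, which is the "stronger in smaller populations" property of drift.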

Founder effects

New population from a small founding group
Carries only a fraction of ancestral diversity
Some alleles absent · others over-represented by chance
Reduced diversity persists as the population grows
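A back-of-envelope calculation shows why small founding groups lose alleles. If founders are a random sample from a large source population, the chance that an allele at frequency p is missing from all 2N founder chromosomes is (1 - p)^(2N). The sketch below assumes exactly that sampling model; the 1% allele frequency and the founder counts are illustrative.

```python
def prob_allele_lost(freq, n_founders):
    """Chance that none of the 2N founder chromosomes carries the allele."""
    return (1.0 - freq) ** (2 * n_founders)

# An allele at 1% frequency in the source population (illustrative value).
for n_founders in (10, 100, 1000):
    lost = prob_allele_lost(0.01, n_founders)
    print(f"{n_founders:>5} founders: P(allele absent) = {lost:.3f}")
```

The same sampling step also explains over-representation: an allele that does make it into a small founding group starts at a frequency of at least 1/(2N), which can be far above its frequency in the source.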

Case study · Finnish Disease Heritage

Population descended from ~four thousand founders, around four thousand years ago
Seventeenth-century famine → one-third of population lost
Thirty-six recessive disorders enriched in Finland
Each disease: a single founder mutation drifted to high frequency

Peltonen et al. 1999 · Norio 2003

Population bottlenecks

Population size drops sharply — disease, famine, migration
Rare alleles lost by chance during the squeeze
Recovered population has reduced diversity
Genomic signature: fewer rare variants, longer LD blocks

Admixture · the mosaic genome

Previously separated populations interbreed
Offspring inherit blocks of each ancestry
Modern examples: African Americans, Latino populations
A recent signature · easily detected in modern genomes

Selection and local adaptation

Different environments favour different alleles
Strong local selection → sharp frequency differences
Three iconic examples today:

Lactose tolerance in dairying populations
Sickle cell in malaria zones
Altitude adaptation in Tibetans

Fifth and final mechanism: natural selection creating local adaptation. Different environments favour different alleles. When selection is strong and local, it can drive sharp frequency differences between populations — much sharper than drift alone would create. We will spend time on three textbook examples later in this lecture: lactose tolerance in dairying populations, the sickle cell allele in malaria regions, and high-altitude adaptation in Tibetan highlanders. Each one shows selection reshaping a specific part of the genome in response to a specific environmental pressure. Together, these five mechanisms — isolation with drift, founder effects, bottlenecks, admixture, and local selection — are the full toolkit that created the pattern of human genetic variation we see today.

§ 4

Detecting Structure
with PCA and UMAP

The data problem

Each person: millions of SNPs
Each SNP is a dimension
We cannot visualise millions of dimensions
Need to compress to two or three dimensions

PCA · Principal Component Analysis

Finds new axes that capture the most variance
PC1: direction of largest genetic spread
PC2: next largest, perpendicular to PC1
Linear — fast, interpretable, widely used
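To make "axes that capture the most variance" concrete, here is a minimal sketch that finds PC1 for a toy genotype matrix. It uses power iteration on the SNP covariance matrix as a dependency-free stand-in for the SVD-based routines that real pipelines use, and it skips the per-SNP variance scaling that is common in practice. The genotypes and group labels are hypothetical.

```python
def first_pc_scores(genotypes, iterations=200):
    """Project each individual onto PC1, found by power iteration."""
    n, m = len(genotypes), len(genotypes[0])
    # Centre each SNP column so PC1 describes variance rather than the mean.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centred = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # SNP-by-SNP covariance matrix (m x m).
    cov = [[sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(m)]
           for a in range(m)]
    axis = [1.0 / (j + 1) for j in range(m)]  # arbitrary non-zero start vector
    for _ in range(iterations):
        # Multiply by the covariance matrix, then rescale to unit length.
        axis = [sum(cov[a][b] * axis[b] for b in range(m)) for a in range(m)]
        norm = sum(v * v for v in axis) ** 0.5
        axis = [v / norm for v in axis]
    return [sum(r[j] * axis[j] for j in range(m)) for r in centred]

# Hypothetical genotypes (0, 1 or 2 copies of the alternate allele) at 4 SNPs.
toy = [
    [2, 2, 0, 1], [2, 1, 0, 1], [1, 2, 0, 0],   # group A
    [0, 0, 2, 1], [0, 1, 2, 0], [1, 0, 1, 1],   # group B
]
scores = first_pc_scores(toy)
for label, score in zip("AAABBB", scores):
    print(label, round(score, 2))
```

The two groups land on opposite sides of zero along PC1 because the SNPs with the largest frequency differences between them dominate the variance. The sign of a principal component is arbitrary, so only the separation is meaningful.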

What PCA shows in humans

PC1 typically separates African vs non-African ancestry
(the oldest and deepest human split)
PC2 typically separates European vs East Asian
PC3+ captures finer substructure within continents

UMAP · when PCA is too linear

PCA is linear · finds straight-line directions
Real structure is sometimes curved
UMAP is nonlinear · prioritises local neighbourhoods while keeping some global structure
Better for seeing fine-scale sub-populations

gnomAD · human diversity on one figure

Figure: gnomAD UMAP showing global human genetic diversity · gnomAD v3.1 · ~141,000 genomes
Clusters are distinct but blend at the edges · continuous gradients, not sharp boundaries
Source: gnomAD, Broad Institute.

This is the figure I want you to remember from this chapter. It shows the UMAP projection of roughly one hundred forty-one thousand human genomes from gnomAD. Each dot is one individual, placed in two dimensions based on millions of genetic variants. You can see the major ancestry groups as coloured clusters: African in purple, European in blue, East Asian in pink, South Asian in green, Latino in orange. Three things to notice. First, the clusters are clearly distinct. Structure is real. Second, the clusters blend at the edges: admixed individuals occupy the space between groups, showing that variation is continuous and the boundaries are not sharp. Third, the African cluster is by far the most spread out. That is because African populations are the oldest and have accumulated the most genetic diversity. The non-African clusters are tighter because they all descend from the smaller founder groups that left Africa. One picture, a complete summary of human genetic history.

Why this tooling matters

Used as GWAS covariates — corrects ancestry bias
Reveals admixture and migration patterns
Demonstrates genetic variation is continuous, not categorical

§ 5

Selection in Action ·
Three Examples

Example 1 · Lactose tolerance

Variant rs4988235 upstream of LCT gene
Keeps lactase production active into adulthood
Northern Europe: seventy to ninety percent
East Asia: under ten percent
Under selection for ~seven thousand five hundred years

Tishkoff et al. 2007 · Bersaglieri et al. 2004

Example 2 · Sickle cell and malaria

HbS allele in HBB gene · single amino acid change
Sub-Saharan Africa (malaria zones): ten to twenty percent
Outside malaria zones: nearly absent
Heterozygotes: protection against falciparum malaria
Homozygotes: sickle cell disease

Piel et al. 2010 · Gong et al. 2015

Our second example is the sickle cell allele. This is a single amino acid change in the HBB gene, encoding the beta chain of haemoglobin. In sub-Saharan African populations where falciparum malaria is endemic, the sickle cell allele reaches frequencies of ten to twenty percent. Outside malaria zones, it is nearly absent. The reason is a classic case of balancing selection. Heterozygotes, people with one copy of the sickle cell allele and one copy of the normal allele, gain substantial protection against severe malaria. Homozygotes, people with two copies, suffer from sickle cell disease, which historically was fatal before modern medicine. So the allele is maintained at intermediate frequency by the opposing forces of malaria protection in heterozygotes and disease in homozygotes. The geographical distribution of the sickle cell allele closely mirrors the historical distribution of malaria, a beautiful confirmation of the balancing selection model.
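The intermediate equilibrium can be worked through with the standard one-locus model of heterozygote advantage. In the sketch below, the relative fitnesses are 1 - s for normal homozygotes (malaria cost), 1 for heterozygotes, and 1 - t for sickle cell homozygotes; the predicted equilibrium frequency of the sickle allele is s / (s + t). The values s = 0.15 and t = 1.0 are hypothetical, chosen only so the result falls in the ten to twenty percent range quoted above.

```python
def next_freq(q, s, t):
    """One generation of selection acting on the sickle allele frequency q."""
    p = 1.0 - q
    w_aa, w_as, w_ss = 1.0 - s, 1.0, 1.0 - t  # relative genotype fitnesses
    mean_fitness = p * p * w_aa + 2 * p * q * w_as + q * q * w_ss
    # Frequency of the sickle allele among surviving gametes.
    return (p * q * w_as + q * q * w_ss) / mean_fitness

s, t = 0.15, 1.0      # hypothetical selection coefficients
q = 0.01              # start with the allele rare
for _ in range(500):
    q = next_freq(q, s, t)

equilibrium = s / (s + t)
print(f"simulated q = {q:.4f}, predicted s/(s+t) = {equilibrium:.4f}")
```

Setting `s = 0` (no malaria) removes the heterozygote advantage, and the same recursion then drives the allele toward loss, matching its near absence outside malaria zones.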

Example 3 · Tibetan altitude adaptation

Living above four thousand metres — severe oxygen scarcity
Variants in EPAS1 and EGLN1 under strong selection
Lower haemoglobin → avoids polycythemia
EPAS1 haplotype: ~87% Tibetans vs ~9% Han Chinese
Inherited from Denisovans — archaic admixture

Beall 2010 · Yi 2010 · Huerta-Sánchez 2014

Our third example is Tibetan high-altitude adaptation. Tibetan highlanders live routinely above four thousand metres elevation, where oxygen is scarce. Variants in the genes EPAS1 and EGLN1 show the strongest selection signals in the Tibetan genome. These variants result in lower haemoglobin concentration in Tibetans, which paradoxically protects them from polycythemia, the dangerous overproduction of red blood cells associated with chronic mountain sickness. The striking number: the EPAS1 adaptive haplotype is present in about eighty-seven percent of Tibetans but only about nine percent of Han Chinese, who share very recent common ancestry. And here is the twist. That EPAS1 haplotype did not arise by new mutation in Tibetans. It was inherited from Denisovans, an archaic hominin species, through admixture tens of thousands of years ago. This is called adaptive introgression: borrowing a useful allele from another species. It means population structure, admixture, and selection all interact in the story of how humans adapted to new environments.

§ 6

Quantifying Difference:
F_ST

The Fixation Index

F_ST measures how much allele frequencies
differ between populations relative to total variation.

F_ST = 0 · populations identical
F_ST = 1 · populations completely different
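One common way to compute this for a single biallelic SNP is Wright's heterozygosity form, F_ST = (H_T - H_S) / H_T, where H_T is the expected heterozygosity of the pooled population and H_S is the mean expected heterozygosity within populations. The sketch below assumes equal-sized populations and known allele frequencies; real estimators also correct for sample size. The last pair of frequencies is an illustrative choice within the lactase ranges quoted in Example 1.

```python
def fst(freqs):
    """Wright's F_ST for one biallelic SNP, assuming equal-sized populations."""
    p_bar = sum(freqs) / len(freqs)
    h_total = 2 * p_bar * (1 - p_bar)                            # pooled heterozygosity
    h_within = sum(2 * p * (1 - p) for p in freqs) / len(freqs)  # mean within-population
    if h_total == 0:
        return 0.0  # monomorphic everywhere: nothing to partition
    return (h_total - h_within) / h_total

print(fst([0.5, 0.5]))    # identical populations
print(fst([0.0, 1.0]))    # fixed for different alleles
print(fst([0.4, 0.6]))    # modest frequency difference
print(fst([0.80, 0.05]))  # illustrative lactase-like contrast
```

The first two calls reproduce the endpoints on the slide. The third shows that a twenty-point frequency difference still gives a small F_ST, which is why most of the genome sits at low values.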

The striking human number

0.05 – 0.15

typical F_ST between continental groups

85 – 95% of human genetic variation exists within populations,
not between them.

F_ST varies across the genome

Most regions: low F_ST — similar across populations
Some regions: high F_ST — differentiated by selection
Example: skin pigmentation genes between African and European populations
High-F_ST scans → candidate regions of local adaptation

§ 7

Why It Matters
in Practice

Correcting GWAS · the main practical use

Cases and controls with different ancestry ratios → false positives
Any ancestry-differentiated variant looks associated
Fix: include leading PCs as covariates
Without correction: GWAS is flooded with spurious hits
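The effect of the covariate fix can be seen in a small simulation. In the sketch below, the variant has no causal effect on disease: two populations simply differ in both allele frequency and disease prevalence. A known population label stands in for PC1 (an assumption; real analyses estimate PCs from genome-wide genotypes and fit them in a regression model), and adjustment is done by regressing the label out of both genotype and phenotype before correlating. All frequencies, prevalences, and sample sizes are hypothetical.

```python
import random

def correlation(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def residualise(values, covariate):
    """Remove the linear effect of one covariate from values."""
    n = len(values)
    mc, mv = sum(covariate) / n, sum(values) / n
    slope = (sum((c - mc) * (v - mv) for c, v in zip(covariate, values))
             / sum((c - mc) ** 2 for c in covariate))
    return [v - mv - slope * (c - mc) for c, v in zip(covariate, values)]

rng = random.Random(7)  # fixed seed so the run is repeatable
ancestry, genotype, disease = [], [], []
# (population label, allele frequency, disease prevalence): all hypothetical.
for pop, allele_freq, prevalence in ((0, 0.1, 0.1), (1, 0.7, 0.4)):
    for _ in range(2000):
        ancestry.append(pop)
        genotype.append(sum(rng.random() < allele_freq for _ in range(2)))
        disease.append(1 if rng.random() < prevalence else 0)  # independent of genotype

naive = correlation(genotype, disease)
adjusted = correlation(residualise(genotype, ancestry), residualise(disease, ancestry))
print(f"naive r = {naive:.3f}, ancestry-adjusted r = {adjusted:.3f}")
```

The naive correlation is clearly positive even though genotype never influences disease in the simulation, while the adjusted correlation sits near zero. That gap is the spurious signal that PC covariates are meant to remove.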

Other applications

Tracing human history — migration, admixture, ancient splits
Variant interpretation — common-in-ancestry = likely benign
Equity — polygenic scores trained on one ancestry fail in others
Multi-ancestry cohorts → accuracy and fairness

Beyond GWAS correction, population structure has three other major applications. First, tracing human history. Migration routes, admixture events, and ancient population splits are all readable from modern genomes. Population genetics is an archaeological tool. Second, variant interpretation in clinical genetics. A variant that is common in healthy individuals of one ancestry is almost certainly benign, even if it is rare in the global population. Without population-specific frequency data, you will misinterpret variants. Third, equity. Most early GWAS and polygenic score studies were conducted on Europeans. Those scores do not transfer well to other ancestries. Expanding to multi-ancestry cohorts improves both statistical accuracy and fairness, ensuring that people of all ancestries benefit from genomic medicine.

§ 8

Summary

What to take away

Population structure = systematic frequency differences among groups
It arises from drift, founder effects, bottlenecks, admixture, selection
PCA / UMAP show variation is continuous, not categorical
F_ST for continental groups: 0.05 – 0.15 — most variation is within
Race ≠ population. Ancestry ≠ race.

Next lecture

How do variants travel together
on chromosomes?

Chapter 22 · From Mendel to Morgan — Discovery of Linkage