BSMS205 · Genetics

From Single Reference
to Pangenome

Chapter 4 · Part I · The Human Genome

A question to start with

We finally have a complete
reference. Is one enough?

How different are two human genomes?

Millions of single-letter differences — SNPs
Larger chunks inserted, deleted, duplicated, flipped — SVs
Gene copy numbers vary between individuals
Some sequences exist only in certain populations

Two stories that show the cost

Drug response

Insertion found in East Asian populations
Changes protein shape
Missing from reference
Why does the drug behave differently?

Diagnosis

African ancestry patient
Large deletion in a gene
Disease — or normal variant?
Single reference cannot say

Roadmap for today

Why one reference is not enough
What a pangenome is — graph vs line
The HPRC · 47 individuals · 94 haplotypes
What the pangenome revealed
Real-world impact & equity
Limits and what comes next

§ 1

One Reference,
Many People

Reference bias · what is it?

Every genome is compared to one sequence
Variants appear as mismatches or gaps
Variants common in a population get flagged as rare or pathogenic
The reference becomes a norm — and everyone else looks abnormal

The library, not the textbook

A single reference is one textbook.
A pangenome is a library.

Different books explain the same concept differently
Some include chapters others lack
Consult many → fuller picture

Why one reference is not enough

Single reference vs multiple reference alignment — **Figure 1.** Aligning many people to *one* reference (left) makes diverse variants appear as mismatches. With *multiple* references (right), each person finds at least one good match.
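
The idea in Figure 1 can be sketched as a toy comparison. This is only an illustration under loud assumptions: the sequences are made up, all have equal length, and a plain Hamming distance stands in for real alignment. The names `REFERENCE`, `PANEL`, and `mismatches` are hypothetical.

```python
def mismatches(a: str, b: str) -> int:
    """Count positions where two equal-length sequences differ (Hamming distance)."""
    if len(a) != len(b):
        raise ValueError("toy example expects equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


# Made-up sequences for illustration only (not real genomic data).
REFERENCE = "ACGTACGTAC"
PANEL = [REFERENCE, "ACGTTCGTAA", "TCGTACGAAC"]

sample = "ACGTTCGTAA"  # carries two variants that the single reference lacks

# Left side of the figure: everything is compared to one sequence.
single = mismatches(sample, REFERENCE)
# Right side of the figure: the sample is compared to its best match in the panel.
best = min(mismatches(sample, ref) for ref in PANEL)

print(f"mismatches vs single reference: {single}")
print(f"mismatches vs best panel match: {best}")
```

The sample looks "variant" against the single reference but matches one panel member exactly, which is the point of the figure.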

§ 2

What Is
a Pangenome?

Pangenome · a definition

A collection of multiple complete, high-quality genomes from diverse individuals, designed to capture the full range of variation in a population.

Many references, not one
Each is complete and high quality
Chosen to cover ancestry diversity

Linear reference vs graph

Linear (old)

One path through DNA
Variants = deviations
One sequence is "correct"

Graph (new)

Many paths through same region
Branches where sequences diverge
Converge where they agree
No single "correct" path

Linear vs pangenome graph

Linear reference vs pangenome graph representation — **Figure 2.** Top: a linear reference forces every variant onto one sequence. Bottom: a pangenome graph offers multiple paths — branches diverge where haplotypes differ, converge where they agree.

A short worked example

Three haplotypes, one region

Hap A: ...ACGTAAACCCCGGTA...
Hap B: ...ACGTAAA----GGTA... (deletion)
Hap C: ...ACGTAAACCCCCCCCGGTA... (expansion)

Linear ref = one of these. Graph = all three paths.
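
The three haplotypes above can be written as a tiny graph: shared sequence becomes shared nodes, and each haplotype is a path through them. This is a minimal sketch; the node IDs and the way the region is cut into nodes are assumptions made for illustration, and real graph builders choose node boundaries themselves. The flanking sequence ("...") is left out.

```python
# Hypothetical node IDs; each node holds a stretch of sequence.
nodes = {
    1: "ACGTAAA",  # shared left flank
    2: "CCCC",     # present in Hap A and Hap C
    3: "CCCC",     # extra copy, present in Hap C only
    4: "GGTA",     # shared right flank
}

# Each haplotype is an ordered walk through the nodes.
paths = {
    "Hap A": [1, 2, 4],
    "Hap B": [1, 4],        # skips node 2: the deletion
    "Hap C": [1, 2, 3, 4],  # visits the extra copy: the expansion
}


def spell(path):
    """Concatenate node sequences along a path to recover the haplotype."""
    return "".join(nodes[n] for n in path)


def edges(all_paths):
    """Collect every node-to-node step used by at least one path."""
    return sorted({(a, b) for p in all_paths.values() for a, b in zip(p, p[1:])})


for name, path in paths.items():
    print(name, spell(path))
print("edges:", edges(paths))
```

A linear reference would keep only one of these paths. The graph keeps all three, and the branch points (node 1 to node 2 or node 4, node 2 to node 3 or node 4) mark where the haplotypes diverge.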

§ 3

The HPRC
2023 Draft

HPRC · the 2023 release

47 individuals · diverse ancestry

94 phased haplotypes (47 diploid genomes × 2)

Liao et al. 2023 · Nature 617, 312–324

Who is in the pangenome?

Population group	Share	Examples
African	51%	Yoruba · Gambian · Mende
American	34%	Puerto Rican · Peruvian · Colombian · Mexican
Asian	13%	Han Chinese · Japanese · Punjabi · Bengali
European	2%	British · Finnish · Iberian

A deliberate inversion of the GRCh38 ancestry skew.

Genetic diversity captured

HPRC ancestry distribution from Liao 2023 Figure 1 — **Figure 3.** Each dot = one individual; closer dots are more genetically similar. The 47 samples span major populations · Liao et al. 2023, *Nature*.

Why diploid means harder

CHM13 was uniparental · two identical haplotypes
Real people are diploid · two different haplotypes
Assembly must phase: separate maternal from paternal
HPRC delivered phased assemblies for all 47
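
What "phasing" means can be shown with a toy example that uses parental genotypes. This is not the HPRC assembly pipeline, only a sketch of the logic of separating maternal from paternal alleles. It assumes error-free genotypes and no new mutations, and the function name `phase_site` and the example sites are hypothetical.

```python
def phase_site(child, mother, father):
    """Return (maternal_allele, paternal_allele) for one site, or None if ambiguous.

    Each genotype is an unordered pair of alleles, e.g. ("A", "G").
    """
    a, b = child
    options = set()
    # Try both ways of assigning the child's two alleles to the parents.
    for mat, pat in ((a, b), (b, a)):
        if mat in mother and pat in father:
            options.add((mat, pat))
    # Exactly one consistent assignment means the site is phased.
    return options.pop() if len(options) == 1 else None


# (child, mother, father) genotypes at three made-up sites.
sites = [
    (("A", "G"), ("A", "A"), ("A", "G")),  # mother can only pass on A
    (("C", "T"), ("C", "T"), ("T", "T")),  # father can only pass on T
    (("G", "T"), ("G", "T"), ("G", "T")),  # everyone heterozygous: ambiguous
]

phased = [phase_site(*site) for site in sites]
print(phased)
```

The third site stays unresolved, which hints at why phasing a diploid genome is harder than assembling a sample with two identical haplotypes.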

How was it built?

Technology	Read length	Role
PacBio HiFi	~20 kb	Long & accurate
Oxford Nanopore	>100 kb	Ultra-long · spans repeats
Hi-C	—	3D folding · haplotype phasing
Illumina	~150 bp	Short · error correction

§ 4

What the
Pangenome Revealed

We were missing a lot of DNA

119 Mb

DNA absent from GRCh38

~4% more sequence than the single reference
Includes genes and regulatory regions
90 Mb of it is structural variation
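
A quick check of these figures. The 119 Mb and 90 Mb values come from the slide; the GRCh38 length of roughly 3,100 Mb is an assumption added here for the arithmetic.

```python
GRCH38_SIZE_MB = 3_100   # assumption: approximate GRCh38 length (about 3.1 Gb)
NEW_SEQUENCE_MB = 119    # from the slide: DNA absent from GRCh38
SV_DERIVED_MB = 90       # from the slide: portion that is structural variation

extra_fraction = NEW_SEQUENCE_MB / GRCH38_SIZE_MB
sv_share = SV_DERIVED_MB / NEW_SEQUENCE_MB

print(f"extra sequence relative to GRCh38: {extra_fraction:.1%}")
print(f"share of the new sequence from SVs: {sv_share:.0%}")
```

Under that assumption the extra sequence is a little under 4% of the reference, and about three quarters of it comes from structural variation.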

What was hiding in those 119 Mb?

Gene sequences · complete genes missing from GRCh38
Regulatory regions · controls for expression
Structural variants · large insertions, deletions, inversions
Population-specific sequences

Genes come in different copy numbers

1,115 gene duplications vary across individuals
Some people have 1–2 copies, others have 20+
The single reference assumed everyone matched one count

The reality is much more complex than a single reference can show.

Why does copy number matter?

AMY1 · starch digestion

15+ copies seen in starch-eating populations
2–4 copies in others

Immune & drug genes

More copies → can mean a stronger immune response
Fewer copies → may reduce autoimmunity
Drug-metabolism genes affect dose

Copy number across populations

Liao 2023 Figure 2 panels b and c, gene copy number variation by population — **Figure 4.** Most genes appear in 1–2 copies, but some reach 20+. Patterns differ by population (AMR · AFR · EAS · SAS · CHM13) · Liao et al. 2023, *Nature*.

Better detection of structural variants

Deletions — chunk of DNA missing
Insertions — extra DNA added
Duplications — region copied
Inversions — segment flipped backwards
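
The four SV types can be mimicked on a toy string. This is only a sketch: the sequence and coordinates are made up, the function names are hypothetical, and real structural variants are far longer (commonly defined as 50 bp or more).

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def delete(seq, start, end):
    """Deletion: remove seq[start:end]."""
    return seq[:start] + seq[end:]


def insert(seq, pos, extra):
    """Insertion: add extra DNA at position pos."""
    return seq[:pos] + extra + seq[pos:]


def duplicate(seq, start, end):
    """Tandem duplication: copy seq[start:end] right after itself."""
    return seq[:end] + seq[start:end] + seq[end:]


def invert(seq, start, end):
    """Inversion: replace seq[start:end] with its reverse complement."""
    flipped = seq[start:end].translate(COMPLEMENT)[::-1]
    return seq[:start] + flipped + seq[end:]


ref = "AAACCGTTT"  # made-up reference; the middle "CCG" is the affected segment

print("deletion:   ", delete(ref, 3, 6))
print("insertion:  ", insert(ref, 3, "GG"))
print("duplication:", duplicate(ref, 3, 6))
print("inversion:  ", invert(ref, 3, 6))
```

Each operation changes several bases at once, which is why a single SV can affect more sequence than many SNPs.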

SVs affect more total DNA than all SNPs combined.

Pangenome accuracy gain

−34%

errors in small variants (SNPs & small indels)

+104%

structural variants detected per haplotype

More than doubling the SVs detected per haplotype.

Why the gain · in one line

With 94 haplotypes in the panel,
a population variant probably matches one of them.
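
"Probably" can be made concrete with a small calculation. The sketch below assumes the panel haplotypes behave like independent random draws from a population in which the variant has a given haplotype frequency. That is a simplification, since the real panel was chosen for diversity, not sampled at random. The function name `p_seen` and the example frequencies are hypothetical.

```python
def p_seen(freq: float, n_haplotypes: int) -> float:
    """Probability that a variant at haplotype frequency `freq`
    appears at least once among `n_haplotypes` independent draws."""
    return 1.0 - (1.0 - freq) ** n_haplotypes


# Compare the 2023 panel (94 haplotypes) with the 700-haplotype target.
for freq in (0.30, 0.05, 0.01):
    print(
        f"freq={freq:.2f}  "
        f"94 haplotypes: {p_seen(freq, 94):.3f}  "
        f"700 haplotypes: {p_seen(freq, 700):.3f}"
    )
```

Under this assumption a variant at 5% frequency is almost certain to be in a 94-haplotype panel, while a 1% variant is missed a fair share of the time. That gap is the case for a larger panel.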

§ 5

Real-World Impact
& Equity

Clinical case · before vs after

Before · single reference

African-ancestry patient
50-kb deletion in a gene
Flagged: "likely pathogenic"
Worry · invasive testing

After · pangenome

Same deletion seen in 30% of African haplotypes
Reclassified: "benign · common"
Patient spared
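
The before-and-after logic can be sketched as a frequency check. This is an illustration, not a clinical rule: the 5% cutoff, the function name `classify`, and the data layout are assumptions, and only the 30% figure comes from the slide.

```python
COMMON_THRESHOLD = 0.05  # assumption: illustrative cutoff, not clinical guidance


def classify(prior_label, freq_by_population, threshold=COMMON_THRESHOLD):
    """Downgrade a flagged variant if it is common in any population."""
    max_freq = max(freq_by_population.values(), default=0.0)
    if max_freq >= threshold:
        return "benign · common"
    return prior_label


# Before: the single reference offers no population frequencies for this deletion.
single_ref_view = {}
# After: the pangenome shows the deletion in 30% of African haplotypes.
pangenome_view = {"African": 0.30}

before = classify("likely pathogenic", single_ref_view)
after = classify("likely pathogenic", pangenome_view)

print("before:", before)
print("after: ", after)
```

The variant itself is the same in both calls. Only the frequency evidence available to the classifier changes.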

From bias to equity

Reference bias reduction across populations — **Figure 5.** Single reference (left) wrongly flags population variants as pathogenic. Pangenome (right) compares against many ancestries → fewer false positives, more equitable diagnosis.

Drug development

Trials enroll diverse populations
But analyses use a single reference → variants missed
Pangenome enables:

Better dose recommendations per population
Understanding population-specific side effects
Identifying who benefits most

Disease research wins

Autism spectrum · often duplications & deletions
Schizophrenia · associated with copy number variants
Cardiovascular · some forms linked to lipid-gene SVs

SV-driven diseases benefit most from a pangenome.

§ 6

Limits &
What's Next

What is still challenging

The most repetitive ~4.4% of the genome
Centromeric satellite arrays
Ribosomal DNA arrays
Some heterochromatic regions

Not a failure — it's genuine biological complexity.

The roadmap · 47 → 350

350

target individuals

700

target haplotypes

Cover more rare variants
Include more under-represented populations
Enable finer sub-population studies

The evolution · references compared

Reference	Year	Coverage	Limit
GRCh38	2013	~95%	~151 Mb gaps · Euro-skewed
T2T-CHM13	2022	100%	Single source
HPRC draft	2023	>99% per hap	47 individuals · some repeat gaps
HPRC future	~2026–27	>99% per hap	700 hap · most repeats still hard

Different references · different jobs

Genome structure → T2T-CHM13 (gapless)
Clinical diagnostics → pangenome (reduces false positives)
Population genetics → pangenome (captures variation)
Evolution vs primates → human + primate pangenomes

The future is not one reference — it's the right reference for your question.

§ 7

Summary

What to take away

One reference → reference bias against diverse populations
Pangenome = many complete genomes · graph representation
HPRC 2023 · 47 individuals · 94 haplotypes · Liao et al., Nature
Revealed: 119 Mb missing DNA · 1,115 gene duplications
Variant accuracy: −34% errors · +104% SV detection
Equity: a fairer denominator for clinical decisions

Six things to take away. One: a single reference, no matter how good, builds in reference bias against diverse populations. Two: a pangenome is a collection of many complete genomes, often represented as a graph rather than a line. Three: the HPRC's 2023 draft, published by Liao and colleagues in Nature, includes 47 individuals and 94 haplotypes. Four: that draft revealed about 119 Mb of DNA absent from GRCh38, and 1,115 gene duplications that vary across individuals. Five: variant calling against the pangenome cuts small-variant errors by 34% and more than doubles structural variant detection. Six, and most importantly: this is the technical foundation of equity in genomic medicine. Hold these six.

Next lecture

To do all of this,
we needed cheap sequencing.

Chapter 5 · Next-Generation Sequencing Technologies