BSMS205 · Genetics

From Single Reference
to Pangenome

Chapter 4 · Part I · The Human Genome

A question to start with

We finally have a complete
reference. Is one enough?

How different are two human genomes?

  • Millions of single-letter differences — SNPs
  • Larger chunks inserted, deleted, duplicated, flipped — SVs
  • Gene copy numbers vary between individuals
  • Some sequences exist in some populations only

Two stories that show the cost

Drug response

  • Insertion found in East Asian populations
  • Changes the protein's shape
  • Missing from the reference
  • Why does the drug behave differently?

Diagnosis

  • African-ancestry patient
  • Large deletion in a gene
  • Disease — or normal variant?
  • Single reference cannot say

Roadmap for today

  1. Why one reference is not enough
  2. What a pangenome is — graph vs line
  3. The HPRC · 47 individuals · 94 haplotypes
  4. What the pangenome revealed
  5. Real-world impact & equity
  6. Limits and what comes next

§ 1

One Reference,
Many People

Reference bias · what is it?

  • Every genome is compared to one sequence
  • Variants appear as mismatches or gaps
  • Variants that are common in a population get flagged as rare or pathogenic
  • The reference becomes a norm — and everyone else looks abnormal

The library, not the textbook

A single reference is one textbook.
A pangenome is a library.
  • Different books explain the same concept differently
  • Some include chapters others lack
  • Consult many → fuller picture

Why one reference is not enough

Single reference vs multiple reference alignment
Figure 1. Aligning many people to one reference (left) makes diverse variants appear as mismatches. With multiple references (right), each person finds at least one good match.
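
The idea in Figure 1 can be sketched in a few lines of Python. The sequences are made-up toy strings, and `mismatches` and `best_match` are hypothetical helper names rather than functions from any alignment tool. Real aligners also handle gaps, which this sketch ignores.

```python
def mismatches(read, reference):
    """Count positions where two equal-length sequences differ."""
    if len(read) != len(reference):
        raise ValueError("toy example only compares equal-length sequences")
    return sum(1 for a, b in zip(read, reference) if a != b)


def best_match(read, references):
    """Return (name, mismatch count) for the closest reference in a panel."""
    scored = ((name, mismatches(read, seq)) for name, seq in references.items())
    return min(scored, key=lambda pair: pair[1])


# Toy data (assumption): one person's sequence, one reference, and a small panel.
read = "ACGAACGTTC"
single_reference = {"ref1": "ACGTACGTAC"}
panel = {"ref1": "ACGTACGTAC", "ref2": "ACGAACGTTC", "ref3": "ACGTACCTAC"}

print(best_match(read, single_reference))  # differences look like "variants"
print(best_match(read, panel))             # a closer match exists in the panel
```

Against the single reference the read carries apparent variants; against the panel it finds a reference it matches exactly, which is the point of the right-hand side of the figure.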
§ 2

What Is
a Pangenome?

Pangenome · a definition

A collection of multiple complete, high-quality genomes from diverse individuals, designed to capture the full range of variation in a population.
  • Many references, not one
  • Each is complete and high quality
  • Chosen to cover ancestry diversity

Linear reference vs graph

Linear (old)

  • One path through DNA
  • Variants = deviations
  • One sequence is "correct"

Graph (new)

  • Many paths through same region
  • Branches where sequences diverge
  • Converge where they agree
  • No single "correct" path

Linear vs pangenome graph

Linear reference vs pangenome graph representation
Figure 2. Top: a linear reference forces every variant onto one sequence. Bottom: a pangenome graph offers multiple paths: branches diverge where haplotypes differ and converge where they agree.

A short worked example

Three haplotypes, one region

  • Hap A: ...ACGTAAACCCCGGTA...
  • Hap B: ...ACGTAAAGGTA... (deletion)
  • Hap C: ...ACGTAAACCCCCCCCGGTA... (expansion)

Linear ref = one of these. Graph = all three paths.
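
The three haplotypes above can be turned into a tiny graph by hand. The sketch below is a minimal illustration, assuming the region has one shared left flank, one variable middle, and one shared right flank; `build_bubble` and `spell` are hypothetical names, and this is not how production pangenome builders work.

```python
from os.path import commonprefix

haplotypes = {
    "A": "ACGTAAACCCCGGTA",
    "B": "ACGTAAAGGTA",
    "C": "ACGTAAACCCCCCCCGGTA",
}


def build_bubble(haps):
    """Split haplotypes into shared flanks plus one variable middle node each."""
    seqs = list(haps.values())
    prefix = commonprefix(seqs)
    # Cap the shared suffix so it cannot overlap the prefix in the shortest haplotype.
    max_suffix = min(len(s) for s in seqs) - len(prefix)
    suffix = commonprefix([s[::-1] for s in seqs])[:max_suffix][::-1]

    nodes = {"left": prefix, "right": suffix}
    allele_ids = {}  # middle sequence -> node id, so identical alleles share a node
    paths = {}
    for name, seq in haps.items():
        middle = seq[len(prefix):len(seq) - len(suffix)]
        if middle:
            node_id = allele_ids.setdefault(middle, f"mid{len(allele_ids) + 1}")
            nodes[node_id] = middle
            paths[name] = ["left", node_id, "right"]
        else:
            # Deletion: the path jumps straight from the left flank to the right flank.
            paths[name] = ["left", "right"]
    return nodes, paths


def spell(nodes, path):
    """Walk a path through the graph and return the sequence it spells."""
    return "".join(nodes[node_id] for node_id in path)


nodes, paths = build_bubble(haplotypes)
for name, path in paths.items():
    print(name, path, spell(nodes, path))
```

Each haplotype is one path; walking a path reproduces that haplotype exactly, and no path is privileged as the "correct" one.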

§ 3

The HPRC
2023 Draft

HPRC · the 2023 release

47
individuals · diverse ancestry
94
phased haplotypes (47 diploid genomes × 2)

Liao et al. 2023 · Nature 617, 312–324

Who is in the pangenome?

Population group | Share | Examples
African | 51% | Yoruba · Gambian · Mende
American | 34% | Puerto Rican · Peruvian · Colombian · Mexican
Asian | 13% | Han Chinese · Japanese · Punjabi · Bengali
European | 2% | British · Finnish · Iberian

A deliberate inversion of the GRCh38 ancestry skew.
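
To make the shares concrete, they can be converted into approximate head counts. This is a small sketch: the percentages and the total of 47 come from the slides, while the rounded counts are derived here and are an inference rather than figures quoted from the paper.

```python
TOTAL_INDIVIDUALS = 47
shares_percent = {"African": 51, "American": 34, "Asian": 13, "European": 2}

# Rounded head count per group, derived from the percentage shares.
counts = {
    group: round(TOTAL_INDIVIDUALS * pct / 100)
    for group, pct in shares_percent.items()
}

# Each individual is diploid, so contributes two phased haplotypes.
total_haplotypes = 2 * TOTAL_INDIVIDUALS

for group, n in counts.items():
    print(f"{group}: about {n} individuals")
print(f"Total haplotypes: {total_haplotypes}")
```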

Genetic diversity captured

HPRC ancestry distribution from Liao 2023 Figure 1
Figure 3. Each dot = one individual; closer dots are more genetically similar. The 47 samples span major populations · Liao et al. 2023, Nature.

Why diploid is harder

  • CHM13 was uniparental · two identical haplotypes
  • Real people are diploid · two different haplotypes
  • Assembly must phase: separate maternal from paternal
  • HPRC delivered phased assemblies for all 47
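
Why phasing is a real problem can be seen by counting. The sketch below assumes we only know the unordered genotype at each site and enumerates every haplotype pair consistent with it; `possible_haplotype_pairs` is a hypothetical name and the genotypes are toy data.

```python
from itertools import product


def possible_haplotype_pairs(genotypes):
    """Enumerate unordered haplotype pairs consistent with unphased genotypes.

    genotypes: list of (allele, allele) tuples, one per site.
    """
    pairs = set()
    for flips in product((False, True), repeat=len(genotypes)):
        hap1 = "".join(b if flip else a for (a, b), flip in zip(genotypes, flips))
        hap2 = "".join(a if flip else b for (a, b), flip in zip(genotypes, flips))
        pairs.add(frozenset((hap1, hap2)))  # unordered: (hap1, hap2) == (hap2, hap1)
    return pairs


# Three heterozygous sites (toy data).
unphased = [("A", "G"), ("C", "T"), ("G", "T")]
for pair in possible_haplotype_pairs(unphased):
    print(sorted(pair))
```

With k heterozygous sites there are 2^(k-1) candidate pairs, so genotypes alone cannot say which alleles travel together. Long reads and Hi-C supply the linkage information that picks one pair.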

How was it built?

Technology | Read length | Role
PacBio HiFi | ~20 kb | Long & accurate
Oxford Nanopore | >100 kb | Ultra-long · spans repeats
Hi-C | - | 3D folding · haplotype phasing
Illumina | ~150 bp | Short · error correction

§ 4

What the
Pangenome Revealed

We were missing a lot of DNA

119 Mb
DNA absent from GRCh38
  • ~4% more sequence than the single reference
  • Includes genes and regulatory regions
  • 90 Mb of it is structural variation
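
A quick arithmetic check of these figures, as a sketch. The 119 Mb and 90 Mb values come from the slide; the GRCh38 length of roughly 3,100 Mb is an approximation assumed here for illustration.

```python
NEW_SEQUENCE_MB = 119     # from the slide
SV_DERIVED_MB = 90        # from the slide
GRCH38_LENGTH_MB = 3100   # assumption: approximate size of the single reference

percent_added = 100 * NEW_SEQUENCE_MB / GRCH38_LENGTH_MB
sv_share = SV_DERIVED_MB / NEW_SEQUENCE_MB

print(f"Extra sequence relative to the reference: {percent_added:.1f}%")
print(f"Share of the extra sequence from structural variation: {sv_share:.0%}")
```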

What was hiding in those 119 Mb?

  • Gene sequences · complete genes missing from GRCh38
  • Regulatory regions · controls for expression
  • Structural variants · large insertions, deletions, inversions
  • Population-specific sequences

Genes come in different copy numbers

  • 1,115 gene duplications vary across individuals
  • Some people have 1–2 copies, others have 20+
  • The single reference assumed everyone matched one count
The reality is much more complex than a single reference can show.

Why does copy number matter?

AMY1 · starch digestion

  • Up to 15+ copies in starch-eating populations
  • 2–4 copies in others

Immune & drug genes

  • More copies → can mean a stronger response
  • Fewer copies → may reduce autoimmunity
  • Drug-metabolism genes affect dose

Copy number across populations

Liao 2023 Figure 2 panels b and c, gene copy number variation by population
Figure 4. Most genes appear in 1–2 copies, but some reach 20+. Patterns differ by population (AMR · AFR · EAS · SAS · CHM13) · Liao et al. 2023, Nature.

Better detection of structural variants

  • Deletions — chunk of DNA missing
  • Insertions — extra DNA added
  • Duplications — region copied
  • Inversions — segment flipped backwards

SVs affect more total DNA than all SNPs combined.

Pangenome accuracy gain

−34%
errors in small variants (SNPs & small indels)
+104%
more structural variants detected per haplotype

More than double the SVs found with GRCh38-based workflows.
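
The two percentages act in opposite directions: −34% shrinks a count, while +104% more than doubles one. The sketch below applies them to hypothetical baseline counts (the baselines are invented for illustration; only the percentages come from the slide).

```python
def apply_percent_change(baseline, percent_change):
    """Return the count after a relative change given in percent."""
    return baseline * (100 + percent_change) / 100


# Hypothetical baselines for a single-reference workflow.
baseline_small_variant_errors = 1000
baseline_svs_detected = 10000

errors_after = apply_percent_change(baseline_small_variant_errors, -34)
svs_after = apply_percent_change(baseline_svs_detected, 104)

print(f"Small-variant errors: {baseline_small_variant_errors} -> {errors_after:.0f}")
print(f"SVs detected: {baseline_svs_detected} -> {svs_after:.0f}")
```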

Why the gain · in one line

With 94 haplotypes in the panel,
a population variant probably matches one of them.
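
That intuition can be given a rough number. The sketch below assumes each of the 94 haplotypes is an independent random draw from a population in which the variant has a given frequency. This is a simplification, since the real panel is deliberately stratified by ancestry, and `chance_in_panel` is a hypothetical helper name.

```python
def chance_in_panel(frequency, n_haplotypes=94):
    """Probability that at least one haplotype in the panel carries the variant."""
    if not 0.0 <= frequency <= 1.0:
        raise ValueError("frequency must be between 0 and 1")
    # Complement of "none of the n haplotypes carries it".
    return 1.0 - (1.0 - frequency) ** n_haplotypes


for freq in (0.001, 0.01, 0.05, 0.30):
    print(f"frequency {freq:.1%}: chance in panel {chance_in_panel(freq):.1%}")
```

Under this assumption common variants are almost certain to appear in the panel, while very rare ones often will not.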
§ 5

Real-World Impact
& Equity

Clinical case · before vs after

Before · single reference

  • African-ancestry patient
  • 50-kb deletion in a gene
  • Flagged: "likely pathogenic"
  • Worry · invasive testing

After · pangenome

  • Same deletion seen in 30% of African haplotypes
  • Reclassified: "benign · common"
  • Patient spared
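
The reclassification logic can be sketched as a frequency lookup. Everything here is illustrative: the 5% cutoff is an assumed threshold, `classify` is a hypothetical function, and only the 30% figure comes from the case above. Real variant interpretation weighs much more evidence than frequency.

```python
COMMON_THRESHOLD = 0.05  # assumption: illustrative cutoff for "common"


def classify(frequencies_by_group, threshold=COMMON_THRESHOLD):
    """Give a rough label based on the highest frequency seen in any group."""
    if not frequencies_by_group:
        return "no population data: cannot judge"
    group, freq = max(frequencies_by_group.items(), key=lambda item: item[1])
    if freq >= threshold:
        return f"benign candidate: common in {group} ({freq:.0%})"
    return "rare in panel: needs further review"


# Before: the single reference offers no frequency context for this deletion.
print(classify({}))

# After: the panel shows the deletion in 30% of African haplotypes
# (the European value is a hypothetical placeholder).
print(classify({"African": 0.30, "European": 0.0}))
```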

From bias to equity

Reference bias reduction across populations
Figure 5. Single reference (left) wrongly flags population variants as pathogenic. Pangenome (right) compares against many ancestries → fewer false positives, more equitable diagnosis.

Drug development

  • Trials enroll diverse populations
  • But analyses use a single reference → variants missed
  • Pangenome enables:
    • Better dose recommendations per population
    • Understanding population-specific side effects
    • Identifying who benefits most

Disease research wins

  • Autism spectrum · often duplications & deletions
  • Schizophrenia · associated with copy number variants
  • Cardiovascular · some forms linked to lipid-gene SVs

SV-driven diseases benefit most from a pangenome.

§ 6

Limits &
What's Next

What is still challenging

  • The most repetitive ~4.4% of the genome
  • Centromeric satellite arrays
  • Ribosomal DNA arrays
  • Some heterochromatic regions

Not a failure — it's genuine biological complexity.

The roadmap · 47 → 350

350
target individuals
700
target haplotypes
  • Cover more rare variants
  • Include more under-represented populations
  • Enable finer sub-population studies
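
Why more haplotypes help with rare variants can be estimated with a back-of-the-envelope calculation. The sketch assumes haplotypes are independent random draws from one population and asks how many are needed to see a variant at least once with a chosen confidence; `haplotypes_needed` is a hypothetical name.

```python
import math


def haplotypes_needed(frequency, confidence=0.95):
    """Smallest panel size that catches the variant at least once with the given confidence."""
    if not 0.0 < frequency < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("frequency and confidence must be strictly between 0 and 1")
    # Solve 1 - (1 - f)^n >= confidence for n.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - frequency))


for freq in (0.05, 0.01, 0.005):
    print(f"frequency {freq:.1%}: about {haplotypes_needed(freq)} haplotypes")
```

Under these assumptions a 1% variant needs more haplotypes than the current 94 but fewer than the planned 700.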

The evolution · references compared

Reference | Year | Coverage | Limit
GRCh38 | 2013 | ~95% | ~151 Mb gaps · Euro-skewed
T2T-CHM13 | 2022 | 100% | Single source
HPRC draft | 2023 | >99% per haplotype | 47 individuals · some repeat gaps
HPRC future | ~2026–27 | >99% per haplotype | 700 haplotypes · most repeats still hard

Different references · different jobs

  • Genome structure → T2T-CHM13 (gapless)
  • Clinical diagnostics → pangenome (reduces false positives)
  • Population genetics → pangenome (captures variation)
  • Evolution vs primates → human + primate pangenomes
The future is not one reference: it's the right reference for your question.

§ 7

Summary

What to take away

  • One reference → reference bias against diverse populations
  • Pangenome = many complete genomes · graph representation
  • HPRC 2023 · 47 individuals · 94 haplotypes · Liao et al., Nature
  • Revealed: 119 Mb missing DNA · 1,115 gene duplications
  • Variant calling: −34% small-variant errors · +104% SVs detected
  • Equity: a fairer denominator for clinical decisions

Next lecture

To do all of this,
we needed cheap sequencing.

Chapter 5 · Next-Generation Sequencing Technologies