BSMS205 · Genetics

The Telomere-to-
Telomere Project

Chapter 2 · Part I · The Human Genome
A question to start with

Was the genome
really complete?

The "complete" genome wasn't

  • HGP declared complete in 2003
  • About 8% still unsequenced
  • Not random gaps — the hardest regions
  • Centromeres · telomeres · ribosomal DNA
Functionally critical, technically impossible.
The audacious goal
3,055,000,000
base pairs · every single one · telomere to telomere
  • The first truly complete human genome
  • Released in 2022 as T2T-CHM13
  • Added ~200 million bp of new sequence

Roadmap for today

  1. Why the HGP couldn't finish
  2. The technologies that changed everything
  3. CHM13 · the special cell line
  4. The X chromosome · 2020 · proof of concept
  5. What's new in T2T-CHM13 · 2022
  6. Finishing the Y chromosome · 2023
  7. Summary & what comes next
§ 1

Why couldn't
the HGP finish?

Short reads · 100–200 base pairs

  • HGP-era Sanger reads: under ~1 kb · later short-read platforms: 100–200 bp
  • Worked great for unique DNA sequence
  • Reassembled like overlapping puzzle pieces
  • One method, used on the whole genome

Where short reads fail

Short-read sequencing fails in repetitive regions while long reads succeed
Figure 1. Short reads (100–200 bp) cannot place themselves in long repetitive regions — every repeat looks the same. Long reads (20 kb to 100+ kb) span multiple repeats in one read, anchoring the sequence uniquely.
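A toy sketch of the placement problem in Figure 1, on a synthetic sequence rather than real genomic data. The repeat unit, the flanks, and the helper name count_placements are all made up for illustration: a read that lies entirely inside a tandem array matches at many positions, while a read that reaches into unique flanking sequence matches at exactly one.

```python
def count_placements(genome: str, read: str) -> int:
    """Count every position where `read` matches `genome` exactly (overlaps allowed)."""
    count = 0
    start = genome.find(read)
    while start != -1:
        count += 1
        start = genome.find(read, start + 1)
    return count


unit = "ACGTTGCA"              # hypothetical 8-bp repeat unit
left_flank = "GGATCCTTAGCA"    # unique sequence left of the array
right_flank = "TTCGAGGCCTAA"   # unique sequence right of the array
genome = left_flank + unit * 10 + right_flank

short_read = unit * 2                     # lies entirely inside the repeat array
long_read = genome[6:len(genome) - 6]     # spans the array plus part of both flanks

short_hits = count_placements(genome, short_read)
long_hits = count_placements(genome, long_read)
print(short_hits, long_hits)              # 9 1
```

Real aligners tolerate mismatches and use indexed search, so this exact-match count only illustrates the ambiguity, not how mapping is done in practice.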

The book analogy

Imagine a book where dozens of pages
all read: "and then they walked".
  • You can read each page
  • You cannot tell which page goes where
  • The story falls apart in the middle
  • That's the centromere problem

The four hard regions

  • Centromeres · alpha satellite arrays · 100s of kb to Mb
  • Telomeres · TTAGGG repeats at chromosome ends
  • Ribosomal DNA · 45 kb units repeated 100s of times
  • Segmental duplications · large blocks · 90–99% identical
Together: ~240 million base pairs of "dark matter".
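As a small illustration of the telomere entry above, the sketch below measures the run of back-to-back TTAGGG copies at the end of a sequence. The function name telomere_run_length and the toy sequence are assumptions made for this example; real telomeres contain variant repeats and sequencing errors that an exact-match pattern like this would miss.

```python
import re


def telomere_run_length(seq: str, motif: str = "TTAGGG") -> int:
    """Length in bp of the run of exact, back-to-back `motif` copies ending at the 3' end of `seq`."""
    match = re.search(f"(?:{motif})+$", seq)
    return len(match.group(0)) if match else 0


# Synthetic chromosome end: a short unique stretch followed by 50 telomeric repeats.
toy_chromosome_end = "ACGTGCTAGCTAGGATCC" + "TTAGGG" * 50
run_bp = telomere_run_length(toy_chromosome_end)
print(run_bp)   # 300
```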
§ 2

The technologies
that changed it

Two long-read platforms

PacBio HiFi

  • ~20,000 bp per read
  • Very high accuracy
  • Resolves most repeats

Oxford Nanopore

  • >100,000 bp per read
  • "Ultra-long" reads
  • Spans even mega-repeats

Plus four supporting methods

  • Illumina short reads · for error correction
  • Hi-C · maps how DNA folds in 3D
  • Bionano optical mapping · long-range physical map
  • Strand-seq · which strand came from which parent

String graphs · not a line

  • Old assemblies = a single linear sequence
  • T2T uses a graph: nodes = sequences, edges = overlaps
  • Like a subway map · multiple possible routes
  • Long reads pick the correct path through repeats
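The "nodes = sequences, edges = overlaps" idea can be sketched as a plain overlap graph. This is a simplified illustration, not the assembler used by the T2T consortium: a true string graph also removes contained reads and transitive edges, and the reads, function names, and min_len threshold here are invented for the example.

```python
from collections import defaultdict


def suffix_prefix_overlap(a: str, b: str, min_len: int) -> int:
    """Longest suffix of `a` that equals a prefix of `b`; 0 if shorter than `min_len`."""
    for length in range(min(len(a), len(b)) - 1, min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def build_overlap_graph(reads, min_len=4):
    """Map each read to the reads it overlaps, with the overlap length as edge weight."""
    graph = defaultdict(dict)
    for a in reads:
        for b in reads:
            if a != b:
                overlap = suffix_prefix_overlap(a, b, min_len)
                if overlap:
                    graph[a][b] = overlap
    return dict(graph)


# Three toy reads drawn from the made-up sequence ATGCGTACGTTAGC.
reads = ["ATGCGTAC", "CGTACGTT", "ACGTTAGC"]
graph = build_overlap_graph(reads)
for source, targets in graph.items():
    print(source, "->", targets)
```

Here the graph is a single chain, so the path is unambiguous. In a repeat, one node gains several outgoing edges, which is the "multiple possible routes" situation that longer reads resolve.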

The string graph · seeing the tangle

T2T-CHM13 string graph showing tangled repetitive regions
Figure 2. The CHM13 string graph. Each line is a sequence; intersections show overlaps. Tangled regions = highly repetitive zones (centromeric satellites, ribosomal DNA arrays). Long reads pick the correct path through each tangle · Nurk et al. 2022, Science (preprint bioRxiv). CC-BY 4.0.

What it looks like

T2T-CHM13 assembly ideogram showing newly added regions in red
Figure 2. T2T-CHM13 assembly ideogram. Red regions = new sequence not in GRCh38, including all centromeres and the short arms of acrocentric chromosomes 13, 14, 15, 21, 22 · Nurk et al. 2022, Science.

Final accuracy

1 / 10,000,000
error rate · one mistake per ten million bases
  • Better than the original HGP standard
  • Verified across many independent methods
  • The most accurate human genome ever assembled
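Assembly accuracy is often quoted as a Phred-style quality value, QV = -10 · log10(error rate). The slides give only the error rate; the conversion below uses that standard definition, and the function name phred_qv is chosen for this example.

```python
import math


def phred_qv(error_rate: float) -> float:
    """Phred-scaled quality value for a per-base error rate."""
    return -10.0 * math.log10(error_rate)


qv = phred_qv(1 / 10_000_000)   # one error per ten million bases
print(round(qv))                # 70
```

On this scale every 10 points is a tenfold drop in errors, so one error per ten million bases corresponds to about Q70.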
§ 3

What is so special
about CHM13?

A complete hydatidiform mole

  • Origin: an egg that has lost its nuclear genome
  • Fertilized by a single sperm, whose genome then duplicates
  • Result: two identical copies of every chromosome
  • All from the father · none from the mother

Why this simplifies assembly

CHM13 has two identical copies of each chromosome, simplifying assembly
Figure 3. Typical diploid: maternal + paternal copies differ slightly, must be distinguished. CHM13: two identical paternal copies — effectively haploid for assembly.

The catch

46,XX
CHM13 karyotype
  • Two X chromosomes · no Y
  • So T2T-CHM13 has no Y
  • Y was finished separately in 2023
  • Using a different cell line · HG002
§ 4

2020 · the first
complete chromosome

Why start with X?

  • Initial assembly was broken in only three places
  • X centromere (DXZ1) was well-studied
  • X is medically important · many disease loci
  • CHM13 has two X copies · no Y interference

Three places where X was still broken

Initial CHM13 X chromosome assembly showing three break locations
Figure 3. Initial CHM13 X assembly. Breaks at the centromere (artificially collapsed), a 120-kb segmental duplication (DMRTC1B), and a 134-kb segmental duplication with a paralogue on chromosome 2. Black bars = GRCh38 gaps; red bars = known segmental duplications · Miga et al. 2020, Nature. CC-BY 4.0.

The centromere · 3.1 megabases

  • 3.1 Mb of alpha satellite DNA
  • ~1,408 copies of a 2,057-bp repeat unit
  • Standard polishing methods made it worse
  • Reads kept landing in the wrong place

Marker-assisted polishing

  • Find 21-bp sequences that appear only once in the genome
  • Even inside DXZ1: a marker every 2.3 kb on average
  • Anchor reads using these unique markers
  • Iterate: Nanopore → PacBio → Illumina
Find what is unique even inside repeats.
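The first step above, finding k-mers that occur exactly once, can be sketched on a toy array. Everything here is synthetic and hypothetical: the 30-bp unit deliberately contains no T, so that a single T variant in one copy is the only source of unique 21-mers. This is not the pipeline from Miga et al., only the counting idea behind it.

```python
from collections import Counter


def unique_kmer_positions(seq: str, k: int = 21) -> list[int]:
    """Start positions of k-mers that occur exactly once in `seq`."""
    n_kmers = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n_kmers))
    return [i for i in range(n_kmers) if counts[seq[i:i + k]] == 1]


# Hypothetical 30-bp repeat unit (no T), repeated 6 times in tandem.
unit = "ACGGCAAGCC" + "AGGACGCAAC" + "GAGCCGAACG"
array = unit * 6

# Introduce one single-base variant into the third copy.
variant_pos = 2 * len(unit) + 15
array = array[:variant_pos] + "T" + array[variant_pos + 1:]

markers = unique_kmer_positions(array, k=21)
print(markers[0], markers[-1], len(markers))   # 55 75 21
```

Every unique 21-mer overlaps the variant base. A read carrying such a marker can be anchored to that one copy of the repeat.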

Validating the centromere structure

Validation of the 3.1 Mb DXZ1 centromeric array structure
Figure 4. Multiple independent methods confirm the DXZ1 array structure: PFGE Southern blots, ddPCR copy counting, optical mapping, and 33 catalogued structural variants · Miga et al. 2020, Nature.

What the complete X delivered

  • Closed 29 gaps · 1.15 million bases of new sequence
  • Met the original Bermuda standard for finished genomes
  • Complete GAGE, CT45, CT47 gene families
  • First complete pseudoautosomal regions (PAR1 + PAR2)

Bonus · methylation across the centromere

  • Nanopore reads also detect DNA methylation
  • Found a 93 kb hypomethylated dip inside DXZ1
  • Probably where kinetochore proteins bind
  • Same dip seen later on chromosome 8
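One simple way to locate a dip like this in a per-site methylation profile is a sliding-window mean. The sketch below runs on made-up values and is not the analysis used in the paper; the function name, window size, and profile are assumptions for illustration.

```python
def lowest_methylation_window(fractions, window):
    """Return (start_index, mean) of the window with the lowest mean methylation."""
    if window <= 0 or window > len(fractions):
        raise ValueError("window must be between 1 and len(fractions)")
    current = sum(fractions[:window])
    best_start, best_sum = 0, current
    for start in range(1, len(fractions) - window + 1):
        # Slide the window one site to the right.
        current += fractions[start + window - 1] - fractions[start - 1]
        if current < best_sum:
            best_start, best_sum = start, current
    return best_start, best_sum / window


# Synthetic profile: mostly methylated sites with a 10-site hypomethylated dip.
profile = [0.9] * 40 + [0.2] * 10 + [0.9] * 50
dip_start, dip_mean = lowest_methylation_window(profile, window=10)
print(dip_start, round(dip_mean, 2))   # 40 0.2
```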

Methylation across the complete X

Methylation patterns across the complete X chromosome
Figure 5. Methylation patterns from Nanopore reads. (a) Hypomethylated PAR1; blue = unmethylated, red = methylated. (b) A 93-kb hypomethylated region inside the DXZ1 centromere — the likely kinetochore-binding core. (c) The DXZ4 macrosatellite array showing a sharp methylated→unmethylated transition · Miga et al. 2020, Nature. CC-BY 4.0.
§ 5

What's new
in T2T-CHM13?

Region 1 · centromeric satellite arrays

  • All 22 autosomes + X centromeres complete
  • Sizes: 366 kb (Y, from the separate T2T-Y assembly) to several megabases
  • Different chromosomes use different alpha satellite variants
  • Some centromeres (chr 1, 5, 19) share sequence

Why centromeres matter

Centromere errors → chromosome missegregation →
aneuploidy → cancer, Down syndrome.
  • Kinetochores attach here during cell division
  • Most aneuploid embryos are non-viable
  • Aneuploidy in adult cells associated with cancer

Region 2 · segmental duplications

GRCh38

5.00%

T2T-CHM13

6.61%

201.93 Mb of segmental duplications · driver of structural variation.
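A quick arithmetic check that the percentage and the megabase figure agree. The genome size is the rounded 3,055,000,000 bp quoted earlier in these slides, so the result can differ from 201.93 Mb in the last digit.

```python
GENOME_BP = 3_055_000_000      # rounded T2T-CHM13 size from the earlier slide
SEGDUP_FRACTION_T2T = 0.0661   # 6.61% of the assembly

segdup_mb = GENOME_BP * SEGDUP_FRACTION_T2T / 1e6
print(f"{segdup_mb:.1f} Mb")   # 201.9 Mb
```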

Segmental duplications · trait + disease

Adaptive

  • AMY1 · salivary amylase
  • Copy number ↔ dietary starch

Disease

  • Many genomic disorders
  • FSHD · muscular dystrophy
  • Charcot-Marie-Tooth, etc.

Region 3 · acrocentric short arms

  • Chromosomes 13, 14, 15, 21, 22
  • Short arm carries ribosomal DNA arrays
  • Each rDNA unit = 45 kb, repeated 100s of times
  • CHM13 has ~400 rDNA copies total

The nucleolus connection

  • The five acrocentric short arms cluster together in the cell
  • That cluster forms the nucleolus
  • Where ribosome biogenesis happens
  • Short arms share ~98.7% sequence identity

Quick reference · the four hard regions

Region                   Size in T2T           Why it matters
Centromeric satellites   366 kb – several Mb   Cell division
Segmental duplications   201.93 Mb (6.61%)     Structural variation
Acrocentric short arms   66.1 Mb total         Ribosome biogenesis
Yq12 heterochromatin     >30 Mb                Unknown · evolving fast
§ 6

2023 · finishing
the Y

Why the Y was so hard

  • Long palindromes · sequences that read the same on both strands
  • Many tandem repeat arrays · larger than on other chromosomes
  • Massive heterochromatic block on Yq12
  • More than half of Y was missing in GRCh38
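A DNA palindrome equals its own reverse complement. The sketch below checks this on short toy strings. The Y palindromes are far larger and usually have two inverted arms around a spacer, so the second example builds that arm + spacer + reverse-complement layout from made-up sequence. Function names are chosen for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def is_dna_palindrome(seq: str) -> bool:
    """True if the sequence reads the same on both strands."""
    return seq == reverse_complement(seq)


print(is_dna_palindrome("GAATTC"))    # True
print(is_dna_palindrome("GATTACA"))   # False

# Toy inverted repeat: left arm, spacer, then the reverse complement of the arm.
arm = "ACCGTTGA"
spacer = "TTTT"
inverted_repeat = arm + spacer + reverse_complement(arm)
right_arm = inverted_repeat[-len(arm):]
print(reverse_complement(right_arm) == arm)   # True
```

Because the two arms are near-identical, a read that sits entirely inside one arm cannot tell which arm it came from. This is the same ambiguity as in tandem repeats.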

T2T-Y · 62.46 megabases · zero gaps

Complete structure of the human Y chromosome from T2T-Y
Figure 5. T2T-Y assembly: 62.46 Mb, no gaps · ampliconic gene clusters, palindromes, centromere, and the previously hidden Yq12 satellite blocks revealed for the first time · Rhie et al. 2023, Nature.

What the Y carries

  • SRY · the master male-sex-determining gene
  • TSPY · 45 protein-coding copies (vs 7 in GRCh38)
  • DAZ, RBMY · spermatogenesis
  • AZF regions · deletions cause male infertility

The mysterious Yq12

  • Was a single 30+ Mb gap in GRCh38
  • Now: alternating blocks of DYZ1 + DYZ2
  • Some blocks recently duplicated up to 5 Mb
  • HSat1B (DYZ2) is almost unique to Y + acrocentrics
§ 7

Summary

What to take away

  • HGP left ~8% unsequenced · centromeres, telomeres, rDNA, segdups
  • Long reads (PacBio HiFi + Oxford Nanopore) finally span repeats
  • CHM13 · two identical paternal copies · simplifies assembly
  • 2020 X chromosome · 2022 T2T-CHM13 · 2023 T2T-Y
  • Combined: T2T-CHM13v2.0 · the first truly complete human genome
Next lecture

Where did
CHM13 come from?

Chapter 3 · The CHM13 Cell Line