BSMS205 · Genetics

The Telomere-to-Telomere Project

Chapter 2 · Part I · The Human Genome

A question to start with

Was the genome really complete?

The "complete" genome wasn't

HGP declared complete in 2003
About 8% still unsequenced
Not random gaps — the hardest regions
Centromeres · telomeres · ribosomal DNA

Functionally critical, technically impossible.

Let's be specific about what was missing. About eight percent of the genome remained unsequenced after two thousand three. That is not a rounding error: it is roughly two hundred and forty million base pairs of human DNA. And these were not randomly scattered gaps. They were concentrated in very specific regions. Centromeres, the central pinch of each chromosome, where the machinery attaches that pulls chromosome copies apart during cell division. Telomeres, the protective caps at chromosome ends. Ribosomal DNA arrays, the genes that build the ribosomes that make every protein in your body. These are not optional regions. Without them, cells cannot divide, cannot make proteins, cannot function. They were missing not because they did not matter, but because the sequencing technology of the time could not resolve them.
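A quick check of that arithmetic, as a minimal sketch. It assumes the eight percent is taken against the 3,055,000,000 bp total quoted for the finished assembly; both inputs are rounded lecture figures, so the result is approximate.

```python
# Rounded figures from this lecture, not exact measurements.
GENOME_BP = 3_055_000_000   # total length quoted for the complete assembly
MISSING_FRACTION = 0.08     # "about eight percent" left unsequenced

missing_bp = GENOME_BP * MISSING_FRACTION
print(f"{missing_bp / 1e6:.0f} Mb")  # prints "244 Mb"
```

That lands close to the "roughly two hundred and forty million" quoted above.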

The audacious goal

3,055,000,000

base pairs · every single one · telomere to telomere

The first truly complete human genome
Released in 2022 as T2T-CHM13
Added ~200 million bp of new sequence

Roadmap for today

Why the HGP couldn't finish
The technologies that changed everything
CHM13 · the special cell line
The X chromosome · 2020 · proof of concept
What's new in T2T-CHM13 · 2022
Finishing the Y chromosome · 2023
Summary & what comes next

§ 1

Why couldn't the HGP finish?

Short reads · 100–200 base pairs

HGP-era Sanger reads were under 1,000 bp · later short-read platforms read 100–200 bp
Worked great for unique DNA sequence
Reassembled like overlapping puzzle pieces
One method, used on the whole genome

Where short reads fail

Short-read sequencing fails in repetitive regions while long reads succeed — **Figure 1.** Short reads (100–200 bp) cannot place themselves in long repetitive regions — every repeat looks the same. Long reads (20 kb to 100+ kb) span multiple repeats in one read, anchoring the sequence uniquely.

The book analogy

Imagine a book where dozens of pages all read: "and then they walked".

You can read each page
You cannot tell which page goes where
The story falls apart in the middle
That's the centromere problem
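The placement problem can be reproduced on a toy genome. The sketch below is an illustration only: the sequence is random, the repeat unit (50 bp) and read lengths are scaled-down assumptions, and exact string matching stands in for real read alignment.

```python
import random

def count_placements(genome: str, read: str) -> int:
    """Count every start position where `read` matches `genome` exactly."""
    count = 0
    start = genome.find(read)
    while start != -1:
        count += 1
        start = genome.find(read, start + 1)  # allow overlapping matches
    return count

def random_dna(length: int, rng: random.Random) -> str:
    """Random DNA string; a stand-in for unique sequence."""
    return "".join(rng.choice("ACGT") for _ in range(length))

rng = random.Random(0)
left_flank = random_dna(300, rng)
right_flank = random_dna(300, rng)
repeat_array = random_dna(50, rng) * 40   # 2,000 bp: one 50-bp unit, 40 copies
genome = left_flank + repeat_array + right_flank

short_read = genome[1000:1100]  # 100 bp taken from inside the array
long_read = genome[200:2400]    # whole array plus 100 bp of flank on each side

print(count_placements(genome, short_read))  # many equally good placements
print(count_placements(genome, long_read))   # a single placement
```

The short read is the page that says "and then they walked": it fits in many places. The long read carries unique sequence on both sides of the array, so only one position fits.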

The four hard regions

Centromeres · alpha satellite arrays · 100s of kb to Mb
Telomeres · TTAGGG repeats at chromosome ends
Ribosomal DNA · 45 kb units repeated 100s of times
Segmental duplications · large blocks · 90–99% identical

Together: ~240 million base pairs of "dark matter".

Let's name the four kinds of regions that broke short reads. Centromeres — the alpha satellite arrays at chromosome centers, ranging from hundreds of kilobases to several megabases of repeats. Telomeres — the T-T-A-G-G-G repeat caps at chromosome ends. Ribosomal DNA arrays — the genes that build ribosomes, with each unit forty-five kilobases long, repeated dozens to hundreds of times in tandem. And segmental duplications — large blocks of DNA, ten thousand to millions of base pairs long, copied across the genome at ninety to ninety-nine percent sequence identity. Together these regions are about two hundred and forty million base pairs. People used to call this "dark matter" — visible in microscopy, invisible to sequencing. The T two T project finally turned the lights on.

§ 2

The technologies that changed it

Two long-read platforms

PacBio HiFi

~20,000 bp per read
Very high accuracy
Resolves most repeats

Oxford Nanopore

>100,000 bp per read
"Ultra-long" reads
Spans even mega-repeats

Plus four supporting methods

Illumina short reads · for error correction
Hi-C · maps how DNA folds in 3D
Bionano optical mapping · long-range physical map
Strand-seq · single-cell strand-specific reads · orders and orients contigs

String graphs · not a line

Old assemblies = a single linear sequence
T2T uses a graph: nodes = sequences, edges = overlaps
Like a subway map · multiple possible routes
Long reads pick the correct path through repeats

The string graph · seeing the tangle

T2T-CHM13 string graph showing tangled repetitive regions — **Figure 2.** The CHM13 string graph. Each line is a sequence; intersections show overlaps. Tangled regions = highly repetitive zones (centromeric satellites, ribosomal DNA arrays). Long reads pick the correct path through each tangle · Nurk et al. 2022, *Science* (preprint *bioRxiv*). CC-BY 4.0.

Here is what a string graph actually looks like, taken straight from the Nurk twenty twenty-two paper. Every line you see represents a piece of D N A sequence, and every place where lines cross is a place where two sequences overlap. Notice the long, clean stretches — those are the easy parts of the genome where the path is unambiguous. Then look at the dense, tangled regions. Those are the hard parts. The biggest tangles correspond to centromeric satellite arrays and the ribosomal D N A repeats on the acrocentric chromosome short arms — exactly the regions the H G P could not finish. Inside one of those tangles there are dozens or hundreds of possible routes that connect the same start to the same end. Without long reads, you cannot tell which route is correct. With long reads spanning right through the tangle, you can. That is the entire trick of T two T assembly, distilled into one picture.
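To make the "which route is correct" idea concrete, here is a toy graph with one repeat node. This is a simplified sketch, not the assembler used by the T2T consortium: the node names, the `max_visits` cap, and the way a read is represented (as a list of node names) are all assumptions made for illustration.

```python
from typing import Dict, List

# Toy assembly graph: nodes are sequence segments, edges are overlaps.
# "R" is a repeat, so it has two possible exits ("B" or "D").
GRAPH: Dict[str, List[str]] = {
    "A": ["R"],
    "R": ["B", "D"],
    "B": ["C"],
    "C": ["R"],
    "D": [],
}

def enumerate_walks(graph, start, end, max_visits=2):
    """All walks from start to end visiting each node at most max_visits times."""
    walks = []

    def extend(walk, visits):
        node = walk[-1]
        if node == end:
            walks.append(walk)
            return
        for nxt in graph[node]:
            if visits.get(nxt, 0) < max_visits:
                extend(walk + [nxt], {**visits, nxt: visits.get(nxt, 0) + 1})

    extend([start], {start: 1})
    return walks

def supports(walk: List[str], read: List[str]) -> bool:
    """True if the read's node order appears contiguously in the walk."""
    n = len(read)
    return any(walk[i:i + n] == read for i in range(len(walk) - n + 1))

walks = enumerate_walks(GRAPH, "A", "D")
short_read = ["R"]            # sits entirely inside the repeat
long_read = ["A", "R", "B"]   # spans the repeat and both neighbours

print([w for w in walks if supports(w, short_read)])  # both candidate walks survive
print([w for w in walks if supports(w, long_read)])   # only A-R-B-C-R-D survives
```

The read confined to the repeat is compatible with every route. The read that enters from A and leaves into B rules out the shortcut A-R-D, leaving one path.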

What it looks like

T2T-CHM13 assembly ideogram showing newly added regions in red — **Figure 2.** T2T-CHM13 assembly ideogram. Red regions = new sequence not in GRCh38, including all centromeres and the short arms of acrocentric chromosomes 13, 14, 15, 21, 22 · Nurk et al. 2022, *Science*.

Final accuracy

1 / 10,000,000

error rate · one mistake per ten million bases

Better than the original HGP standard
Verified across many independent methods
The most accurate human genome ever assembled
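Error rates like this are often quoted as Phred-scaled quality values, QV = -10 · log10(error rate). The sketch below converts the slide's figure. The comparison value of one error per ten thousand bases is the commonly cited finished-sequence standard from the HGP era; it is added here for context and is not stated on the slide.

```python
import math

def phred_qv(error_rate: float) -> float:
    """Phred-scaled quality value for a per-base error rate."""
    return -10 * math.log10(error_rate)

t2t_error_rate = 1 / 10_000_000     # from the slide above
hgp_standard_rate = 1 / 10_000      # assumption: commonly cited HGP-era standard

print(f"T2T-CHM13: QV{phred_qv(t2t_error_rate):.0f}")        # QV70
print(f"HGP-era standard: QV{phred_qv(hgp_standard_rate):.0f}")  # QV40
```

Each step of ten in QV is a tenfold drop in error rate, so the gap between the two numbers is three orders of magnitude.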

§ 3

What is so special about CHM13?

A complete hydatidiform mole

Origin: an egg that has lost its nuclear genome
Fertilized by a single sperm, whose genome is then duplicated
Result: two identical copies of every chromosome
All from the father · none from the mother

Why this simplifies assembly

CHM13 has two identical copies of each chromosome, simplifying assembly — **Figure 3.** Typical diploid: maternal + paternal copies differ slightly, must be distinguished. CHM13: two identical paternal copies — effectively haploid for assembly.

The catch

46,XX

CHM13 karyotype

Two X chromosomes · no Y
So T2T-CHM13 has no Y
Y was finished separately in 2023
Using a different cell line · HG002

§ 4

2020 · the first complete chromosome

Why start with X?

Initial assembly was broken in only three places
X centromere (DXZ1) was well-studied
X is medically important · many disease loci
CHM13 has two X copies · no Y interference

Three places where X was still broken

Initial CHM13 X chromosome assembly showing three break locations — **Figure 3.** Initial CHM13 X assembly. Breaks at the **centromere** (artificially collapsed), a 120-kb segmental duplication (*DMRTC1B*), and a 134-kb segmental duplication with a paralogue on chromosome 2. Black bars = GRCh38 gaps; red bars = known segmental duplications · Miga et al. 2020, *Nature*. CC-BY 4.0.

Here is the figure from Miga twenty twenty Nature that shows exactly where the X chromosome was still broken. Three locations. The centromere — D X Z one — which had been artificially collapsed in the assembly because the methods could not handle it. A one hundred twenty kilobase segmental duplication near a gene called D M R T C one B. And a one hundred thirty-four kilobase segmental duplication that has a near-identical paralogue on chromosome two. The black bars on the figure mark gaps that were also present in the GRCh thirty-eight reference — these were inherited from the H G P era. The red bars show segmental duplications. Notice how the breaks cluster at exactly the kinds of regions we discussed earlier — repeats and segmental duplications. The figure makes the abstract idea concrete. Three specific places, each requiring its own solution. The rest of the chapter is about how they solved each one.

The centromere · 3.1 megabases

3.1 Mb of alpha satellite DNA
~1,408 copies of a 2,057-bp repeat unit
Standard polishing methods made it worse
Reads kept landing in the wrong place

Marker-assisted polishing

Find 21-bp sequences that appear only once in the genome
Even inside DXZ1: a marker every 2.3 kb on average
Anchor reads using these unique markers
Iterate: Nanopore → PacBio → Illumina

Find what is unique even inside repeats.

Their solution was a beautiful idea called marker-assisted polishing. Even inside a highly repetitive array like the X centromere, the repeats are not perfectly identical — there are tiny variations. The team scanned the genome for short twenty-one base pair sequences that appear only once in the whole genome, and they used these as unique markers. Inside the D X Z one centromeric array, they found a unique marker every two point three kilobases on average. So even though most of the array looked the same, they had little anchor points scattered through it. They then used these markers to correctly place reads, polishing the assembly iteratively — first with Nanopore reads, then PacBio, then Illumina. Each round improved accuracy. The headline insight: even inside repeats, find what is unique.
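The core idea, finding k-mers that occur exactly once, can be sketched in a few lines. This is a toy illustration under stated assumptions, not the published pipeline: the 60-bp repeat unit, the array size, and the function name `unique_kmer_starts` are invented here, and the scan runs over one small string rather than a whole genome.

```python
from collections import Counter
import random

K = 21  # marker length mentioned in the lecture

def unique_kmer_starts(seq: str, k: int = K) -> list[int]:
    """Start positions of k-mers that occur exactly once in `seq`."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    return [i for i, kmer in enumerate(kmers) if counts[kmer] == 1]

rng = random.Random(1)
unit = "".join(rng.choice("ACG") for _ in range(60))  # toy repeat unit with no T
perfect_array = unit * 20                             # 20 identical copies

# One copy picks up a single substitution to T, a base absent from the unit.
site = 630
variant_array = perfect_array[:site] + "T" + perfect_array[site + 1:]

print(len(unique_kmer_starts(perfect_array)))  # 0: nothing to anchor on
markers = unique_kmer_starts(variant_array)
print(markers[0], markers[-1], len(markers))   # 610 630 21
```

A perfectly identical array offers no anchors at all. A single variant base creates 21 unique 21-mers, one for each window that covers it, and a read carrying any of them can be placed at that copy and nowhere else.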

Validating the centromere structure

Validation of the 3.1 Mb DXZ1 centromeric array structure — **Figure 4.** Multiple independent methods confirm the DXZ1 array structure: PFGE Southern blots, ddPCR copy counting, optical mapping, and 33 catalogued structural variants · Miga et al. 2020, *Nature*.

The team validated the centromere assembly using several independent methods, shown in this figure from the Miga twenty twenty Nature paper. They used pulsed-field gel electrophoresis — a way of sizing very large D N A molecules — and confirmed the array was about two point eight seven megabases. They used droplet digital P C R to count repeat copies and got fourteen hundred and eight, matching the assembly. They used optical mapping to check the restriction enzyme pattern. And they catalogued thirty-three different structural variants within the array — places where the basic twelve-mer repeat unit had been altered into eleven-mers, eighteen-mers, twenty-two-mers, and so on. This level of detail had never been seen before. The X centromere went from a black box to a fully described biological object.
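The independent estimates can be lined up with simple arithmetic. The sketch below uses only numbers quoted in this lecture; the comparison itself is an illustration and not an analysis from the paper.

```python
# All inputs are figures quoted in this lecture.
REPEAT_UNIT_BP = 2_057        # DXZ1 repeat unit
DDPCR_COPIES = 1_408          # copy count from droplet digital PCR
PFGE_SIZE_BP = 2_870_000      # array size from pulsed-field gels
ASSEMBLED_SIZE_BP = 3_100_000 # array size in the assembly

implied_bp = REPEAT_UNIT_BP * DDPCR_COPIES
print(f"ddPCR implies {implied_bp / 1e6:.2f} Mb")  # 2.90 Mb

for label, size in [("PFGE", PFGE_SIZE_BP), ("assembly", ASSEMBLED_SIZE_BP)]:
    diff = abs(size - implied_bp) / size
    print(f"{label}: {size / 1e6:.2f} Mb, differs by {diff:.1%}")
```

The three estimates sit within a few percent of one another, which is the sense in which they agree; none of these methods is expected to give an exact base count.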

What the complete X delivered

Closed 29 gaps · 1.15 million bases of new sequence
Met the original Bermuda standard for finished genomes
Complete GAGE, CT45, CT47 gene families
First complete pseudoautosomal regions (PAR1 + PAR2)

Bonus · methylation across the centromere

Nanopore reads also detect DNA methylation
Found a 93 kb hypomethylated dip inside DXZ1
Probably where kinetochore proteins bind
Same dip seen later on chromosome 8

And one bonus finding from the X paper. Nanopore sequencing has the unusual property that it can detect D N A methylation directly from the same reads used for sequencing. So once the centromere was assembled, the team could also map methylation across it at single-base resolution. They found something striking — a ninety-three kilobase region inside D X Z one that is consistently unmethylated, surrounded by heavily methylated DNA. That hypomethylated region is most likely where kinetochore proteins bind during cell division — the actual functional core of the centromere. They later assembled the centromere of chromosome eight and saw the same pattern. So this hypomethylated dip may be a general signature of where centromeres do their actual job. None of this would have been visible without finishing the assembly.
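One simple way to locate a dip like this is to bin methylation calls and look for the longest run of low-methylation bins. The sketch below runs on synthetic data: the 1-kb bin size, the methylation levels, the 0.5 threshold, and the position of the dip are all invented for illustration, and the function name is hypothetical.

```python
def longest_low_run(values, threshold):
    """Return (start, end) of the longest run of values below threshold; end is exclusive."""
    best = (0, 0)
    run_start = None
    # A trailing sentinel at the threshold closes any run that reaches the end.
    for i, value in enumerate(list(values) + [threshold]):
        if value < threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start > best[1] - best[0]:
                best = (run_start, i)
            run_start = None
    return best

BIN_KB = 1  # assumption: one value per 1-kb bin

# Synthetic 3,100-bin array: mostly methylated, with one 93-bin open block.
methylation = [0.9] * 3100
for i in range(1500, 1593):
    methylation[i] = 0.1

start, end = longest_low_run(methylation, threshold=0.5)
print(start, end, (end - start) * BIN_KB, "kb")  # 1500 1593 93 kb
```

Real methylation tracks are noisy, so a practical version would smooth the signal or tolerate short interruptions before calling a run.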

Methylation across the complete X

Methylation patterns across the complete X chromosome — **Figure 5.** Methylation patterns from Nanopore reads. **(a)** Hypomethylated PAR1; blue = unmethylated, red = methylated. **(b)** A 93-kb hypomethylated region inside the DXZ1 centromere — the likely kinetochore-binding core. **(c)** The DXZ4 macrosatellite array showing a sharp methylated→unmethylated transition · Miga et al. 2020, *Nature*. CC-BY 4.0.

Here is the methylation map from the same Miga twenty twenty paper. Three panels. Panel a — the pseudoautosomal region one, P A R one, at the very tip of the short arm of X. It is consistently unmethylated, shown in blue, with methylated bases in red. Panel b is the most famous one — that ninety-three kilobase hypomethylated dip sitting inside the D X Z one centromere. The surrounding centromeric repeats are heavily methylated, but this one block is open. That is almost certainly where the kinetochore actually binds during cell division — the functional center of the centromere, identified for the first time. Panel c shows the D X Z four macrosatellite array, which marks the boundary between two large topological superdomains of the X chromosome — and you can see a sharp methylation transition right at that boundary. None of these patterns were visible before the X was finished. Methylation, kinetochore biology, and chromosome topology — all readable from the same set of long reads.

§ 5

What's new in T2T-CHM13?

Region 1 · centromeric satellite arrays

All 22 autosomes + X centromeres complete
Sizes: 366 kb (Y) to several megabases
Different chromosomes use different alpha satellite variants
Some centromeres (chr 1, 5, 19) share sequence

Why centromeres matter

Centromere errors → chromosome missegregation → aneuploidy → cancer, Down syndrome.

Kinetochores attach here during cell division
Most aneuploid embryos are non-viable
Aneuploidy in adult cells associated with cancer

Region 2 · segmental duplications

GRCh38

5.00%

T2T-CHM13

6.61%

201.93 Mb of segmental duplications · driver of structural variation.
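The T2T-CHM13 percentage can be cross-checked against the genome size quoted earlier in the lecture. This is a minimal sketch that assumes the 6.61% is measured against the 3,055,000,000 bp total.

```python
SEGDUP_MB = 201.93   # segmental duplication content from the slide
GENOME_MB = 3_055.0  # assumption: total assembly length quoted earlier

segdup_percent = 100 * SEGDUP_MB / GENOME_MB
print(f"{segdup_percent:.2f}%")  # 6.61%
```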

Segmental duplications · trait + disease

Adaptive

AMY1 · salivary amylase
Copy number ↔ dietary starch

Disease

Many genomic disorders
FSHD · muscular dystrophy
Charcot-Marie-Tooth, etc.

Region 3 · acrocentric short arms

Chromosomes 13, 14, 15, 21, 22
Short arm carries ribosomal DNA arrays
Each rDNA unit = 45 kb, repeated 100s of times
CHM13 has ~400 rDNA copies total

The nucleolus connection

The five acrocentric short arms cluster together in the cell
That cluster forms the nucleolus
Where ribosome biogenesis happens
Short arms share ~98.7% sequence identity

Here is something that fits beautifully into cell biology. In the nucleus of every cell, the five acrocentric short arms cluster together physically — the proteins and RNA that need to find ribosomal D N A all gather there, and that gathering point is what we call the nucleolus. The nucleolus is the factory where ribosomes are assembled. And here is the kicker — the short arms across these five different chromosomes share about ninety-eight point seven percent sequence identity. They are essentially clones of each other. Why so similar? Probably because they are physically close in the nucleolus and exchange DNA frequently. So the genome architecture and the nuclear architecture reinforce each other. Cool result, only visible because T two T finally assembled these regions.

Quick reference · four regions resolved by T2T

Region	Size in T2T	Why it matters
Centromeric satellites	366 kb – several Mb	Cell division
Segmental duplications	201.93 Mb (6.61%)	Structural variation
Acrocentric short arms	66.1 Mb total	Ribosome biogenesis
Yq12 heterochromatin	>30 Mb	Unknown · evolving fast

§ 6

2023 · finishing the Y

Why the Y was so hard

Long palindromes · inverted repeats that read the same on both strands
Many tandem repeats · larger than other chromosomes
Massive heterochromatic block on Yq12
More than half of Y was missing in GRCh38
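A DNA palindrome is a sequence equal to its own reverse complement. The sketch below shows the definition and why it confuses read placement. The sequences are toy examples chosen for illustration and are far shorter than the palindromes on the Y.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def is_dna_palindrome(seq: str) -> bool:
    """True if both strands read the same in the 5' to 3' direction."""
    return seq == reverse_complement(seq)

arm = "GATTACA"
palindrome = arm + reverse_complement(arm)  # left arm + inverted copy

print(is_dna_palindrome("GAATTC"))    # True: a classic restriction site
print(is_dna_palindrome(arm))         # False
print(is_dna_palindrome(palindrome))  # True

# A read from the left arm also matches the right arm on the other strand.
read = arm
print(palindrome.find(read), palindrome.find(reverse_complement(read)))  # 0 7
```

Because a read from one arm matches the other arm on the opposite strand, a read shorter than the arm cannot tell the two apart. The read has to reach past the arm into unique sequence.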

T2T-Y · 62.46 megabases · zero gaps

Complete structure of the human Y chromosome from T2T-Y — **Figure 5.** T2T-Y assembly: 62.46 Mb, no gaps · ampliconic gene clusters, palindromes, centromere, and the previously hidden Yq12 satellite blocks revealed for the first time · Rhie et al. 2023, *Nature*.

What the Y carries

SRY · the master male-sex-determining gene
TSPY · 45 protein-coding copies (vs 7 in GRCh38)
DAZ, RBMY · spermatogenesis
AZF regions · deletions cause male infertility

The mysterious Yq12

Was a single 30+ Mb gap in GRCh38
Now: alternating blocks of DYZ1 + DYZ2
Some blocks recently duplicated up to 5 Mb
HSat1B (DYZ2) is almost unique to Y + acrocentrics

§ 7

Summary

What to take away

HGP left ~8% unsequenced · centromeres, telomeres, rDNA, segdups
Long reads (PacBio HiFi + Oxford Nanopore) finally span repeats
CHM13 · two identical paternal copies · simplifies assembly
2020 X chromosome · 2022 T2T-CHM13 · 2023 T2T-Y
Combined: T2T-CHM13v2.0 · the first truly complete human genome

Five things to take away. One — the HGP left about eight percent of the genome unsequenced, concentrated in centromeres, telomeres, ribosomal DNA arrays, and segmental duplications. Two — long-read sequencing, specifically PacBio HiFi for accuracy and Oxford Nanopore for ultra-long reads, finally allowed the assembly to span those repetitive regions. Three — the cell line C H M thirteen, with its two identical paternal chromosome copies, simplified the assembly problem from diploid to effectively haploid. Four — the project ran in three milestones: the X chromosome in twenty twenty as proof of concept, the full forty-six X X assembly as T two T dash C H M thirteen in twenty twenty-two, and the Y from a different cell line in twenty twenty-three. Five — combined, these become T two T dash C H M thirteen version two point zero, the first truly complete human genome — every base, every chromosome, telomere to telomere.

Next lecture

Where did CHM13 come from?

Chapter 3 · The CHM13 Cell Line