BSMS205 · Genetics

The CHM13
Cell Line

Chapter 3 · Part I · The Human Genome

A puzzle to start with

Why couldn't they just
sequence a normal person?

The two-puzzle problem

Normal diploid

Two different chromosome copies
~4–5 million heterozygous variants
Repeats: which copy did the read come from?

What an assembler wants

Both copies identical
No ambiguity in repeats
One puzzle, not two interleaved ones

Where the gaps lived

151,000,000

base pairs of unknown sequence in GRCh38

Mostly in repetitive regions
Centromeres · rDNA arrays · segmental duplications
Heterozygosity made them impossible to assemble

Roadmap for today

Why heterozygosity breaks assembly
What is a complete hydatidiform mole
"Functionally haploid" · the magic property
The hTERT trick · making cells immortal
Quality control · ancestry · ethics
Why CHM13 specifically · and its limits
Summary & what comes next

§ 1

Why Heterozygosity
Breaks Assembly

One difference per thousand bases

1 / 1,000

bases differ between maternal and paternal copies

= about 0.1% of the genome
= ~4–5 million heterozygous variants
= the normal, healthy human state

Where the read came from · matters

Sequencing breaks DNA into millions of fragments
Computer reassembles by finding overlaps
In repeats: many fragments look nearly identical
Maternal? Paternal? Cannot tell → gaps or errors
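
To make the ambiguity concrete, here is a minimal sketch with toy sequences. It is not a real assembler: the two "haplotypes", the repeat unit, and the `find_placements` helper are all invented for illustration, and real tools tolerate mismatches rather than requiring exact matches.

```python
def find_placements(read, haplotypes):
    """Return (haplotype_name, offset) for every exact occurrence of read."""
    placements = []
    for name, seq in haplotypes.items():
        start = seq.find(read)
        while start != -1:
            placements.append((name, start))
            start = seq.find(read, start + 1)
    return placements

# Toy data (hypothetical): an 8-base repeat unit flanked by unique sequence.
unit = "ACGTTGCA"
haplotypes = {
    "maternal": "GGATCCTA" + unit * 4 + "TTGACCAG",
    "paternal": "GGATCGTA" + unit * 5 + "TTGACCAG",  # one SNV, one extra unit
}

unique_read = "GGATCCTA"   # covers the maternal allele of the flank SNV
repeat_read = unit * 2     # lies entirely inside the repeat array

unique_hits = find_placements(unique_read, haplotypes)
repeat_hits = find_placements(repeat_read, haplotypes)

# If both copies are identical (the CHM13-like case), there is only one
# distinct target sequence, so "which copy?" stops being a question.
chm_like = {"copy1": haplotypes["paternal"], "copy2": haplotypes["paternal"]}
distinct_targets = len(set(chm_like.values()))
```

The read that covers the variant has a single home. The read from inside the repeat fits at several offsets on both copies, which is the situation the slide describes.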

The puzzle problem · in one figure

Heterozygosity makes assembly of repetitive regions ambiguous — **Figure 1.** Why heterozygosity breaks genome assembly. Reads from two similar-but-not-identical chromosome copies cannot be confidently assigned in repeats — leaving gaps or errors.

The GRCh38 compromise

Built from BAC clones · 100–200 kb pieces in bacteria
Pieces came from multiple individuals
Result: a mosaic of haplotypes
~151 Mb of N's in repetitive regions

Mosaic + heterozygosity + short reads = unfinishable.

How did the original Human Genome Project deal with this? Imperfectly. The reference that became GRCh38 was built from bacterial artificial chromosomes (BACs): large pieces of DNA, about 100 to 200 kb each, cloned into bacteria and then sequenced. Those BACs came opportunistically from multiple donors, so the resulting reference is a mosaic of haplotypes from different people stitched together. That reduced heterozygosity locally, because each clone carries a single haplotype, but it introduced its own structural inconsistencies where segments from different donors meet. And in the worst regions (centromeres, ribosomal DNA arrays, segmental duplications) the technology of the day simply gave up and recorded N's. The T2T team realized that to finish those regions, you cannot patch the strategy. You need a fundamentally different starting material.
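
A figure like "151 Mb of N's" is just a count of N characters in the reference FASTA. The sketch below shows the idea on a tiny in-memory FASTA; the record names and sequences are made up, and `count_n_bases` is a hypothetical helper rather than part of any published pipeline.

```python
from io import StringIO

def count_n_bases(fasta_handle):
    """Return {sequence_name: number of N/n bases} for a FASTA stream."""
    counts = {}
    name = None
    for line in fasta_handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: keep the first word as the record name.
            name = line[1:].split()[0]
            counts[name] = 0
        elif name is not None:
            counts[name] += line.upper().count("N")
    return counts

# Toy FASTA (hypothetical records, not real reference sequence).
toy_fasta = StringIO(
    ">chrToy1 made-up record with a gap\n"
    "ACGTNNNNNNACGT\n"
    "NNNNACGT\n"
    ">chrToy2\n"
    "ACGTACGT\n"
)

n_counts = count_n_bases(toy_fasta)
total_n = sum(n_counts.values())
```

Run against a real reference file opened with `open(path)`, the same loop would give the per-chromosome gap totals.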

§ 2

What Is a
Hydatidiform Mole?

Normal fertilization · 23 + 23

Egg contributes 23 chromosomes from mom
Sperm contributes 23 chromosomes from dad
Zygote: diploid, 46 chromosomes
Two different versions of every chromosome

What goes wrong in a CHM

Egg loses its nucleus — no maternal DNA
Sperm fertilizes the empty egg
Sperm DNA duplicates itself (endoreduplication)
Result: 46 chromosomes — but all from dad
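
The consequence for heterozygosity can be sketched in a few lines. The ten-base "genomes" below are invented, and the model ignores everything except which parental copy ends up in the zygote.

```python
def count_het_sites(copy_a, copy_b):
    """Count positions where the two chromosome copies differ."""
    return sum(1 for a, b in zip(copy_a, copy_b) if a != b)

# Toy haploid genomes (hypothetical sequences).
egg = "ACGTACGTAC"
sperm = "ACGAACGTTC"

normal_zygote = (egg, sperm)    # one maternal copy + one paternal copy
chm_zygote = (sperm, sperm)     # empty egg, sperm genome duplicated

normal_het = count_het_sites(*normal_zygote)
chm_het = count_het_sites(*chm_zygote)
```

Both zygotes have two copies, but only the normal one has positions where the copies disagree.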

Normal vs CHM · in one figure

Normal fertilization vs complete hydatidiform mole formation — **Figure 2.** Normal fertilization (top) creates a diploid heterozygous genome. A complete hydatidiform mole (bottom) is diploid but homozygous — two identical paternal chromosome sets.

Why 46,XX · not 46,YY

Most CHMs have 46,XX karyotype
Sperm carrying X duplicates → XX
Or two X-bearing sperm fertilize one empty egg
YY is not viable — humans need at least one X
CHM13 is 46,XX · all from one father

A serious moment

The biological source matters

A CHM is a failed pregnancy
Tissue donated from a real patient in 1993
Used with consent and ethical review
The data set has a human story behind it

§ 3

"Functionally
Haploid"

Three states · clearly distinguished

Term	Chromosomes	Het variants	Example
True haploid	23	0 (no pair)	Sperm, Egg
Diploid (normal)	46	~4–5 million	You, me
Functionally haploid	46	~few thousand	CHM13

Diploid in number. Haploid in information.

The "functionally haploid" picture

Functionally haploid explained: diploid in number, homozygous in sequence — **Figure 3.** CHM13 has 46 chromosomes (diploid number) but both copies are nearly identical (homozygous) — behaving like a haploid genome for assembly purposes.

Reduction in heterozygosity

Normal diploid

~4,500,000

het variants

CHM13

~few thousand

het variants

Reduction: roughly 99.9%. Below 0.01% of the genome.
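
A quick check of the arithmetic, as a sketch. The slide only says "a few thousand", so the 3,000 below is an assumed stand-in, and the percentages move with that choice. The genome size is the 3.055 Gbp quoted for the CHM13 assembly later in the deck.

```python
GENOME_BP = 3_055_000_000   # CHM13 assembly size quoted in this deck
NORMAL_HET = 4_500_000      # mid-point of the 4-5 million on the slide
CHM13_HET = 3_000           # ASSUMPTION: stand-in for "a few thousand"

# Fraction of heterozygous variants removed relative to a normal diploid.
reduction = 1 - CHM13_HET / NORMAL_HET

# Fraction of genome positions that are still heterozygous in CHM13.
het_fraction_of_genome = CHM13_HET / GENOME_BP

print(f"reduction in het variants: {reduction:.2%}")
print(f"het sites as share of genome: {het_fraction_of_genome:.6%}")
```

With these inputs the reduction is a little over 99.9%, and the residual heterozygous sites are far below 0.01% of genome positions.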

Not perfect · but close enough

A few thousand residual heterozygous variants
One megabase-scale deletion in chr15 rDNA array
From rare endoreduplication errors + culture mutations
< 0.01% of the genome — assembly stays simple

Two puzzles vs one

Normal diploid: two similar puzzles
with pieces mixed in one box.
CHM13: one puzzle, twice over.

Centromere: 2 kb repeat × 1,400 copies
Diploid: mom 1,387 vs dad 1,421, all slightly different
CHM13: same count, same sequence — solvable

Hold the puzzle analogy in your head. In a normal diploid, you have two similar but distinct jigsaw puzzles whose pieces have been mixed into one box. Some pieces fit only puzzle A, some fit only puzzle B, and in the repetitive regions (the solid blue sky) many pieces seem to fit either. The assembler has an impossible task. In CHM13, both puzzles are essentially identical photocopies, so it does not matter which piece goes where. Concretely: if a centromere has a 2 kb repeat copied about 1,400 times, in a normal diploid mom might have 1,387 copies, dad might have 1,421, and the copies differ slightly in sequence. The assembler cannot reconstruct that. In CHM13, both copies have the same number and the same sequence, and the problem becomes solvable.
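
The centromere example works out as follows. The copy numbers are the illustrative ones from the slide, and the 20 kb read length is borrowed from the HiFi figure quoted elsewhere in this deck.

```python
REPEAT_UNIT_BP = 2_000      # 2 kb repeat unit (slide example)
READ_LEN_BP = 20_000        # ~20 kb HiFi read (from the sequencing slide)

maternal_copies = 1_387
paternal_copies = 1_421

maternal_len = maternal_copies * REPEAT_UNIT_BP
paternal_len = paternal_copies * REPEAT_UNIT_BP
length_gap = paternal_len - maternal_len

# In the CHM13-like case both copies carry the same array.
chm13_copies = 1_400
chm13_gap = chm13_copies * REPEAT_UNIT_BP - chm13_copies * REPEAT_UNIT_BP

# A read that lands inside the array is made of nothing but repeat units.
units_per_read = READ_LEN_BP // REPEAT_UNIT_BP

print(maternal_len, paternal_len, length_gap, units_per_read)
```

So the two parental arrays are each close to 2.8 Mb but differ by 68 kb in length, and a single 20 kb read inside either array spans about ten near-identical units with no unique sequence to anchor it.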

§ 4

The hTERT Trick
Making Cells Immortal

The Hayflick limit

Normal somatic cells divide 40–60 times
Then they stop · senescence
Limit set by telomeres · TTAGGG caps
Each division: telomere shortens by 50–200 bp

Why a limit even exists

Senescence is a tumor-suppressor:
damaged cells cannot divide forever.

Healthy in the body · protects against cancer
Bad for research · cells run out mid-project
T2T needed billions of identical cells over years

Telomerase to the rescue

Telomerase = enzyme that adds TTAGGG back
Two parts:
TERT (catalytic protein)
TERC (RNA template)
Naturally on in germ cells, stem cells
Off in most adult tissues · one contributor to cellular aging

Cancer's shortcut

85–95%

of cancers reactivate telomerase

The other 5–15% use ALT
Both bypass the Hayflick limit
Cancer = immortalized + dysregulated

Adding hTERT · the engineered fix

Introduce human TERT gene via viral vector
Cells make telomerase continuously
Telomeres stay long and stable
Cells can divide indefinitely
Crucially: chromosomes don't change sequence
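
Here is a toy model of the telomere clock with and without telomerase. The starting length (10 kb) and the critical length (5 kb) are assumptions chosen for illustration and do not come from the slides; the 100 bp loss per division sits inside the 50 to 200 bp range given earlier. Real telomere dynamics are noisier than this.

```python
def divisions_until_senescence(start_bp, loss_per_division_bp, critical_bp,
                               telomerase_gain_bp=0, max_divisions=1_000):
    """Count divisions until telomere length drops to critical_bp.

    Returns None if the cell is still dividing after max_divisions.
    """
    length = start_bp
    for division in range(1, max_divisions + 1):
        # Each division shortens the telomere; telomerase adds repeats back.
        length = length - loss_per_division_bp + telomerase_gain_bp
        if length <= critical_bp:
            return division
    return None

START_BP = 10_000      # ASSUMPTION: illustrative starting telomere length
CRITICAL_BP = 5_000    # ASSUMPTION: illustrative senescence threshold
LOSS_BP = 100          # within the 50-200 bp per division range

without_tert = divisions_until_senescence(START_BP, LOSS_BP, CRITICAL_BP)
with_tert = divisions_until_senescence(START_BP, LOSS_BP, CRITICAL_BP,
                                       telomerase_gain_bp=LOSS_BP)
```

Under these assumptions the unmodified cell stops after 50 divisions, inside the 40 to 60 range of the Hayflick limit, while the cell that replaces what it loses never reaches the threshold.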

What hTERT bought T2T

Unlimited DNA across the whole project
Genetic stability through many passages
DNA from year 1 = DNA from year 5
Multiple sequencing platforms · same source material

Same cells. Same DNA. For years.

§ 5

Quality Control
Ancestry · Ethics

Karyotyping · two methods

G-banding

Stain → light/dark band patterns
Reveals chromosome structure
Detects translocations, deletions

Spectral karyotyping (SKY)

Each chromosome → unique color
Detects chromosome swaps at a glance
Confirms 46,XX with no abnormalities

The actual karyotype

CHM13 karyotype confirmed by SKY and G-banding — **Figure 4.** CHM13 karyotyping. (a) Spectral karyotyping (SKY) — each chromosome a different color. (b) G-banding — staining patterns confirm normal 46,XX. · Miga et al., *Nature* 2020, Ext. Data Fig. 1 (CC-BY 4.0).

Sequence-level QC

Uniform read coverage · no big deletions or duplications
Low heterozygosity confirmed everywhere
Same DNA sequence across multiple years
Stable across many passages
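
One simple way to picture the "uniform read coverage" check is to compare the depth in each genomic window against the median. The sketch below is an assumption about how such a check could look, not the method used by the T2T team; the window depths and the 0.75 and 1.25 thresholds are invented.

```python
from statistics import median

def flag_depth_outliers(window_depths, low_ratio=0.75, high_ratio=1.25):
    """Return (index, depth, ratio) for windows far from the median depth."""
    typical = median(window_depths)
    flagged = []
    for i, depth in enumerate(window_depths):
        ratio = depth / typical
        if ratio < low_ratio or ratio > high_ratio:
            flagged.append((i, depth, round(ratio, 2)))
    return flagged

# Toy per-window read depths (hypothetical). Windows 4-5 look like a
# one-copy deletion (about half depth); window 8 looks like a duplication.
depths = [30, 31, 29, 30, 15, 16, 30, 32, 61, 30]

flagged = flag_depth_outliers(depths)
flagged_windows = [index for index, _, _ in flagged]
```

A clean line shows no flagged windows. A window at roughly half the typical depth is the signature you would expect from a deletion present on only one of the two copies.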

Ancestry · what's in the genome

Analyzed via maximum likelihood admixture
~70–80% European ancestry
Small admixture: South Asian, East Asian, Native American
~1–2% Neanderthal DNA · like most non-Africans
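
To give a feel for what "maximum likelihood admixture" means, here is a deliberately tiny two-population sketch. It is not the software or the reference panel used for CHM13: the allele frequencies, genotypes, and function names are all invented, and the fit is a plain grid search over one ancestry fraction.

```python
import math

def log_likelihood(q, genotypes, freq_a, freq_b):
    """Log-likelihood of ancestry fraction q (share from population A)."""
    total = 0.0
    for g, fa, fb in zip(genotypes, freq_a, freq_b):
        # Expected allele frequency for an individual with ancestry mix q.
        p = q * fa + (1 - q) * fb
        # g copies of the allele out of 2 (binomial, constant term dropped).
        total += g * math.log(p) + (2 - g) * math.log(1 - p)
    return total

def estimate_ancestry(genotypes, freq_a, freq_b, steps=100):
    """Grid-search the q in [0, 1] with the highest log-likelihood."""
    grid = [i / steps for i in range(steps + 1)]
    return max(grid, key=lambda q: log_likelihood(q, genotypes, freq_a, freq_b))

# Hypothetical allele frequencies at five markers in two reference groups.
freq_a = [0.9, 0.8, 0.1, 0.7, 0.2]
freq_b = [0.1, 0.3, 0.8, 0.2, 0.9]

# Hypothetical genotypes (copies of the counted allele: 0, 1, or 2).
looks_like_a = [2, 2, 0, 1, 0]
looks_mixed = [1, 1, 1, 1, 1]

q_like_a = estimate_ancestry(looks_like_a, freq_a, freq_b)
q_mixed = estimate_ancestry(looks_mixed, freq_a, freq_b)
```

The first individual comes out almost entirely population A, the second lands near the middle. A real analysis does the same kind of fit over many populations and hundreds of thousands of markers.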

The admixture plot

Maximum likelihood admixture analysis of CHM13 ancestry — **Figure 5.** Maximum-likelihood admixture analysis. CHM13 (highlighted bar) is predominantly European, with smaller contributions from other reference populations. Reading the figure: each vertical bar = one individual; each color = one ancestral population component · Miga et al. 2020, *Nature*, Ext. Data Fig. 2 (CC-BY 4.0).

Here is the admixture analysis figure from Miga et al. 2020 itself. Every vertical bar in the plot is one individual. Every color in the bar is the estimated fraction of that person's genome from a different ancestral population, for example European, African, East Asian, or South Asian. CHM13 sits among the predominantly European bars but with visible streaks of other colors, consistent with a small admixed component. The point of showing this is transparency. When you use a single genome as a reference, you should know whose genome it is and where in the global ancestry landscape it sits. CHM13 is one specific point on this plot, and the rest of human diversity lives in the bars around it. That is why we need pangenomes, which is the topic of Chapter 4.

Does ancestry matter for T2T?

What CHM13 gives

Complete structural template
No heterozygosity confusion
Gap-free assembly

What it doesn't give

Population-specific variants
Diversity across humans
Heterozygous structure

A point worth making explicit

No single genome is "humanity"

The limit isn't CHM13's European ancestry per se
It's that any single genome is partial
An African or East Asian CHM would have the same issue
Solution: pangenome · many genomes together

§ 6

Why CHM13?
And Its Limits

Six reasons CHM13 won

Already well-characterized since the 1990s
Stable 46,XX karyotype across passages
X-bearing → could finish X chromosome first
hTERT immortalization worked cleanly
High-molecular-weight DNA → ultra-long reads
Community-supported: Genome in a Bottle, shared protocols

Six reasons CHM13 specifically. One: it had been used in genomics studies since the 1990s, so it was already well-characterized. Two: it maintained a stable 46,XX karyotype through many passages, which is unusual for long-cultured lines. Three: having two X chromosomes meant the project could attack the X chromosome first, which had been extensively studied and was finished in 2020 as a proof of concept. Four: the hTERT immortalization worked without introducing chromosomal abnormalities. Five: CHM13 cells produce high-molecular-weight DNA, which is required for ultra-long-read Nanopore sequencing where some reads are over a megabase. And six: by 2017 the line was already part of the Genome in a Bottle reference materials, so multiple labs had shared protocols and validation data. The combination is what mattered.

What T2T actually used CHM13 for

30× PacBio HiFi · ~20 kb high-accuracy reads
50× Oxford Nanopore ultra-long · > 100 kb
100× Illumina short reads · for polishing
Plus Hi-C, Bionano, Strand-seq
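
Depth figures like "30×" translate into raw data volume with one multiplication. The sketch below takes the depths and read lengths on this slide at face value and uses the 3.055 Gbp assembly size quoted in this deck; the helper names are made up.

```python
GENOME_BP = 3_055_000_000   # CHM13 assembly size quoted in this deck

def bases_needed(depth, genome_bp=GENOME_BP):
    """Total sequenced bases required for a given average depth."""
    return depth * genome_bp

def approx_read_count(depth, read_len_bp, genome_bp=GENOME_BP):
    """Rough number of reads if every read had length read_len_bp."""
    return bases_needed(depth, genome_bp) // read_len_bp

hifi_bases = bases_needed(30)                        # 30x HiFi
hifi_reads = approx_read_count(30, 20_000)           # ~20 kb reads
ont_reads_upper = approx_read_count(50, 100_000)     # reads of at least 100 kb

print(hifi_bases, hifi_reads, ont_reads_upper)
```

That is roughly 92 Gb of HiFi sequence in about 4.6 million reads. Because the ultra-long reads are longer than 100 kb, the Nanopore read count is an upper bound of about 1.5 million.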

CHM13 vs GRCh38 · the contrast

Feature	CHM13	GRCh38
Source	One CHM, duplicated paternal	Mosaic, multiple donors
Het variants	~few thousand	~4–5 M per individual
Gaps	Zero (3.055 Bbp)	~151 Mb
Centromeres	All 24 complete	Mostly absent
Acrocentric arms	66.1 Mb resolved	Almost entirely missing
Y chromosome	From HG002 in v2.0	> 50% missing

Side by side. CHM13 comes from one complete hydatidiform mole with duplicated paternal DNA. GRCh38 is a mosaic of multiple donors stitched together via BAC clones. CHM13 has only a few thousand heterozygous variants. GRCh38 effectively carries millions, plus structural inconsistencies between donor segments. CHM13 has zero gaps across 3.055 billion base pairs. GRCh38 has 151 million base pairs of N's. CHM13 resolves all 24 centromeres completely. GRCh38 had them as placeholders. CHM13 resolves the acrocentric short arms (66.1 Mb) that GRCh38 had largely as gaps. And the Y chromosome, which CHM13 lacks because it is 46,XX, is added in version 2.0 from a different male genome called HG002. This is the chimeric reference, and it is the cost of using a 46,XX line.

What CHM13 cannot tell you

Population diversity · only one source genome
Phasing · which variants travel together on a chromosome
Compound heterozygotes · two different alleles per gene
Allele-specific expression · maternal vs paternal output
Heterozygous structural variants · common in real people

The chimeric reference problem

CHM13 is 46,XX · no Y
T2T-CHM13 v2.0 stitches in HG002 Y
The reference is now chimeric
Two different genetic backgrounds in one file

§ 7

Summary

What to take away

Heterozygosity in repeats prevented complete assembly
Complete hydatidiform mole = empty egg + duplicated sperm DNA
CHM13 is functionally haploid: 46 chromosomes, < 0.01% het
hTERT = unlimited, stable cells across years and platforms
One source genome → structural template, not full diversity

Five things to take away. One: heterozygosity in repetitive regions is what prevented complete assembly of the human genome for two decades. Two: complete hydatidiform moles arise when an empty egg is fertilized by a sperm whose DNA then duplicates, producing 46 chromosomes all from one father. Three: CHM13 is functionally haploid, with 46 chromosomes but less than 0.01% heterozygosity. Four: the hTERT modification gave the T2T project unlimited, stable cells across years and across multiple sequencing platforms. Five: CHM13 provides a complete structural template of a human genome, but represents only one source, so we still need pangenomes to capture diversity. Hold these five points and you have the chapter.

Next lecture

One genome is the map.
How do we capture everyone?

Chapter 4 · The Human Pangenome
