BSMS205 · Genetics

The Human
Genome Project

Chapter 1 · Part I · The Human Genome

A question to start with

Imagine a car
with no manual.

What we knew · what we didn't

Knew

Some disease genes (CF, sickle cell)
DNA carries information
Mendelian inheritance

Didn't know

How many genes humans have
Where most of them sit
What most DNA is doing

Each gene was a multi-year quest

Start almost from scratch for each new gene
Indirect methods gave only a rough location
Narrow down bit by bit over years

Like searching for a house in a city
with no map and no address.

The audacious goal

3,000,000,000

DNA base pairs · the entire human genome

One complete reference for all biology
Genetics' equivalent of the periodic table
Launched 1990 · originally a 15-year plan

Roadmap for today

Why we needed a genome project
The four goals of the HGP
The journey · 1990 to 2003
The big surprise · only 20,000 genes
How it transformed genetics
What it left unfinished — the 8%
Summary & what comes next

§ 1

Before
the Map

One gene at a time

Each gene = a multi-year project
No standard coordinate system
No way to compare findings across labs
Most of the genome was literally unknown

The transformation in one picture

**Figure 1.** Pre-HGP gene-by-gene research vs post-HGP genome-wide research. Left: gene-by-gene research with no shared map. Right: all genes accessible on a shared coordinate system. Like switching from local landmarks to GPS.

§ 2

Four
Interconnected Goals

Goal 1 · Sequence the entire genome

Determine the order of all ~3 billion nucleotides
Output: a single reference sequence
A standard "text" for locating genes & comparing genomes

The first complete map
of an unexplored continent.

Goal 2 · Identify all human genes

The expectation

~100,000

genes (by some early predictions)

The reality (spoiler)

~20,000

genes

A surprise that changed our view of biology.

Goal 3 · Drive new technologies

1990: sequencing one gene took months
3 billion bases at that pace → centuries
HGP forced advances in sequencing, robotics, computing
The technology push was as important as the science
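
A back-of-envelope sketch of the arithmetic behind "centuries". The throughput figure is an illustrative assumption, not a number from the lecture; only the ~3 billion base genome size comes from the slides. The function name and the `labs` parameter are hypothetical.

```python
GENOME_SIZE_BP = 3_000_000_000  # ~3 billion bases, from the slides


def years_to_sequence(genome_bp: int, bases_per_month: float, labs: int = 1) -> float:
    """Years needed at a constant pace, with `labs` groups working in parallel."""
    months = genome_bp / (bases_per_month * labs)
    return months / 12


# Assumption (hypothetical): one lab finishes about 10,000 bases per month.
ASSUMED_BASES_PER_MONTH = 10_000

for labs in (1, 100):
    years = years_to_sequence(GENOME_SIZE_BP, ASSUMED_BASES_PER_MONTH, labs)
    print(f"{labs:>3} lab(s): about {years:,.0f} years")
```

Under these assumed numbers, even 100 labs working in parallel would need about 250 years, which is why faster technology was a goal in its own right.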

Goal 4 · Map genetic variation

The reference is one sequence — but humans differ
Catalog landmarks: SNPs · single-letter differences
Use them as mile markers on each chromosome
Link variants → diseases, traits, ancestry
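
A minimal sketch of what "single-letter differences" means in practice. The two sequences are made up for illustration, and the helper `find_snps` is hypothetical; it assumes the sequences are already aligned and equal in length, so it cannot detect insertions or deletions.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (position, ref_base, alt_base)
        for position, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


# Toy sequences, invented for this example.
reference = "ACGTACGTAC"
sample = "ACGTTCGTAA"

print(find_snps(reference, sample))  # [(5, 'A', 'T'), (10, 'C', 'A')]
```

Each tuple is a candidate "mile marker": a position on the reference plus the letter that differs in this individual.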

Four goals · one infrastructure

**Figure 2.** Four interconnected goals of the HGP. The HGP was not just sequencing; it was infrastructure: reference sequence, gene catalog, technology platforms, variation map. Each goal made the others more useful.

§ 3

The Journey
1990 → 2003

1990 – 1998 · learning by doing

Did not jump straight into human DNA
Practiced on E. coli, yeast, C. elegans
Worm genome completed 1998
Built the methods and the muscle

1999 · first complete human chromosome

chromosome 22 · the first to finish

One of the smallest autosomes
Proved the methods worked at human scale
A test case for the rest of the genome

2000 · the race

Public HGP

International consortium
Hierarchical, careful
Open data within 24 h

Celera Genomics

Private · Craig Venter
Whole-genome shotgun
Faster · proprietary

June 2000: both announce ~90% draft. Joint Clinton–Blair press conference.

2001 · two papers, two journals

Nature

The public consortium
Open data, open method

Science

Celera
Whole-genome shotgun

Drafts — not finished. Gaps and errors remained, especially in repeats.

2003 · the "finished" genome

Announced April 2003 · 50 years after Watson & Crick
99% of euchromatic regions sequenced
Error rate < 1 / 100,000 bases
Later refined into the reference assemblies GRCh37 (2009) and GRCh38 (2013, hg38)

Timeline at a glance

Year	Milestone
1990	HGP launches · 15-year plan
1995–98	Practice on E. coli, yeast, C. elegans
1999	Chromosome 22 finished
2000	Draft (~90%) · public + Celera
2001	Drafts published · Nature & Science
2003	"Finished" · 99% euchromatin
2009	GRCh37 released
2013	GRCh38 released
2022	T2T-CHM13 closes the remaining 8%

§ 4

Only
20,000 Genes?

The final count

~20,000

protein-coding genes in the human genome

Predicted: 50,000 – 100,000
Reality: ~20,000 – 25,000
The expectation had to be thrown out

Gene count vs complexity

Organism	Genes	Comment
Human	~20,000	You are reading this
Roundworm (C. elegans)	~20,000	959 somatic cells, 302 neurons
Fruit fly (D. melanogaster)	~14,000	Compound eye, simple nervous system
Rice (O. sativa)	~40,000	Twice as many as us

Rice has twice as many protein-coding genes as you do.

So what makes humans complex?

Alternative splicing — one gene → many proteins
Gene regulation — when, where, how much
Regulatory networks — combinations of inputs
Non-coding sequence — controls everything above

Complexity comes from orchestration, not inventory.
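
A toy sketch of why alternative splicing multiplies output without adding genes. It assumes a deliberately simplified model in which each optional ("cassette") exon is independently kept or skipped. Real splicing is regulated, and not every combination occurs. The exon names and the `isoforms` helper are hypothetical.

```python
from itertools import product


def isoforms(first_exon: str, cassette_exons: list[str], last_exon: str) -> list[str]:
    """Enumerate transcripts when each cassette exon is independently kept or skipped."""
    transcripts = []
    for choices in product([True, False], repeat=len(cassette_exons)):
        kept = [exon for exon, keep in zip(cassette_exons, choices) if keep]
        transcripts.append("-".join([first_exon, *kept, last_exon]))
    return transcripts


# Hypothetical gene: two fixed exons (E1, E5) and three optional ones.
transcripts = isoforms("E1", ["E2", "E3", "E4"], "E5")

print(len(transcripts))  # 2**3 = 8 possible transcripts from one gene
for transcript in transcripts:
    print(transcript)
```

In this model the number of possible transcripts doubles with every optional exon, so a modest gene inventory can still yield a large protein repertoire.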

Only 1.5% codes for protein

1.5%

of the genome encodes proteins

The other 98.5%: regulatory, structural, repeats
Once dismissed as "junk DNA"
Now: hotspot of regulatory function

The picture, in one figure

**Figure 3.** Gene number vs biological complexity across species. Humans (~20k) ≈ worms (~20k) < rice (~40k). Complexity comes from regulation and splicing, not gene count.

§ 5

How It Changed
Genetics

From gene-by-gene to genome-wide

Pre-HGP

One gene at a time
Years per gene
Indirect localization

Post-HGP

All ~20,000 genes at once
Compare cancer vs normal: every gene
RNA-seq, GWAS, exome studies

A common reference · GPS for the genome

Every variant has an address: chr7:117,559,593
Any researcher can look it up, replicate, extend
Made GWAS possible: thousands of individuals compared on the same coordinates
Disease gene discovery: years → months
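
A small sketch of how an address such as chr7:117,559,593 can be turned into structured data. The `GenomicPosition` type and `parse_position` function are hypothetical names for illustration, not part of any particular library, and the pattern accepts only the `chrN:position` style used on the slide.

```python
import re
from typing import NamedTuple


class GenomicPosition(NamedTuple):
    chrom: str
    pos: int  # 1-based position on the chromosome


# Accepts chr1..chr22, chrX, chrY, chrM, then a position with optional commas.
_COORD = re.compile(r"^(chr(?:[1-9]|1\d|2[0-2]|X|Y|M)):(\d[\d,]*)$")


def parse_position(text: str) -> GenomicPosition:
    """Parse a 'chrN:position' string into chromosome name and integer position."""
    match = _COORD.match(text.strip())
    if match is None:
        raise ValueError(f"not a chrN:position string: {text!r}")
    return GenomicPosition(match.group(1), int(match.group(2).replace(",", "")))


print(parse_position("chr7:117,559,593"))
```

An address only identifies the same base for everyone when the assembly is also stated (for example GRCh37 or GRCh38), because positions shift between assembly versions.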

The cost crash

Year	Cost per genome	Note
1990	~$3 billion	Estimated · pre-HGP technology
2003	~$300 million	HGP completion
2008	~$1 million	NGS arrives
Today	< $1,000	Routine in clinics

Cost dropped faster than Moore's Law.
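
A rough check of the Moore's Law comparison, using the approximate 2003 and 2008 figures from the table. It assumes a smooth exponential decline between the two points and takes "Moore's Law" to mean a halving roughly every two years; both are simplifying assumptions, and the function name is hypothetical.

```python
import math


def halving_time_years(cost_start: float, cost_end: float, years: float) -> float:
    """Years for the cost to halve, assuming a smooth exponential decline."""
    halvings = math.log2(cost_start / cost_end)
    return years / halvings


# Approximate figures from the table: ~$300 million (2003) to ~$1 million (2008).
hgp_to_ngs = halving_time_years(300e6, 1e6, 2008 - 2003)

# Assumption: Moore's Law treated as a halving roughly every 2 years.
MOORE_HALVING_YEARS = 2.0

print(f"2003-2008 halving time: {hgp_to_ngs:.2f} years")
print(f"Faster than Moore's Law: {hgp_to_ngs < MOORE_HALVING_YEARS}")
```

With these inputs the cost halves roughly every 0.6 years, several times faster than the assumed two-year pace.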

The Bermuda Principles · open data

All sequence data released publicly
within 24 hours of generation.

Agreed at a 1996 meeting in Bermuda
No patents · no paywalls · no waiting for publication
The genome belongs to everyone
Set the cultural standard for genomics ever since

§ 6

The 8%
That Remained

Where the gaps were

Centromeres · highly repetitive chromosome centers
Telomeres · repetitive caps at chromosome ends
Ribosomal DNA arrays · hundreds of near-identical copies
Segmental duplications · large, near-identical blocks

~240 million base pairs · not "junk" — essential.

Why the gaps existed

Short reads cannot span
long, near-identical repeats.

HGP technology read DNA in short fragments
Identical repeats look the same → can't be placed
Like a book with many pages all reading: "and then they walked"
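
A toy illustration of the placement problem. The genome string and reads are made up, and exact string matching stands in for real read alignment, which is an assumption of this sketch. A read that lies entirely inside a repeat matches in several places, while a read long enough to reach the unique flanking sequence matches in exactly one.

```python
def placements(genome: str, read: str) -> list[int]:
    """Return every 0-based start position where `read` matches exactly."""
    hits = []
    start = genome.find(read)
    while start != -1:
        hits.append(start)
        start = genome.find(read, start + 1)
    return hits


# Toy genome: unique flanks around three copies of a 6-base repeat.
REPEAT = "ACGTTG"
genome = "TTCAGG" + REPEAT * 3 + "CCATGA"

short_read = "ACGTTG"                      # lies entirely inside the repeat
long_read = "CAGG" + REPEAT * 3 + "CCAT"   # spans the repeat and both flanks

print(placements(genome, short_read))  # [6, 12, 18] -> ambiguous
print(placements(genome, long_read))   # [2] -> unique
```

The short read could belong to any of three copies, so an assembler cannot tell how many copies exist or where the read goes; the long read anchors the whole array.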

Short reads vs long reads

**Figure 4.** Short-read vs long-read sequencing eras. Left: HGP-era short reads leave gaps in repetitive regions. Right: long-read sequencing (T2T, 2022) closes those gaps and produces the first truly gapless human genome.

§ 7

Summary

What to take away

HGP 1990 – 2003 · 3 billion bp · public + Celera
Four goals: sequence · catalog · technology · variation
Surprise: only ~20,000 genes; complexity = regulation
Cost: $3 B → < $1,000 · Bermuda Principles
~8% gaps remained — closed by T2T in 2022

Five things to take away. One: the HGP ran from 1990 to 2003, sequenced about 3 billion base pairs, and advanced as parallel public and private efforts, the international consortium and Celera. Two: it had four interconnected goals: a reference sequence, a gene catalog, a technology revolution, and a variation map. Three: the most surprising finding was that humans have only about 20,000 protein-coding genes, not the 50,000 to 100,000 expected, and that biological complexity comes from regulation and splicing, not gene count. Four: the cost of sequencing a genome crashed from billions of dollars to under $1,000, and the Bermuda Principles established open data as a cultural norm in genomics. Five: about 8% of the genome, mostly repetitive regions, remained as gaps until the Telomere-to-Telomere project closed them in 2022. Hold on to those five points; everything else you will read about the HGP fits inside them.

Next lecture

If 99% was already done,
why finish it?

Chapter 2 · The Telomere-to-Telomere Project