BSMS205 · Genetics

The Human
Genome Project

Chapter 1 · Part I · The Human Genome

A question to start with

Imagine a car
with no manual.

What we knew · what we didn't

Knew

  • Some disease genes (cystic fibrosis, sickle cell)
  • DNA carries information
  • Mendelian inheritance

Didn't know

  • How many genes humans have
  • Where most of them sit
  • What most DNA is doing

Each gene was a multi-year quest

  • Start almost from scratch for each new gene
  • Indirect methods to localize roughly
  • Narrow down bit by bit over years
Like searching for a house in a city
with no map and no address.

The audacious goal
3,000,000,000
DNA base pairs · the entire human genome
  • One complete reference for all biology
  • The periodic table equivalent for genetics
  • Launched 1990 · originally a 15-year plan

Roadmap for today

  1. Why we needed a genome project
  2. The four goals of the HGP
  3. The journey · 1990 to 2003
  4. The big surprise · only 20,000 genes
  5. How it transformed genetics
  6. What it left unfinished — the 8%
  7. Summary & what comes next
§ 1

Before
the Map

One gene at a time

  • Each gene = a multi-year project
  • No standard coordinate system
  • No way to compare findings across labs
  • Most of the genome had never been sequenced

The transformation in one picture

Pre-HGP gene-by-gene research vs post-HGP genome-wide research
Figure 1. Pre-HGP (left): gene-by-gene research with no shared map. Post-HGP (right): all genes accessible on a shared coordinate system. Like switching from local landmarks to GPS.
§ 2

Four
Interconnected Goals

Goal 1 · Sequence the entire genome

  • Determine the order of all ~3 billion nucleotides
  • Output: a single reference sequence
  • A standard "text" for locating genes & comparing genomes
The first complete map
of an unexplored continent.

Goal 2 · Identify all human genes

The expectation

~100,000
genes (upper end of pre-HGP estimates)

The reality (spoiler)

~20,000
genes

A surprise that changed our view of biology.

Goal 3 · Drive new technologies

  • 1990: sequencing one gene took months
  • 3 billion bases at that pace → centuries
  • HGP forced advances in sequencing, robotics, computing
  • The technology push was as important as the science
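
The "centuries" estimate is plain division. A minimal sketch follows; the slides give no throughput number, so `bases_per_day` is a purely illustrative assumption, not a historical measurement.

```python
GENOME_SIZE_BP = 3_000_000_000  # ~3 billion base pairs, the figure used in the slides


def years_to_sequence(bases_per_day: float, genome_size_bp: int = GENOME_SIZE_BP) -> float:
    """Years needed to read the genome once at a constant daily throughput."""
    return genome_size_bp / bases_per_day / 365


# Hypothetical throughput of 10,000 bases per day (assumption, for illustration only).
print(f"{years_to_sequence(10_000):.0f} years")  # about 822 years
```

Changing the assumed throughput by a factor of ten changes the answer by the same factor, which is why faster machines mattered as much as more labs.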

Goal 4 · Map genetic variation

  • The reference is one sequence — but humans differ
  • Catalog landmarks: SNPs · single-letter differences
  • Use them as mile markers on each chromosome
  • Link variants → diseases, traits, ancestry
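
A SNP is a position where two aligned sequences differ by one letter. The sketch below finds such positions in toy sequences invented for illustration; it assumes the sequences are already aligned and equal in length, so insertions and deletions are out of scope.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch.

    Assumes both sequences are pre-aligned and the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


# Toy sequences (not real genomic data).
reference = "ATGCCGTAAGCT"
sample = "ATGCTGTAAGCA"
print(find_snps(reference, sample))  # [(5, 'C', 'T'), (12, 'T', 'A')]
```

Each reported position acts as one "mile marker": a fixed place on the reference where individuals can be compared.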

Four goals · one infrastructure

Four interconnected goals of the HGP
Figure 2. The HGP was not just sequencing — it was infrastructure: reference sequence, gene catalog, technology platforms, variation map. Each goal made the others more useful.
§ 3

The Journey
1990 → 2003

1990 – 1998 · learning by doing

  • Did not jump straight into human DNA
  • Practiced on E. coli, yeast, C. elegans
  • Worm genome completed 1998
  • Built the methods and the muscle

1999 · first complete human chromosome

22
chromosome 22 · the first to finish
  • One of the smallest autosomes
  • Proved the methods worked at human scale
  • A test case for the rest of the genome

2000 · the race

Public HGP

  • International consortium
  • Hierarchical, careful
  • Open data within 24 h

Celera Genomics

  • Private · Craig Venter
  • Whole-genome shotgun
  • Faster · proprietary

June 2000: both announce ~90% draft. Joint Clinton–Blair press conference.

2001 · two papers, two journals

Nature

  • The public consortium
  • Open data, open method

Science

  • Celera
  • Whole-genome shotgun
Drafts — not finished. Gaps and errors remained, especially in repeats.

2003 · the "finished" genome

  • Announced April 2003 · 50 years after Watson & Crick
  • 99% of euchromatic regions sequenced
  • Error rate < 1 / 100,000 bases
  • Later reference builds: GRCh37 (2009), then GRCh38 / hg38 (2013)

Timeline at a glance

Year      Milestone
1990      HGP launches · 15-year plan
1995–98   Practice on E. coli, yeast, C. elegans
1999      Chromosome 22 finished
2000      Draft (~90%) · public + Celera
2001      Drafts published · Nature & Science
2003      "Finished" · 99% euchromatin
2009      GRCh37 released
2013      GRCh38 released
2022      T2T-CHM13 closes the remaining 8%
§ 4

Only
20,000 Genes?

The final count
~20,000
protein-coding genes in the human genome
  • Predicted: 50,000 – 100,000
  • Reality: ~20,000 – 25,000
  • The expectation had to be thrown out

Gene count vs complexity

Organism                      Genes     Comment
Human                         ~20,000   You are reading this
Roundworm (C. elegans)        ~20,000   959 cells, no brain
Fruit fly (D. melanogaster)   ~14,000   Compound eye, simple nervous system
Rice (O. sativa)              ~40,000   Twice as many as us

Rice has twice as many protein-coding genes as you do.

So what makes humans complex?

  • Alternative splicing — one gene → many proteins
  • Gene regulation — when, where, how much
  • Regulatory networks — combinations of inputs
  • Non-coding sequence — controls everything above
Complexity comes from orchestration, not inventory.
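
To see how one gene can give many proteins, the sketch below enumerates transcripts for a hypothetical gene whose middle exons can each be included or skipped. The exon names are invented, and the count is an upper bound: real splicing is regulated, so not every combination is actually produced.

```python
from itertools import product


def enumerate_isoforms(first_exon: str, cassette_exons: list[str], last_exon: str) -> list[list[str]]:
    """All transcripts obtained by independently keeping or skipping each cassette exon."""
    isoforms = []
    for choices in product([True, False], repeat=len(cassette_exons)):
        kept = [exon for exon, keep in zip(cassette_exons, choices) if keep]
        isoforms.append([first_exon, *kept, last_exon])
    return isoforms


# Hypothetical gene: E1 and E5 always present, E2 to E4 optional.
isoforms = enumerate_isoforms("E1", ["E2", "E3", "E4"], "E5")
print(len(isoforms))  # 8, i.e. 2**3
```

With n optional exons the bound is 2**n, so a modest gene inventory can still encode a large protein repertoire.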

Only 1.5% codes for protein

1.5%
of the genome encodes proteins
  • The other 98.5%: regulatory, structural, repeats
  • Once dismissed as "junk DNA"
  • Now: known to carry much of the genome's regulatory function
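
In absolute terms, 1.5% of the genome is still a lot of sequence. A quick calculation using only the round figures from the slides:

```python
GENOME_SIZE_BP = 3_000_000_000  # ~3 billion bp, round figure from the slides
CODING_FRACTION = 0.015         # ~1.5% protein-coding, from the slides

coding_bp = round(GENOME_SIZE_BP * CODING_FRACTION)
noncoding_bp = GENOME_SIZE_BP - coding_bp

print(f"coding:     {coding_bp:,} bp")     # 45,000,000 bp
print(f"non-coding: {noncoding_bp:,} bp")  # 2,955,000,000 bp
```

Both inputs are approximations, so the outputs are order-of-magnitude figures rather than measurements.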

The picture, in one figure

Gene number vs biological complexity across species
Figure 3. Humans (~20k) ≈ worms (~20k), < rice (~40k). Complexity comes from regulation and splicing, not gene count.
§ 5

How It Changed
Genetics

From gene-by-gene to genome-wide

Pre-HGP

  • One gene at a time
  • Years per gene
  • Indirect localization

Post-HGP

  • All 20,000 genes at once
  • Compare cancer vs normal: every gene
  • RNA-seq, GWAS, exome studies

A common reference · GPS for the genome

  • Every variant has an address: chr7:117,559,593
  • Any researcher can look it up, replicate, extend
  • Made GWAS possible — thousands compared on the same axis
  • Disease gene discovery: years → months
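
An address such as chr7:117,559,593 is just a chromosome name plus a 1-based position. The sketch below parses that form; `GenomicPosition` and `parse_position` are hypothetical names written for this example, not part of any genomics library.

```python
import re
from typing import NamedTuple


class GenomicPosition(NamedTuple):
    chrom: str
    pos: int  # 1-based coordinate on the chromosome


# Accepts "chr7:117,559,593" as well as "chr7:117559593".
_POSITION_RE = re.compile(r"^(chr[0-9A-Za-z]+):([0-9][0-9,]*)$")


def parse_position(text: str) -> GenomicPosition:
    """Split a 'chrN:position' string into chromosome and integer coordinate."""
    match = _POSITION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a genomic position: {text!r}")
    chrom, digits = match.groups()
    return GenomicPosition(chrom, int(digits.replace(",", "")))


print(parse_position("chr7:117,559,593"))
# GenomicPosition(chrom='chr7', pos=117559593)
```

An address only identifies a base relative to a specific reference build, so in practice the build name (for example GRCh38) travels with the coordinate.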

The cost crash

Year    Cost per genome   Note
1990    ~$3 billion       Estimated · pre-HGP technology
2003    ~$300 million     HGP completion
2008    ~$1 million       NGS arrives
Today   < $1,000          Routine in clinics

Cost dropped faster than Moore's Law.
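
The comparison with Moore's Law can be checked against the two dated rows of the cost table (2003 and 2008). The sketch assumes Moore's Law means a halving of cost every 2 years, which is a common paraphrase rather than a precise definition.

```python
import math


def fold_reduction(cost_start: float, cost_end: float) -> float:
    """How many times cheaper the end cost is than the start cost."""
    return cost_start / cost_end


def halving_time_years(cost_start: float, cost_end: float, years: float) -> float:
    """Years per halving, assuming a steady exponential decline."""
    return years / math.log2(cost_start / cost_end)


years = 2008 - 2003
observed = fold_reduction(300e6, 1e6)             # table values: $300 million -> $1 million
moore = 2 ** (years / 2)                          # assumed: one halving every 2 years
halving = halving_time_years(300e6, 1e6, years)

print(f"observed: {observed:.0f}-fold cheaper in {years} years")
print(f"Moore's Law pace: {moore:.1f}-fold in {years} years")
print(f"observed halving time: {halving:.2f} years")
```

On these figures sequencing cost halved roughly every 0.6 years over that window, against 2 years under the assumed Moore's Law pace.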

The Bermuda Principles · open data

All sequence data released publicly
within 24 hours of generation.
  • Agreed at a 1996 meeting in Bermuda
  • No patents · no paywalls · no waiting for publication
  • The genome belongs to everyone
  • Set the cultural standard for genomics ever since
§ 6

The 8%
That Remained

Where the gaps were

  • Centromeres · highly repetitive chromosome centers
  • Telomeres · repetitive caps at chromosome ends
  • Ribosomal DNA arrays · hundreds of near-identical copies
  • Segmental duplications · large, near-identical blocks
~240 million base pairs · not "junk" — essential.

Why the gaps existed

Short reads cannot span
long, near-identical repeats.
  • HGP-era Sanger reads covered only several hundred bases each
  • Identical repeats look the same → can't be placed
  • Like a book with many pages all reading: "and then they walked"
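
The placement problem can be reproduced on a toy sequence. In the sketch below the genome, repeat unit, and reads are all invented for illustration, and exact string matching stands in for real read alignment.

```python
def find_all(genome: str, read: str) -> list[int]:
    """Return every 0-based start position where `read` matches `genome` exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


# Toy genome: unique left flank, one repeat unit copied 4 times, unique right flank.
repeat_unit = "ACGTTG"
genome = "GGATCC" + repeat_unit * 4 + "TTCAGA"

short_read = repeat_unit                       # lies entirely inside the repeat
long_read = "ATCC" + repeat_unit * 4 + "TTCA"  # spans the repeat and both flanks

print(find_all(genome, short_read))  # [6, 12, 18, 24] -> four equally good placements
print(find_all(genome, long_read))   # [2] -> one unambiguous placement
```

The short read fits in four places, so an assembler cannot tell how many repeat copies exist; the long read is anchored by unique sequence on both sides and fits only once.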

Short reads vs long reads

Short-read vs long-read sequencing eras
Figure 4. Left: HGP-era short reads leave gaps in repetitive regions. Right: long-read sequencing (T2T, 2022) closes those gaps and produces the first truly gapless human genome.
§ 7

Summary

What to take away

  • HGP 1990 – 2003 · 3 billion bp · public + Celera
  • Four goals: sequence · catalog · technology · variation
  • Surprise: only ~20,000 genes; complexity = regulation
  • Cost: $3 B → < $1,000 · Bermuda Principles
  • ~8% gaps remained; closed by T2T in 2022

Next lecture

If 92% was already done,
why finish it?

Chapter 2 · The Telomere-to-Telomere Project