BSMS205 · Genetics

Next-Generation
Sequencing

Chapter 5 · Part I · The Human Genome
A question to start with

How do you read
3 billion letters
of DNA?

From Chapter 4 to here

  • Pangenome needed 47+ genomes from diverse populations
  • Each one assembled to T2T-quality
  • That is impossible at $300 million per genome
  • The pangenome presupposes cheap sequencing
No NGS · no pangenome · no modern genetics.
The audacious claim
500,000×
cost reduction in about two decades
  • 2003: $300 million per genome
  • 2014: $1,000 per genome
  • Today: under $600 per genome

Roadmap for today

  1. The Sanger bottleneck · why we needed something new
  2. The NGS idea · massive parallelization
  3. Key concepts · reads, coverage, depth, quality
  4. Illumina · the short-read workhorse
  5. PacBio & Nanopore · long reads
  6. The cost crash · economics of sequencing
  7. Summary & what comes next
§ 1

The Sanger
Bottleneck

Sanger · the gold standard

  • Developed in the 1970s by Frederick Sanger
  • Earned Sanger his second Nobel Prize (Chemistry, 1980)
  • Error rate < 0.1% · extremely accurate
  • Reads one fragment at a time · serial

The transcription analogy

Imagine transcribing a million books
but you can only copy one sentence at a time.
  • That is Sanger sequencing for the genome
  • 3 billion letters · one fragment at a time
  • Even with hundreds of machines running 24/7 · it took years

The bottleneck, visualized

Sanger sequencing as a serial bottleneck
Figure 1. Sanger sequencing processes DNA fragments sequentially, one at a time. Reading a whole genome means reading billions of fragments in series — extremely slow and expensive at scale.

What the field needed

  • Sequence not just one genome but thousands
  • Compare healthy vs sick · find disease variants
  • Study variation across populations
  • Diagnose patients in days, not decades
At $300M per genome · this was a fantasy.
§ 2

The NGS Idea
Massive Parallelism

One simple change

Sanger

  • One fragment at a time
  • Serial · slow
  • Years per genome

NGS

  • Millions at a time
  • Parallel · fast
  • Days per genome

Same chemistry idea — radically different scale.

The book analogy returns

One person copying sentences sequentially
becomes a million people, each on a different sentence.
  • Years collapse into days
  • Same total work · totally different time
  • That is the NGS bet

Massive parallelism, visualized

Sanger vs NGS — serial vs massively parallel
Figure 2. Sanger sequences one fragment at a time. NGS sequences millions of fragments simultaneously. The same chemistry idea, scaled by parallelism, dropped sequencing time from years to days.

From idea to industry · 2007 onwards

  • 2007 · first NGS platforms commercially available
  • 2014 · cost falls to ~$1,000 per genome
  • 2024 · ~$600 per Illumina genome
  • Cost dropped faster than Moore's Law
§ 3

Key Concepts
You'll Need

What is a "read"?

  • DNA is broken into fragments
  • Each fragment is sequenced → produces a read
  • A read = the letters of one fragment
  • One run = billions of reads
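
To make the fragment-to-read step concrete, here is a toy sketch in Python. The function name `sample_reads`, the fixed read length, and the uniform random start positions are illustrative assumptions; real library preparation yields fragments of varying length and is not perfectly uniform.

```python
import random

def sample_reads(genome, read_length, n_reads, seed=0):
    """Draw n_reads substrings of a fixed length from random positions in genome."""
    rng = random.Random(seed)  # seeded so the toy example is repeatable
    max_start = len(genome) - read_length
    reads = []
    for _ in range(n_reads):
        start = rng.randint(0, max_start)  # randint includes both endpoints
        reads.append(genome[start:start + read_length])
    return reads

# Hypothetical 39-letter "genome", only for illustration.
toy_genome = "GATTACAGGCACGTTACGTAGCTAGGCTTAACCGGATCC"
toy_reads = sample_reads(toy_genome, read_length=8, n_reads=5)
for read in toy_reads:
    print(read)
```

Each printed string is one read: the letters of one fragment, with no record of where in the genome it came from.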

Read length varies by platform

Platform              Typical read length
Illumina              50 – 300 bp
PacBio Revio (HiFi)   ~15,000 – 20,000 bp
Oxford Nanopore       up to > 1,000,000 bp

Illumina = short read. PacBio & Nanopore = long read.

Coverage vs depth

Coverage

  • What fraction of the genome was sequenced
  • e.g. 95% coverage = 5% missing

Depth

  • How many times each base was read
  • e.g. 30× depth = avg. 30 reads per base
Different questions · often confused.
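
The difference is easy to see on a toy example. The sketch below assumes reads are given as hypothetical `(start, length)` pairs on a 20-base genome; the function name and the numbers are made up for illustration.

```python
def coverage_and_depth(genome_length, reads):
    """Return (fraction of bases covered, mean depth) for reads given as (start, length)."""
    per_base = [0] * genome_length
    for start, length in reads:
        end = min(start + length, genome_length)  # clip reads at the genome end
        for pos in range(start, end):
            per_base[pos] += 1
    covered = sum(1 for depth in per_base if depth > 0)
    coverage = covered / genome_length
    mean_depth = sum(per_base) / genome_length
    return coverage, mean_depth

# Four 5-base reads on a 20-base genome: 20 sequenced bases in total.
toy_reads = [(0, 5), (3, 5), (4, 5), (12, 5)]
coverage, mean_depth = coverage_and_depth(20, toy_reads)
print(f"coverage = {coverage:.0%}, mean depth = {mean_depth:.1f}x")
# prints: coverage = 70%, mean depth = 1.0x
```

The mean depth is 1× because the number of sequenced bases equals the genome length, yet 30% of positions were never read because the reads overlap in some places and leave gaps in others.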

Why depth matters

  • Sequencing is not perfect · errors happen
  • Need multiple reads per base to call variants confidently
  • 20 of 30 reads show the same non-reference base → real variant
  • 1 of 2 reads disagrees → too few reads to tell an error from a variant

30× is the usual standard for reliable human whole-genome variant detection.
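
As a rough check on the 20-of-30 and 1-of-2 cases, here is a minimal sketch under a simple binomial model. The model is an assumption, not something the lecture specifies: each read is wrong at a given position independently with a 0.1% error rate (Q30), and, as a worst case, every wrong read shows the same wrong base.

```python
from math import comb

def prob_at_least_k_errors(n_reads, k, error_rate):
    """P(at least k of n_reads are wrong at one position) under a binomial model."""
    return sum(
        comb(n_reads, i) * error_rate**i * (1 - error_rate) ** (n_reads - i)
        for i in range(k, n_reads + 1)
    )

Q30_ERROR = 0.001            # 1 error in 1,000 bases
GENOME_SIZE = 3_000_000_000  # 3 billion positions

p_low_depth = prob_at_least_k_errors(2, 1, Q30_ERROR)     # 1 of 2 reads disagrees
p_high_depth = prob_at_least_k_errors(30, 20, Q30_ERROR)  # 20 of 30 reads disagree

print(f"2x depth, >= 1 wrong read:    {p_low_depth:.6f}")
print(f"30x depth, >= 20 wrong reads: {p_high_depth:.1e}")
print(f"expected error-only sites at 2x: {p_low_depth * GENOME_SIZE:,.0f}")
```

Under these assumptions, 20 matching errors out of 30 reads is essentially impossible, so that pattern points to a real variant. A single disagreeing read at 2× occurs at about 0.2% of positions, which across 3 billion positions is roughly six million sites.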

Reads, coverage, depth · in one figure

Reads, coverage, and depth explained
Figure 3. Coverage = what fraction of the genome was sequenced at all. Depth = how many times each position was read on average. Higher depth → more confidence in distinguishing real variants from sequencing errors.
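
A common back-of-the-envelope relation is mean depth ≈ (number of reads × read length) / genome size. The sketch below turns that around to estimate how many reads a target depth requires. It assumes every read base is usable and reads fall evenly across the genome, which real runs only approximate; the function name is illustrative.

```python
def reads_needed(target_depth, genome_size_bp, read_length_bp):
    """Smallest number of reads whose total bases reach target_depth x genome size."""
    total_bases = target_depth * genome_size_bp
    return -(-total_bases // read_length_bp)  # integer ceiling division

HUMAN_GENOME_BP = 3_000_000_000

short_reads = reads_needed(30, HUMAN_GENOME_BP, 150)    # Illumina-sized reads
long_reads = reads_needed(30, HUMAN_GENOME_BP, 15_000)  # HiFi-sized reads

print(f"150 bp reads for 30x: {short_reads:,}")  # prints 600,000,000
print(f"15 kb reads for 30x:  {long_reads:,}")   # prints 6,000,000
```

The same 30× depth takes 100 times fewer reads when each read is 100 times longer, because only the total number of sequenced bases matters for mean depth.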

Quality scores · how confident is the call?

Q score   Confidence   Error rate
Q20       99%          1 / 100
Q30       99.9%        1 / 1,000
Q40       99.99%       1 / 10,000

Q30+ bases = the gold standard for variant calling.
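
The table follows the Phred scale, Q = −10 · log10(error probability). A minimal sketch of the conversion in both directions is below. The FASTQ decoding step assumes the common Phred+33 ASCII encoding, which is not covered in this lecture, and the function names are illustrative.

```python
import math

def phred_to_error_prob(q):
    """Error probability for a Phred quality score q."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality score for an error probability p."""
    return -10 * math.log10(p)

def decode_fastq_quality(quality_string, offset=33):
    """Convert a FASTQ quality string to Q scores (assumes Phred+33 encoding)."""
    return [ord(ch) - offset for ch in quality_string]

for q in (20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error rate {p:g}, confidence {1 - p:.2%}")

print(decode_fastq_quality("I5?"))  # prints [40, 20, 30]
```

Every 10 points on the Q scale means a tenfold lower error rate, which is why Q30 (1 in 1,000) is a much stricter threshold than Q20 (1 in 100).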

§ 4

Illumina
The Workhorse

Sequencing by synthesis

  1. DNA fragments stick to a flat flow cell
  2. Each fragment is amplified into a cluster
  3. Add fluorescent A, C, G, T together · each cluster incorporates one base per cycle
  4. Camera snaps a picture of every cluster
  5. Wash · cleave · repeat · for hundreds of cycles

Why this works at scale

  • Millions of clusters · one camera shot reads all of them
  • Every cycle = one new base for every cluster
  • Hundreds of cycles → reads of ~150 – 300 bp
  • Billions of bases per run · routine
The flow cell is the parallelism made physical.
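
The cycle-by-cycle logic can be sketched as a toy simulation. This is a deliberate simplification, not Illumina's actual image processing: each cluster is represented directly by the string of bases it will incorporate, and one "image" is just the list of bases seen in that cycle. The function and variable names are illustrative.

```python
def sequence_by_synthesis(clusters, n_cycles):
    """Read every cluster in parallel, one base per cycle.

    clusters: list of strings, the bases each cluster incorporates in order.
    Returns one read per cluster, each n_cycles bases long.
    """
    reads = ["" for _ in clusters]
    for cycle in range(n_cycles):
        # One "camera image" per cycle: the base added in every cluster at once.
        image = [seq[cycle] if cycle < len(seq) else "N" for seq in clusters]
        # Extend every read by the base observed in this cycle.
        reads = [read + base for read, base in zip(reads, image)]
    return reads

toy_clusters = ["ACGT", "TTGA", "GGCC"]
toy_sbs_reads = sequence_by_synthesis(toy_clusters, n_cycles=3)
print(toy_sbs_reads)  # prints ['ACG', 'TTG', 'GGC']
```

The outer loop runs once per cycle, not once per cluster, so read length is set by the number of cycles while the number of reads is set by the number of clusters on the flow cell.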

Illumina · strengths & limits

Strengths

  • High throughput
  • Low cost (~$600/genome)
  • ~0.1% error · Q30+
  • Mature tools & pipelines

Limits

  • Short reads (~150 – 300 bp)
  • Trouble in repeats
  • PCR can add bias

When you reach for Illumina

  • Whole-genome sequencing · most clinical & research projects
  • Whole-exome sequencing · just the protein-coding regions
  • SNPs & small indels · routine variant calling
  • RNA-seq · gene expression measurement
  • Most clinical diagnostics today
§ 5

The Long-Read
Revolution

Why read length matters

Short reads cannot span long repeats; long reads can
Figure 4. Short reads cannot span long repetitive regions — they look identical from inside the repeat, leaving gaps. Long reads span the entire repeat, anchored in unique sequence at both ends.
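
The ambiguity can be shown with a toy genome: a 30-base tandem repeat flanked by unique sequence. The sequences and the helper `count_placements` are made up for illustration, and exact string matching stands in for a real read aligner.

```python
def count_placements(genome, read):
    """Count every position (overlaps included) where read matches genome exactly."""
    count = 0
    start = genome.find(read)
    while start != -1:
        count += 1
        start = genome.find(read, start + 1)
    return count

left_flank = "GATTACAGGC"
repeat = "ACGTT" * 6             # 30-base tandem repeat
right_flank = "TCCGGATAAC"
genome = left_flank + repeat + right_flank

short_read = repeat[5:15]        # 10 bases from inside the repeat
long_read = genome[5:45]         # spans the repeat plus 5 unique bases on each side

print(count_placements(genome, short_read))  # prints 5
print(count_placements(genome, long_read))   # prints 1
```

The short read fits equally well at five places, so it cannot say how long the repeat is or where it belongs. The long read has exactly one placement because its ends sit in unique sequence.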

PacBio · Single-Molecule Real-Time

  • One DNA polymerase at the bottom of a tiny well (ZMW)
  • One DNA molecule threads through it
  • Each base added → brief flash of fluorescent light
  • Detector records flashes in real time
  • Revio chip = 25 million wells running at once

HiFi reads · long and accurate

  • Circular DNA template (SMRTbell) · polymerase loops repeatedly
  • Same molecule sequenced many times
  • Consensus → 15 – 20 kb at Q30+
  • Accuracy comparable to Illumina · with 100× longer reads
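
The consensus idea can be illustrated with a column-wise majority vote. This is a toy stand-in, not PacBio's actual consensus algorithm: it assumes the passes are already aligned, have equal length, and contain only substitution errors, whereas real passes also contain insertions and deletions.

```python
from collections import Counter

def majority_consensus(passes):
    """Column-wise majority vote over equal-length, pre-aligned passes."""
    return "".join(
        Counter(column).most_common(1)[0][0]  # most frequent base in this column
        for column in zip(*passes)
    )

template = "ACGTACGTAC"
noisy_passes = [
    "ACGTACGTAC",  # error-free pass
    "ACCTACGTAC",  # error at position 2
    "ACGTACGAAC",  # error at position 7
    "TCGTACGTAC",  # error at position 0
    "ACGTAGGTAC",  # error at position 5
]

consensus = majority_consensus(noisy_passes)
print(consensus)              # prints ACGTACGTAC
print(consensus == template)  # prints True
```

Four of the five passes contain an error, but the errors fall at different positions, so every column still has a clear majority and the consensus recovers the template.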

Oxford Nanopore · current through a pore

  • A tiny protein nanopore sits in a membrane
  • Electrical current flows through the pore
  • DNA threads through · each base disrupts the current differently
  • Pattern of current changes = the sequence
No fluorescence · no camera · just electricity.

What Nanopore enables

  • Ultra-long reads · > 1 million bp possible
  • Real-time · data streams as DNA is read
  • Portable · MinION = USB-stick size
  • Used in Ebola outbreaks, Antarctica, and on the ISS

Three platforms · side by side

Platform       Read length         Accuracy       Cost / genome      Best for
Illumina       150 – 300 bp        ~99.9% (Q30)   $600 – $1,000      Routine WGS / WES, variants, clinical
PacBio Revio   15 – 20 kb (HiFi)   ~99.9% (Q30)   ~$1,500 – $2,000   Assembly, structural variants, repeats
Nanopore       up to > 1 Mb        ~95 – 99%      variable           Ultra-long, rapid, field sequencing

How to choose

  • Need variants in 1,000 patients? → Illumina
  • Need a complete assembly? → PacBio HiFi (+ Nanopore)
  • Need an answer in the field, today? → Nanopore
  • Need it all? → Combine platforms
T2T & pangenome both combined PacBio + Nanopore + Illumina.
§ 6

The Cost
Crash

Cost per human genome

Year   Cost             Note
2003   ~$300 million    HGP completion
2007   ~$10 million     First NGS platforms
2010   ~$50,000         NGS scales up
2014   ~$1,000          "$1,000 genome" hit
2024   ~$500 – 600      Illumina · routine

Faster than Moore's Law

  • Computers got cheaper exponentially · Moore's Law
  • NGS got cheaper faster than computers
  • Especially after 2008, when NGS scaled up
  • Genomics had its own curve · steeper than tech
Sequencing went from elite project to clinical tool.
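
The comparison can be checked with the figures from the cost table. The sketch below uses the approximate costs quoted in this lecture, takes $600 for 2024, and assumes a Moore's Law pace of one cost halving every two years; that pace is a common simplification, not a figure from the lecture.

```python
import math

# Approximate cost per human genome in US dollars, from the lecture's cost table.
cost_by_year = {
    2003: 300_000_000,
    2007: 10_000_000,
    2010: 50_000,
    2014: 1_000,
    2024: 600,
}

MOORE_HALVING_YEARS = 2.0  # assumption: cost halves every two years

def fold_reduction(start_year, end_year):
    """How many times cheaper sequencing became between two years."""
    return cost_by_year[start_year] / cost_by_year[end_year]

def halving_time_years(start_year, end_year):
    """Average time for the cost to halve over the interval."""
    halvings = math.log2(fold_reduction(start_year, end_year))
    return (end_year - start_year) / halvings

print(f"2003 -> 2014: {fold_reduction(2003, 2014):,.0f}x cheaper")
print(f"2003 -> 2024: {fold_reduction(2003, 2024):,.0f}x cheaper")
print(f"2007 -> 2014 halving time: {halving_time_years(2007, 2014):.2f} years")

moore_fold = 2 ** ((2014 - 2007) / MOORE_HALVING_YEARS)
print(f"Moore's Law pace over 2007 -> 2014: about {moore_fold:.0f}x cheaper")
```

With these numbers, the cost halved roughly every six months between 2007 and 2014, a 10,000-fold drop, where a two-year halving pace would have given only about an 11-fold drop over the same seven years.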

What cheap sequencing enabled

  • GWAS · thousands of cases vs controls
  • WES diagnosis · 25 – 50% solve rate for rare disease
  • dbSNP · over 1.1 billion variants cataloged
  • T2T & pangenome · only possible at NGS prices
§ 7

Summary

What to take away

  • Sanger was serial · the bottleneck for genome-scale work
  • NGS = massive parallelism · millions of reads at once
  • Vocabulary: read · coverage · depth · Q30
  • Illumina (short, cheap) · PacBio (long, accurate) · Nanopore (ultra-long, portable)
  • Cost: $300M → <$1,000 · faster than Moore's Law
Next lecture

Now that sequencing is cheap,
what do we use it for?

Chapter 6 · Applications of NGS