BSMS205 · Genetics

Next-Generation
Sequencing

Chapter 5 · Part I · The Human Genome

A question to start with

How do you read
3 billion letters
of DNA?

From Chapter 4 to here

Pangenome needed 47+ genomes from diverse populations
Each one assembled to near-T2T quality
That is impossible at $300 million per genome
The pangenome presupposes cheap sequencing

No NGS · no pangenome · no modern genetics.

The audacious claim

500,000×

cost reduction in about two decades

2003: $300 million per genome
2014: $1,000 per genome
Today: under $600 per genome
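
The arithmetic behind the headline number can be checked in a few lines. This is a minimal sketch that uses only the per-genome costs quoted on this slide. The `costs` dictionary and the `fold_reduction` helper are illustrative names, and "today" is taken as the $600 upper bound.

```python
# Per-genome costs quoted on the slide, in US dollars.
# "today" uses the $600 upper bound ("under $600"), so the fold change is a floor.
costs = {"2003": 300_000_000, "2014": 1_000, "today": 600}


def fold_reduction(start_cost: float, end_cost: float) -> float:
    """Return how many times cheaper end_cost is than start_cost."""
    return start_cost / end_cost


for label in ("2014", "today"):
    fold = fold_reduction(costs["2003"], costs[label])
    print(f"2003 -> {label}: {fold:,.0f}x cheaper")
```

With these inputs the ratio is 300,000× by 2014 and 500,000× at $600 per genome.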

Roadmap for today

The Sanger bottleneck · why we needed something new
The NGS idea · massive parallelization
Key concepts · reads, coverage, depth, quality
Illumina · the short-read workhorse
PacBio & Nanopore · long reads
The cost crash · economics of sequencing
Summary & what comes next

§ 1

The Sanger
Bottleneck

Sanger · the gold standard

Developed in the 1970s by Frederick Sanger
Earned Sanger his second Nobel Prize (Chemistry, 1980)
Error rate < 0.1% · extremely accurate
Reads one fragment at a time · serial

The transcription analogy

Imagine transcribing a million books
but you can only copy one sentence at a time.

That is Sanger sequencing for the genome
3 billion letters · one fragment at a time
Even with hundreds of machines running 24/7 · it took years

The bottleneck, visualized

**Figure 1.** Sanger sequencing processes DNA fragments sequentially, one at a time. Reading a whole genome means reading millions of fragments in series, which is extremely slow and expensive at scale.

What the field needed

Sequence not just one genome but thousands
Compare healthy vs sick · find disease variants
Study variation across populations
Diagnose patients in days, not decades

At $300M per genome · this was a fantasy.

§ 2

The NGS Idea
Massive Parallelism

One simple change

Sanger

One fragment at a time
Serial · slow
Years per genome

NGS

Millions at a time
Parallel · fast
Days per genome

Same chemistry idea — radically different scale.

The book analogy returns

One person copying sentences sequentially
becomes a million people, each on a different sentence.

Years collapse into days
Same total work · totally different time
That is the NGS bet

Massive parallelism, visualized

**Figure 2.** Sanger sequences one fragment at a time. NGS sequences millions of fragments simultaneously. The same chemistry idea, scaled by parallelism, dropped sequencing time from years to days.

From idea to industry · 2007 onwards

2007 · first NGS platforms commercially available
2014 · cost falls to ~$1,000 per genome
2024 · ~$600 per Illumina genome
Cost dropped faster than Moore's Law

§ 3

Key Concepts
You'll Need

What is a "read"?

DNA is broken into fragments
Each fragment is sequenced → produces a read
A read = the letters of one fragment
One run = billions of reads
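
To make "fragment, then read" concrete, here is a toy sketch of shotgun sampling. It assumes fixed-length reads drawn from uniformly random positions on a made-up 40-base sequence. Real library preparation is messier, and the function name `shotgun_reads` is purely illustrative.

```python
import random


def shotgun_reads(genome: str, read_length: int, n_reads: int, seed: int = 0):
    """Sample fixed-length reads from random start positions.

    Returns a list of (start, sequence) pairs. Keeping the start position
    is a convenience for this toy; a real sequencer does not report it.
    """
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(0, len(genome) - read_length + 1)
        reads.append((start, genome[start:start + read_length]))
    return reads


# Hypothetical 40-base "genome" used only for illustration.
genome = "ACGTTGCAAGCTTAGGCTAACGTTAGCCGATAGCTAGGCT"
for start, seq in shotgun_reads(genome, read_length=8, n_reads=5):
    print(start, seq)
```

Each printed line is one read: the letters of one fragment, exactly as defined above.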

Read length varies by platform

Platform	Typical read length
Illumina	50 – 300 bp
PacBio Revio (HiFi)	~15,000 – 20,000 bp
Oxford Nanopore	up to > 1,000,000 bp

Illumina = short read. PacBio & Nanopore = long read.

Coverage vs depth

Coverage

What fraction of the genome was sequenced
e.g. 95% coverage = 5% missing

Depth

How many times each base was read
e.g. 30× depth = avg. 30 reads per base

Different questions · often confused.
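
A small calculation can keep the two ideas apart. The sketch below assumes reads are already placed on the genome and are described by hypothetical `(start, length)` pairs on a 20-base toy genome. It reports coverage as the fraction of positions hit at least once and depth as the average number of reads per position.

```python
def per_base_depth(genome_length, reads):
    """Count how many reads overlap each position (0-based starts)."""
    depth = [0] * genome_length
    for start, length in reads:
        for pos in range(start, min(start + length, genome_length)):
            depth[pos] += 1
    return depth


def coverage_fraction(depth):
    """Fraction of positions covered by at least one read."""
    return sum(1 for d in depth if d > 0) / len(depth)


def mean_depth(depth):
    """Average number of reads per position, including uncovered positions."""
    return sum(depth) / len(depth)


# Four 5-base reads on a 20-base toy genome; three of them pile up at the left end.
reads = [(0, 5), (2, 5), (3, 5), (12, 5)]
depth = per_base_depth(20, reads)
print(f"coverage = {coverage_fraction(depth):.0%}, mean depth = {mean_depth(depth):.1f}x")
```

Here the mean depth is 1× but only 65% of the genome is covered, because the reads overlap instead of spreading out evenly.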

Why depth matters

Sequencing is not perfect · errors happen
Need multiple reads per base to call variants confidently
20 of 30 reads agree → real variant
1 of 30 reads disagrees → probably an error
1 of 2 reads disagrees → too few reads to tell

~30× is the common standard for reliable human variant detection.
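
One way to see why more reads help is to ask how likely it is that sequencing errors alone produce the disagreeing reads. The sketch below is a simplification: it assumes every read errs independently at the same per-base rate (0.1%, the Q30 figure), which real data does not strictly obey. The function name is illustrative.

```python
from math import comb


def prob_at_least_k_errors(n_reads: int, k: int, error_rate: float) -> float:
    """Binomial probability that k or more of n_reads are wrong at one position."""
    return sum(
        comb(n_reads, i) * error_rate**i * (1 - error_rate) ** (n_reads - i)
        for i in range(k, n_reads + 1)
    )


error_rate = 0.001  # assumed per-base error rate, i.e. Q30

# 20 of 30 reads showing a difference: could errors alone explain it?
print(prob_at_least_k_errors(30, 20, error_rate))

# 1 of 2 reads showing a difference: errors alone explain this far more easily.
print(prob_at_least_k_errors(2, 1, error_rate))
```

Under these assumptions the first probability is vanishingly small while the second is about 0.2%, so the 30-read case is far stronger evidence for a real variant.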

Reads, coverage, depth · in one figure

**Figure 3.** Coverage = what fraction of the genome was sequenced at all. Depth = how many times each position was read on average. Higher depth → more confidence in distinguishing real variants from sequencing errors.
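
Average depth ties the three terms together: depth ≈ (number of reads × read length) / genome size. The sketch below rearranges that relation to estimate how many reads a target depth requires. It assumes a 3-billion-base genome, every read at one fixed length, and every base usable, so the numbers are rough lower bounds.

```python
import math


def reads_needed(genome_size: int, target_depth: float, read_length: int) -> int:
    """Reads required so that reads * read_length / genome_size >= target_depth."""
    return math.ceil(genome_size * target_depth / read_length)


GENOME_SIZE = 3_000_000_000  # "3 billion letters"

# Read lengths are illustrative picks from the ranges in the platform table.
for platform, read_length in [("Illumina, 150 bp", 150), ("PacBio HiFi, 15 kb", 15_000)]:
    n = reads_needed(GENOME_SIZE, target_depth=30, read_length=read_length)
    print(f"{platform}: {n:,} reads for 30x")
```

Under these assumptions 30× takes about 600 million 150 bp reads, or about 6 million 15 kb reads.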

Quality scores · how confident is the call?

Q score	Confidence	Error rate
Q20	99%	1 / 100
Q30	99.9%	1 / 1,000
Q40	99.99%	1 / 10,000

Q30+ bases = the gold standard for variant calling.
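
The table follows the Phred definition, Q = -10 · log10(P), where P is the probability that the base call is wrong. A short sketch of the conversion in both directions is below, plus decoding of a FASTQ quality string. It assumes the common Phred+33 ASCII encoding, and the function names are illustrative.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score to the probability the base call is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Convert an error probability to a Phred quality score."""
    return -10 * math.log10(p)


def decode_fastq_quality(quality_string: str, offset: int = 33) -> list:
    """Decode a FASTQ quality string, assuming Phred+33 ASCII encoding."""
    return [ord(char) - offset for char in quality_string]


for q in (20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):g}")

# In Phred+33, the characters '5', '?', 'I' encode Q20, Q30, Q40.
print(decode_fastq_quality("5?I"))
```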

§ 4

Illumina
The Workhorse

Sequencing by synthesis

DNA fragments stick to a flat flow cell
Each fragment is amplified into a cluster
Add fluorescently labeled A, C, G, T together · each cluster incorporates one base per cycle
Camera snaps a picture of every cluster
Wash · cleave · repeat · for hundreds of cycles

Why this works at scale

Millions of clusters · one camera shot reads all of them
Every cycle = one new base for every cluster
Hundreds of cycles → reads of ~150 – 300 bp
Billions of bases per run · routine

The flow cell is the parallelism made physical.
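
The cycle logic can be sketched as a loop in which every cycle adds one base to every cluster's read. This is a deliberately simplified toy, not a model of the instrument: it treats each cluster as a known string, ignores that the chemistry reads the complementary strand, and assumes every image is read without error. All names are illustrative.

```python
def sequence_by_synthesis(clusters, n_cycles):
    """Simulate cycles in which each cluster yields exactly one new base.

    clusters: list of template strings, one per cluster on the flow cell.
    Returns one read per cluster, each at most n_cycles bases long.
    """
    reads = ["" for _ in clusters]
    for cycle in range(n_cycles):
        # One "camera shot": the base at this cycle for every cluster at once.
        image = [template[cycle:cycle + 1] for template in clusters]
        # Extend every read in parallel with the base seen in the image.
        reads = [read + base for read, base in zip(reads, image)]
    return reads


# Three hypothetical clusters; a real flow cell holds millions or more.
clusters = ["ACGTAC", "TTGCAA", "GGATCC"]
print(sequence_by_synthesis(clusters, n_cycles=4))
```

The number of cycles sets the read length, while the number of clusters sets how many reads come out of the run.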

Illumina · strengths & limits

Strengths

High throughput
Low cost (~$600/genome)
~0.1% error · Q30+
Mature tools & pipelines

Limits

Short reads (~150 – 300 bp)
Trouble in repeats
PCR can add bias

When you reach for Illumina

Whole-genome sequencing · most clinical & research projects
Whole-exome sequencing · just the protein-coding regions
SNPs & small indels · routine variant calling
RNA-seq · gene expression measurement
Most clinical diagnostics today

§ 5

The Long-Read
Revolution

Why read length matters

**Figure 4.** Short reads cannot span long repetitive regions: reads from inside the repeat look identical, leaving gaps. Long reads span the entire repeat, anchored in unique sequence at both ends.
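
The placement problem in the figure can be reproduced with exact string matching. The sketch below builds a hypothetical reference with a 30-base CAG repeat between two unique flanks, then counts where a short read and a long read could be placed. Real aligners tolerate mismatches and are far more sophisticated, so this only illustrates the ambiguity.

```python
def find_all(reference: str, read: str) -> list:
    """Return every start position where read matches reference exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


# Hypothetical reference: unique flank + tandem repeat + unique flank.
unique_left = "ACGTTGCA"
repeat = "CAG" * 10
unique_right = "TTAGGCAT"
reference = unique_left + repeat + unique_right

short_read = "CAGCAGCAG"  # lies entirely inside the repeat
flank = 4                 # bases of unique sequence kept on each side
long_read = reference[len(unique_left) - flank:len(unique_left) + len(repeat) + flank]

print("short read placements:", find_all(reference, short_read))
print("long read placements: ", find_all(reference, long_read))
```

The short read fits at eight different positions, while the long read fits at exactly one because its ends reach into unique sequence.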

PacBio · Single-Molecule Real-Time

One DNA polymerase at the bottom of a tiny well (ZMW)
One DNA molecule threads through it
Each base added → brief flash of fluorescent light
Detector records flashes in real time
Revio SMRT Cell = 25 million wells (ZMWs) running at once

HiFi reads · long and accurate

Circular DNA template (SMRTbell) · polymerase loops repeatedly
Same molecule sequenced many times
Consensus → 15 – 20 kb at Q30+
Accuracy comparable to Illumina · with 100× longer reads
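
The reason repeated passes raise accuracy can be illustrated with a per-position majority vote. This is a toy stand-in for consensus calling, not PacBio's actual algorithm: it assumes the passes are already aligned, have equal length, and contain only substitution errors.

```python
from collections import Counter


def majority_consensus(passes):
    """Call each position by majority vote across aligned, equal-length passes."""
    return "".join(
        Counter(column).most_common(1)[0][0]
        for column in zip(*passes)
    )


# Five hypothetical passes over the same molecule; two contain a single error.
passes = [
    "ACGTTGCA",
    "ACGATGCA",  # error at position 3
    "ACGTTGCA",
    "ACCTTGCA",  # error at position 2
    "ACGTTGCA",
]
print(majority_consensus(passes))
```

Because the errors fall at different positions in different passes, each one is outvoted and the consensus recovers the underlying sequence.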

Oxford Nanopore · current through a pore

A tiny protein nanopore sits in a membrane
Electrical current flows through the pore
DNA threads through · each base disrupts the current differently
Pattern of current changes = the sequence

No fluorescence · no camera · just electricity.

What Nanopore enables

Ultra-long reads · > 1 million bp possible
Real-time · data streams as DNA is read
Portable · MinION = USB-stick size
Used in Ebola outbreaks, Antarctica, and on the ISS

Three platforms · side by side

Platform	Read length	Accuracy	Cost / genome	Best for
Illumina	150 – 300 bp	~99.9% (Q30)	$600 – $1,000	Routine WGS / WES, variants, clinical
PacBio Revio	15 – 20 kb (HiFi)	~99.9% (Q30)	~$1,500 – $2,000	Assembly, structural variants, repeats
Nanopore	up to > 1 Mb	~95 – 99%	variable	Ultra-long, rapid, field sequencing

How to choose

Need variants in 1,000 patients? → Illumina
Need a complete assembly? → PacBio HiFi (+ Nanopore)
Need an answer in the field, today? → Nanopore
Need it all? → Combine platforms

T2T & pangenome both combined PacBio + Nanopore + Illumina.

§ 6

The Cost
Crash

Cost per human genome

Year	Cost	Note
2003	~$300 million	HGP completion
2007	~$10 million	First NGS platforms
2010	~$50,000	NGS scales up
2014	~$1,000	"$1,000 genome" hit
2024	~$500 – 600	Illumina · routine

Faster than Moore's Law

Computers got cheaper exponentially · Moore's Law
NGS got cheaper faster than computers
Especially after 2008, when NGS scaled up
Genomics had its own curve · steeper than tech

Sequencing went from elite project to clinical tool.
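
The comparison with Moore's Law can be made quantitative using two rows of the cost table. The sketch below treats Moore's Law as a cost halving every two years, which is a common paraphrase and an assumption here, and compares it with the 2007 → 2014 drop from about $10 million to about $1,000. Function names are illustrative.

```python
import math


def halving_time_years(start_cost: float, end_cost: float, years: float) -> float:
    """Years needed for the cost to halve, assuming a steady exponential decline."""
    return years / math.log2(start_cost / end_cost)


def cost_under_moore(start_cost: float, years: float, halving_years: float = 2.0) -> float:
    """Cost after `years` if it halved once every `halving_years` (assumed Moore pace)."""
    return start_cost / 2 ** (years / halving_years)


start_cost, end_cost, years = 10_000_000, 1_000, 7  # 2007 -> 2014, from the table

print(f"Observed halving time: {halving_time_years(start_cost, end_cost, years):.2f} years")
print(f"2014 cost at Moore's pace: ${cost_under_moore(start_cost, years):,.0f}")
```

Under these assumptions sequencing cost halved roughly every half year over that period, and a two-year halving pace would have left a genome costing close to $900,000 in 2014 instead of about $1,000.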

What cheap sequencing enabled

GWAS · thousands of cases vs controls
WES diagnosis · 25 – 50% solve rate for rare disease
dbSNP · over 1.1 billion variants cataloged
T2T & pangenome · only possible at NGS prices

What did cheap sequencing actually buy us? Four big things. One — genome-wide association studies, where you compare thousands of people with and without a disease across millions of variants. Diabetes, schizophrenia, heart disease, hundreds of others now have known genetic risk factors discovered this way. Two — whole-exome sequencing as a clinical diagnostic, which now solves twenty-five to fifty percent of previously undiagnosed rare disease cases. Three — dbSNP, the public variant database, now contains over one point one billion recorded variants. Four — every flagship genomics project, including the T2T finishing of the genome and the pangenome we discussed last lecture, was only possible because sequencing got cheap. Cheap sequencing is the platform underneath modern genetics.

§ 7

Summary

What to take away

Sanger was serial · the bottleneck for genome-scale work
NGS = massive parallelism · millions of reads at once
Vocabulary: read · coverage · depth · Q30
Illumina (short, cheap) · PacBio (long, accurate) · Nanopore (ultra-long, portable)
Cost: $300M → <$1,000 · faster than Moore's Law

Five things to take away. One — Sanger sequencing was accurate but serial, and that serialness was the bottleneck for genome-scale work. Two — NGS replaced serial with massive parallelism, doing millions of reactions at once on a single flow cell. Three — the vocabulary you need is read, coverage, depth, and Q score, with Q thirty as the practical accuracy threshold and thirty-times depth as the human variant calling minimum. Four — three platforms dominate today: Illumina for routine short-read work, PacBio for long accurate reads, and Nanopore for ultra-long and portable. Five — sequencing cost dropped from three hundred million dollars per genome to under one thousand, faster than Moore's Law, and that crash is the foundation for everything else in modern genomics.

Next lecture

Now that sequencing is cheap,
what do we use it for?

Chapter 6 · Applications of NGS