BSMS205 · Genetics

NGS Applications
for Genomics

Chapter 6 · Part I · The Human Genome

A question to start with

If sequencing is cheap,
why not always read
everything?

Two ways to use NGS

WES

Whole-Exome Sequencing
Read only the coding 1–2%
~30–50 Mb captured

WGS

Whole-Genome Sequencing
Read everything
~3.2 Gb · 100% coverage

Why focus on the exome?

85%

of known disease mutations sit in exons

Exons are 1–2% of the genome
But carry the vast majority of pathogenic variants
Why? Exons code for proteins directly

Roadmap for today

What is the exome — and why it matters
WES vs WGS · head to head
How target capture actually works
From reads to variants · the pipeline
Clinical decision tree · which to order
Cost, scale, and the future
Summary & what comes next

§ 1

Where Disease
Variants Hide

Exons, introns, and proteins

Each gene = exons + introns
Exons: 50–200 bp each · code for protein
Introns: often thousands of bp · spliced out
Final mRNA = exons only → translated to protein
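
A minimal sketch of the splicing step above, assuming a made-up toy transcript. The sequence, the coordinates, and the function name `splice` are illustrative only. Exons are written in upper case and introns in lower case, and DNA letters are used throughout for simplicity.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Join exon segments (0-based, end-exclusive) and drop everything between them."""
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


# Toy transcript (hypothetical): three exons separated by two introns.
pre_mrna = "ATGGCC" + "gtaagtttcag" + "GAATTC" + "gtgagtacag" + "TGGTAA"
exons = [(0, 6), (17, 23), (33, 39)]

mrna = splice(pre_mrna, exons)
print(mrna)  # ATGGCCGAATTCTGGTAA
```

Only the upper-case exon segments survive, which is the sense in which WES reads "the part that ends up in the mRNA".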

The exome by the numbers

~30–50
Mb

total exonic sequence

~20,000 protein-coding genes
~180,000 exons total
1–2% of the 3.2 Gb genome
~85% of disease mutations sit here
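
A quick arithmetic check of the numbers above, using only the slide's own figures. The variable names are illustrative.

```python
GENOME_BP = 3.2e9               # ~3.2 Gb genome (from the slide)
EXOME_BP_RANGE = (30e6, 50e6)   # ~30-50 Mb of exonic sequence (from the slide)

low, high = (size / GENOME_BP for size in EXOME_BP_RANGE)
print(f"exome fraction of genome: {low:.1%} to {high:.1%}")
# exome fraction of genome: 0.9% to 1.6%
```

The result sits inside the "1–2%" quoted on the slide.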

§ 2

WES vs WGS
Head to Head

What can each one see?

WES vs WGS coverage diagram — **Figure 1.** WES targets only coding exons (~1–2% of the genome) where most disease variants reside. WGS reads everything — exons, introns, regulatory regions — at higher cost and complexity.

Side by side

Feature	WES	WGS
Coverage	1–2% of genome	100% of genome
Target size	30–50 Mb	3.2 Gb
Data per sample	~6 GB	~90–100 GB
Typical depth	100–150×	30–40×
Cost (2024)	~$400–500	~$600–1,000
Diagnostic yield	25–50%	30–55%
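
A rough sketch of why the data volumes differ so much: the number of sequenced bases is approximately target size × depth. This ignores off-target reads, duplicates, and file compression, so it gives an order of magnitude rather than a file size. The function name is illustrative.

```python
def sequenced_bases(target_bp: float, depth: float) -> float:
    """Approximate bases sequenced = target size x average depth."""
    return target_bp * depth


# Ranges taken from the table: WES 30-50 Mb at 100-150x, WGS 3.2 Gb at 30-40x.
wes_bases = (sequenced_bases(30e6, 100), sequenced_bases(50e6, 150))
wgs_bases = (sequenced_bases(3.2e9, 30), sequenced_bases(3.2e9, 40))

print(f"WES: {wes_bases[0] / 1e9:.1f} to {wes_bases[1] / 1e9:.1f} Gb sequenced")
print(f"WGS: {wgs_bases[0] / 1e9:.0f} to {wgs_bases[1] / 1e9:.0f} Gb sequenced")
```

Even though WES is sequenced several times deeper, WGS still produces more than ten times as many bases, which is the same direction and rough scale as the ~6 GB vs ~90–100 GB row in the table.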

What each one misses

WES misses

Structural variants
Deep intronic variants
Regulatory regions
Repeat expansions

WGS catches all of these

Deletions, duplications, inversions
Cryptic splice sites
Promoter / enhancer mutations
Triplet repeats (with long reads)

§ 3

Target Capture
How WES Works

The capture trick · 7 steps

Fragment genomic DNA
Add biotinylated baits · complementary to exons
Baits hybridize with exonic fragments
Add streptavidin magnetic beads
Pull beads with a magnet → captured exons
Wash away introns & intergenic DNA
Sequence the captured fragments
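
The seven steps above can be sketched as a toy simulation. This is a simplification under stated assumptions: real capture works by sequence hybridization, whereas here a fragment counts as "captured" whenever its coordinates overlap a bait. All coordinates and names (`Interval`, `capture`) are hypothetical.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    start: int
    end: int  # end-exclusive


def overlaps(a: Interval, b: Interval) -> bool:
    return a.start < b.end and b.start < a.end


def capture(fragments: list[Interval], baits: list[Interval]) -> list[Interval]:
    """Keep fragments that overlap at least one bait; the rest are washed away."""
    return [f for f in fragments if any(overlaps(f, b) for b in baits)]


# Toy locus: two exons (one bait each) separated by an intron.
baits = [Interval(100, 250), Interval(900, 1050)]

# Step 1: fragment the DNA (200 bp fragments starting every 100 bp).
fragments = [Interval(s, s + 200) for s in range(0, 1100, 100)]

# Steps 2-6: hybridize, pull down, wash.
captured = capture(fragments, baits)
print(len(fragments), "fragments in,", len(captured), "captured")  # 11 fragments in, 6 captured

# A sample with a homozygous deletion spanning the second exon:
# no fragments from that region exist, so bait 2 has nothing to bind.
deleted = Interval(850, 1100)
sample_with_deletion = [f for f in fragments if not overlaps(f, deleted)]
captured_del = capture(sample_with_deletion, baits)
print(any(overlaps(f, baits[1]) for f in captured_del))  # False
```

The second case produces no error and no signal at exon 2, only an absence of reads, which is why deletions of this kind are easy to miss in WES data.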

The bait-and-bead analogy

Like using a magnet
to pull metal pieces
out of a mixed pile.

The "metal" = exons (tagged with biotin via baits)
The "magnet" = streptavidin beads
Everything non-magnetic gets washed away

Why capture works · and where it fails

Target capture: success vs structural variant failure — **Figure 2.** Top: intact exons → baits bind, sequence captured. Bottom: a deletion removes the exon → nothing for the bait to grab → variant invisible to WES. This is why WES misses 5–10% of pathogenic variants.

The historic case · Miller syndrome

2010 · four affected individuals from three families · rare craniofacial and limb disorder
Strategy: WES on all four · find shared rare variants
Result: novel variants in DHODH
Took months, not years · cost thousands, not millions

The paper that launched the WES era.

§ 4

From Reads
to Variants

The pipeline in one figure

Bioinformatics pipeline: reads to variants — **Figure 3.** QC removes bad bases, alignment maps reads to the reference, variant calling identifies differences, filtering removes artifacts, annotation adds biological context. FASTQ → BAM → VCF.

Step 1 · Quality scores

Q score	Accuracy	Error rate
Q20	99%	1 / 100
Q30	99.9%	1 / 1,000
Q40	99.99%	1 / 10,000

Tools: FastQC reports read quality · a trimmer such as Trimmomatic, fastp, or Cutadapt trims low-quality ends and removes adapters.
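
The table follows from the Phred definition, Q = −10 · log10(error probability). Below is a minimal sketch of that conversion, plus a deliberately simple 3′-end trim. It assumes the common Phred+33 quality encoding. The function names and the trimming rule are illustrative and are not how any particular trimming tool is implemented (real trimmers typically use sliding windows).

```python
import math


def phred_to_error(q: float) -> float:
    """Error probability for a Phred score: p = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Phred score for an error probability: Q = -10 * log10(p)."""
    return -10 * math.log10(p)


def decode_quality(qual: str, offset: int = 33) -> list[int]:
    """Decode a FASTQ quality string (Phred+33 assumed)."""
    return [ord(c) - offset for c in qual]


def trim_3prime(seq: str, qual: str, min_q: int = 20) -> tuple[str, str]:
    """Drop bases from the 3' end until the last base has quality >= min_q."""
    scores = decode_quality(qual)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]


for q in (20, 30, 40):
    print(f"Q{q}: error rate {phred_to_error(q):g}")

# 'I' = Q40, '?' = Q30, '5' = Q20, '#' = Q2 in Phred+33.
trimmed = trim_3prime("ACGTACGT", "IIII?5##")
print(trimmed)  # ('ACGTAC', 'IIII?5')
```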

Step 2 · Alignment to reference

Each read finds its best matching position on the genome
BWA · gold standard for Illumina short reads
Minimap2 · designed for long reads
Output: BAM file (Binary Alignment Map)

Repetitive regions create ambiguity for short reads.
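
A toy illustration of that ambiguity, assuming exact string matching on a made-up reference. Real aligners such as BWA use indexed search, tolerate mismatches, and report a mapping quality, so this is only a sketch of the underlying problem. The function name `find_all` and the sequences are hypothetical.

```python
def find_all(reference: str, read: str) -> list[int]:
    """Return every 0-based position where the read matches the reference exactly."""
    hits = []
    start = reference.find(read)
    while start != -1:
        hits.append(start)
        start = reference.find(read, start + 1)
    return hits


# Made-up reference: unique flanks around a CAG repeat.
reference = "TTGACCA" + "CAGCAGCAGCAGCAG" + "GATTACA"

unique_read = "GACCAC"   # overlaps the unique left flank
repeat_read = "CAGCAG"   # lies entirely inside the repeat

print(find_all(reference, unique_read))  # [2]
print(find_all(reference, repeat_read))  # [7, 10, 13, 16]
```

A read that fits in four places cannot be placed with confidence. Longer reads help because they are more likely to reach into the unique flanking sequence.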

Step 3 · Variant calling

Stack reads at each position · count bases
Compare to reference · compute likelihood of real variant vs error
Need ≥20–30 reads for confidence

chr1:12345 ref=G · 28 reads = A (Q30+) · 2 reads = G
→ likely homozygous A/A
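
A minimal sketch of the likelihood comparison for the example above, assuming a simple binomial model with a single per-base error rate of 0.001 (Q30) and only two alleles. Production callers such as GATK use per-read qualities, priors, and local reassembly, so this illustrates the idea rather than any tool's actual algorithm. The function name is illustrative.

```python
import math


def genotype_log_likelihoods(n_ref: int, n_alt: int, error: float = 0.001) -> dict[str, float]:
    """log10 likelihood of the read counts under each diploid genotype."""
    result = {}
    for name, alt_copies in (("ref/ref", 0), ("ref/alt", 1), ("alt/alt", 2)):
        frac = alt_copies / 2
        # Chance that one read shows the alt base: a true alt read without error,
        # or a true ref read with an error.
        p_alt = frac * (1 - error) + (1 - frac) * error
        result[name] = n_alt * math.log10(p_alt) + n_ref * math.log10(1 - p_alt)
    return result


# Slide example: ref = G, 28 reads show A (alt), 2 reads show G (ref).
ll = genotype_log_likelihoods(n_ref=2, n_alt=28)
best = max(ll, key=ll.get)
for genotype, value in ll.items():
    print(f"{genotype}: log10 L = {value:.2f}")
print("most likely genotype:", best)  # most likely genotype: alt/alt
```

Under these assumptions the homozygous-alt genotype is roughly a thousand times more likely than heterozygous, because two stray G reads are easier to explain as errors than 28 of 30 reads are to explain as a 50/50 draw.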

The variant-calling toolbox

GATK · gold standard from the Broad Institute
FreeBayes · Bayesian variant caller
DeepVariant · deep learning approach (Google)

Output: a VCF file (Variant Call Format).

Step 4 · Filter the artifacts

Minimum depth · ≥10–20 reads
Quality threshold · ≥20 or ≥30
Strand bias · reads from one strand only = artifact
Allele balance · het should be ~50/50
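
The four filters above can be sketched as a single function. The depth and quality thresholds use the lower bounds from the slide. The 0.3–0.7 allele-balance window and the "no alt reads on one strand" rule are assumptions chosen for illustration (real pipelines usually apply a statistical strand-bias test). The `Call` record and its field names are hypothetical, not a VCF parser.

```python
from dataclasses import dataclass


@dataclass
class Call:
    depth: int           # total reads at the site
    qual: float          # variant quality score
    alt_forward: int     # alt-supporting reads on the forward strand
    alt_reverse: int     # alt-supporting reads on the reverse strand
    alt_fraction: float  # alt reads / total reads
    genotype: str        # "het" or "hom"


def failed_filters(
    call: Call,
    min_depth: int = 10,
    min_qual: float = 20.0,
    het_balance: tuple[float, float] = (0.3, 0.7),  # assumed window around 50/50
) -> list[str]:
    """Return the names of the filters a call fails (empty list = pass)."""
    reasons = []
    if call.depth < min_depth:
        reasons.append("low_depth")
    if call.qual < min_qual:
        reasons.append("low_qual")
    if min(call.alt_forward, call.alt_reverse) == 0:
        reasons.append("strand_bias")
    if call.genotype == "het" and not (het_balance[0] <= call.alt_fraction <= het_balance[1]):
        reasons.append("allele_balance")
    return reasons


good = Call(depth=40, qual=55.0, alt_forward=10, alt_reverse=9, alt_fraction=0.48, genotype="het")
bad = Call(depth=8, qual=15.0, alt_forward=6, alt_reverse=0, alt_fraction=0.75, genotype="het")

print(failed_filters(good))  # []
print(failed_filters(bad))   # ['low_depth', 'low_qual', 'strand_bias', 'allele_balance']
```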

Step 5 · Annotate

Location · gene? exon, intron, intergenic?
Effect · synonymous, missense, nonsense, frameshift, splice
Frequency · how common in gnomAD?
Clinical · is it in ClinVar? Pathogenic? VUS?
Tools: VEP, ANNOVAR, SnpEff
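
A minimal sketch of the "Effect" step for a single-codon substitution, using the standard genetic code. It assumes the codon is already known and on the coding strand. Real annotators such as VEP also handle transcripts, strand, splice sites, and frameshifts, none of which appear here. The function name `classify_snv` is illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... (base order T, C, A, G).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snv(ref_codon: str, alt_codon: str) -> str:
    """Classify a codon change by its effect on the encoded amino acid."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop_lost"
    return "missense"


print(classify_snv("GAA", "GAG"))  # synonymous (Glu -> Glu)
print(classify_snv("GAG", "GTG"))  # missense   (Glu -> Val)
print(classify_snv("TGG", "TGA"))  # nonsense   (Trp -> stop)
```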

The complete workflow · 10 days

Days	Step
1–3	DNA extraction · library prep · capture (WES)
4–5	Sequencing on NovaSeq 6000
6–7	QC, alignment, dedup, variant calling
8–10	Filter · annotate · interpret · validate

From sample to genetic diagnosis in about 10 days.

§ 5

Which Test
to Order?

The decision tree

Clinical decision tree for WES vs WGS — **Figure 4.** Start with WES for suspected Mendelian disorders with known coding variants. Move to WGS when WES is negative, structural variants are suspected, or the phenotype is complex.

Start with WES if...

Patient features fit a known Mendelian disorder
Need to screen many genes at once (e.g., hearing loss → 100+ genes)
Cost matters · large studies, resource-limited setting
Phenotype suggests a coding variant · loss of protein function

Move to WGS if...

WES negative · still suspect genetic cause
Suspect a structural variant · multiple anomalies, dysmorphic features
Phenotype is complex or atypical · regulatory variant possible
Want future-proof data for reanalysis

WGS solves 10–30% of WES-negative cases.
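
The two lists above can be sketched as a small rule function. This is a teaching simplification, not a clinical guideline. The parameter names are hypothetical and simply mirror the bullets on the slides.

```python
def choose_test(
    *,
    wes_negative: bool = False,
    suspect_structural_variant: bool = False,
    complex_or_atypical: bool = False,
    wants_reanalysis_data: bool = False,
) -> str:
    """Return "WGS" if any 'Move to WGS if...' condition holds, else "WES"."""
    if any((wes_negative, suspect_structural_variant, complex_or_atypical, wants_reanalysis_data)):
        return "WGS"
    return "WES"


# Known Mendelian presentation, no red flags: start with WES.
print(choose_test())  # WES

# Prior WES was uninformative and a structural variant is suspected.
print(choose_test(wes_negative=True, suspect_structural_variant=True))  # WGS
```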

A real case · the boy who needed WGS

Patient · 8-year-old boy · ID, autism, dysmorphic features
Microarray → normal
WES → variants of uncertain significance, nothing diagnostic
WGS → 150 kb deletion removing 5 exons of a neurodev gene

Too small for microarray · invisible to WES · resolved by WGS.

§ 6

Cost, Scale,
and the Future

The cost gap is closing

Year	WES	WGS	Ratio
2010	~$5,000	~$50,000	10×
2015	~$1,000	~$5,000	5×
2020	~$500	~$1,000	2×
2024	~$400–500	~$600–1,000	~1.5–2×

As the gap closes, the case for WES weakens.

Storage matters at scale

Per sample

WES · ~6 GB
WGS · ~90 GB

1,000 samples

WES · 6 TB
WGS · 90 TB

UK Biobank · 500,000 genomes × 90 GB = 45 PB
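
The scaling above is simple multiplication, sketched here with decimal units (1 TB = 1,000 GB, 1 PB = 1,000,000 GB). The per-sample sizes are the slide's approximations, and the function name is illustrative.

```python
GB_PER_SAMPLE = {"WES": 6, "WGS": 90}  # approximate sizes from the slide


def storage(n_samples: int, gb_per_sample: float) -> str:
    """Total storage as a human-readable string, in decimal units."""
    gb = n_samples * gb_per_sample
    for unit, size in (("PB", 1_000_000), ("TB", 1_000)):
        if gb >= size:
            return f"{gb / size:g} {unit}"
    return f"{gb:g} GB"


print(storage(1_000, GB_PER_SAMPLE["WES"]))    # 6 TB
print(storage(1_000, GB_PER_SAMPLE["WGS"]))    # 90 TB
print(storage(500_000, GB_PER_SAMPLE["WGS"]))  # 45 PB
```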

The interpretation bottleneck

WES

~20,000 variants / person
Mostly coding · known effects
Analyzed in days

WGS

4–5 million variants / person
Most non-coding · uncertain
Takes weeks

"We went from being starved for data to drowning in it."

What's pushing toward universal WGS

Cost convergence · WES and WGS nearly equal
Long reads · PacBio & Nanopore handle SVs and repeats
AI interpretation · machine learning for non-coding variants
Yield · WGS consistently 5–10 percentage points higher than WES

But WES isn't dead yet

Large population studies · storage / cost dominates
Focused disease studies · known coding genes only
Lower-income settings · every dollar counts
High-coverage needs · WES gets 100×+ for the same money

The "WGS now, exome first" strategy

Sequence the whole genome.
Analyze just the exons first.
Keep the rest for later.

Immediate WES-equivalent diagnosis
Expand to non-coding if needed
Reanalyze as databases improve
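
A minimal sketch of the "analyze just the exons first" step, assuming variants and exons on a single chromosome with 1-based inclusive coordinates. The positions, the exon list, and the function names are made up for illustration. A real workflow would restrict a VCF using an exon BED file and dedicated tools.

```python
from bisect import bisect_right


def in_exons(pos: int, exons: list[tuple[int, int]]) -> bool:
    """True if pos falls in any exon (exons sorted, non-overlapping, 1-based inclusive)."""
    starts = [start for start, _ in exons]
    i = bisect_right(starts, pos) - 1
    return i >= 0 and pos <= exons[i][1]


def exome_first(variants: list[int], exons: list[tuple[int, int]]) -> tuple[list[int], list[int]]:
    """Split WGS variant positions into 'analyze now' (exonic) and 'keep for later'."""
    coding = [v for v in variants if in_exons(v, exons)]
    deferred = [v for v in variants if not in_exons(v, exons)]
    return coding, deferred


exons = [(100, 250), (900, 1050)]
variants = [50, 120, 250, 600, 901, 2000]

coding, deferred = exome_first(variants, exons)
print(coding)    # [120, 250, 901]
print(deferred)  # [50, 600, 2000]
```

Nothing is discarded: the deferred list stays on disk and can be revisited if the exonic analysis comes back negative.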

§ 7

Summary

What to take away

WES · 1–2% of genome · 85% of disease variants · cheap & fast
Target capture · biotin baits + magnetic beads · misses SVs
Pipeline · FASTQ → BAM → VCF · QC, align, call, filter, annotate
Decision tree · WES first, WGS for negative or complex cases
Cost gap is closing · field is shifting toward universal WGS

Five things to take away.
One: exome sequencing reads only one to two percent of the genome but captures about eighty-five percent of known disease variants. It is cheap and fast and remains the workhorse of clinical genetics.
Two: the trick that defines exome sequencing is target capture using biotinylated baits and streptavidin magnetic beads, and that trick fundamentally cannot detect structural variants because there is nothing for the bait to grab.
Three: the bioinformatics pipeline goes FASTQ to BAM to VCF, with quality control, alignment, variant calling, filtering, and annotation as the five steps. Memorize that flow.
Four: clinically, you start with exome and escalate to whole genome when exome is negative or when you suspect a structural variant.
Five: the cost gap between WES and WGS is shrinking, and the field is moving toward universal whole-genome sequencing as the default.

Next lecture

We get millions of variants.
How do we decide
which one matters?

Chapter 7 · Variant Annotation & Genomic Databases