BSMS205 · Genetics

NGS Applications
for Genomics

Chapter 6 · Part I · The Human Genome

A question to start with

If sequencing is cheap,
why not always read
everything?

Two ways to use NGS

WES

  • Whole-Exome Sequencing
  • Read only the coding 1–2%
  • ~30–50 Mb captured

WGS

  • Whole-Genome Sequencing
  • Read everything
  • ~3.2 Gb · 100% coverage

Why focus on the exome?

85%
of known disease mutations sit in exons
  • Exons are 1–2% of the genome
  • But carry the vast majority of pathogenic variants
  • Why? Exons code for proteins directly

Roadmap for today

  1. What is the exome — and why it matters
  2. WES vs WGS · head to head
  3. How target capture actually works
  4. From reads to variants · the pipeline
  5. Clinical decision tree · which to order
  6. Cost, scale, and the future
  7. Summary & what comes next

§ 1

Where Disease
Variants Hide

Exons, introns, and proteins

  • Each gene = exons + introns
  • Exons: 50–200 bp each · code for protein
  • Introns: often thousands of bp · spliced out
  • Final mRNA = exons only → translated to protein
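
Splicing can be pictured as string concatenation. This is a minimal sketch: the sequence and the exon coordinates are invented for illustration, and `splice` is a hypothetical helper, not a real library call.

```python
# Toy pre-mRNA: exons in uppercase, introns in lowercase (invented sequence).
pre_mrna = "AUGGCC" + "guaaguuuucag" + "GAUUAC" + "guaugcccuuag" + "UGGUAA"

# Hypothetical exon coordinates as (start, end), 0-based, end-exclusive.
exons = [(0, 6), (18, 24), (36, 42)]

def splice(transcript, exon_coords):
    """Join exon segments in order; everything between them is dropped."""
    return "".join(transcript[start:end] for start, end in exon_coords)

mature_mrna = splice(pre_mrna, exons)
print(mature_mrna)  # AUGGCCGAUUACUGGUAA
```

Only the uppercase exon segments survive, which is the part WES is designed to read.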

The exome by the numbers

~30–50
Mb
total exonic sequence
  • ~20,000 protein-coding genes
  • ~180,000 exons total
  • 1–2% of the 3.2 Gb genome
  • ~85% of disease mutations sit here
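
A quick check of the "1–2%" figure, using only the numbers on this slide. The function name is a hypothetical helper.

```python
GENOME_BP = 3.2e9               # genome size from the slide
EXOME_BP_RANGE = (30e6, 50e6)   # total exonic sequence from the slide

def percent_of_genome(target_bp, genome_bp=GENOME_BP):
    """Share of the genome covered by a target, in percent."""
    return 100 * target_bp / genome_bp

low, high = (percent_of_genome(bp) for bp in EXOME_BP_RANGE)
print(f"Exome = {low:.1f}% to {high:.1f}% of the genome")
# Exome = 0.9% to 1.6% of the genome
```

The result rounds to the 1–2% quoted above.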

§ 2

WES vs WGS
Head to Head

What can each one see?

WES vs WGS coverage diagram
Figure 1. WES targets only coding exons (~1–2% of the genome) where most disease variants reside. WGS reads everything — exons, introns, regulatory regions — at higher cost and complexity.

Side by side

Feature          | WES            | WGS
Coverage         | 1–2% of genome | 100% of genome
Target size      | 30–50 Mb       | 3.2 Gb
Data per sample  | ~6 GB          | ~90–100 GB
Typical depth    | 100–150×       | 30–40×
Cost (2024)      | ~$400–500      | ~$600–1,000
Diagnostic yield | 25–50%         | 30–55%
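
Why WES can afford more depth: the amount of sequence needed scales as target size × mean depth. This rough sketch assumes a 40 Mb target at 100× and a 3.2 Gb genome at 30× (values picked from within the ranges in the table). It counts sequenced bases, not file sizes, so it will not match the "Data per sample" row exactly; file size also depends on off-target reads and compression.

```python
def gigabases_sequenced(target_bp, mean_depth):
    """On-target bases needed for a given mean depth, in gigabases."""
    return target_bp * mean_depth / 1e9

wes = gigabases_sequenced(40e6, 100)   # assumed: 40 Mb target at 100x
wgs = gigabases_sequenced(3.2e9, 30)   # assumed: 3.2 Gb genome at 30x
print(f"WES ~{wes:.0f} Gb, WGS ~{wgs:.0f} Gb, ratio ~{wgs / wes:.0f}x")
```

Even at three times the depth, the exome needs far fewer bases than the genome.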

What each one misses

WES misses

  • Structural variants
  • Deep intronic variants
  • Regulatory regions
  • Repeat expansions

WGS catches all of these

  • Deletions, duplications, inversions
  • Cryptic splice sites
  • Promoter / enhancer mutations
  • Triplet repeats (with long reads)

§ 3

Target Capture
How WES Works

The capture trick · 7 steps

  1. Fragment genomic DNA
  2. Add biotinylated baits · complementary to exons
  3. Baits hybridize with exonic fragments
  4. Add streptavidin magnetic beads
  5. Pull beads with a magnet → captured exons
  6. Wash away introns & intergenic DNA
  7. Sequence the captured fragments
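
The seven steps reduce to one filter: keep a fragment only if it overlaps a bait. This is a toy model with invented coordinates; `Interval`, `overlaps`, and `capture` are hypothetical names, and real hybridization is far less clean than an interval test.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    start: int  # 0-based, inclusive
    end: int    # exclusive

def overlaps(a, b):
    """True if two half-open intervals share at least one base."""
    return a.start < b.end and b.start < a.end

def capture(fragments, baits):
    """Keep fragments that overlap any bait (hybridize, pull down, wash)."""
    return [f for f in fragments if any(overlaps(f, b) for b in baits)]

# Hypothetical coordinates: two exons targeted by baits.
baits = [Interval(1000, 1150), Interval(5000, 5120)]
fragments = [Interval(900, 1100), Interval(2000, 2200),
             Interval(4950, 5100), Interval(8000, 8200)]

captured = capture(fragments, baits)
print(captured)
```

The two fragments that touch an exon are kept; the intronic and intergenic fragments are washed away.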

The bait-and-bead analogy

Like using a magnet
to pull metal pieces
out of a mixed pile.
  • The "metal" = exons (tagged with biotin via baits)
  • The "magnet" = streptavidin beads
  • Everything non-magnetic gets washed away

Why capture works · and where it fails

Target capture: success vs structural variant failure
Figure 2. Top: intact exons → baits bind, sequence captured. Bottom: a deletion removes the exon → nothing for the bait to grab → variant invisible to WES. This is why WES misses 5–10% of pathogenic variants.
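
The failure mode in Figure 2 can be sketched by counting reads per target. All coordinates and read positions are invented, and the patient is assumed to carry a homozygous deletion of the second exon.

```python
# Hypothetical bait targets: name -> (start, end), 0-based, end-exclusive.
targets = {"exon1": (1000, 1150), "exon2": (5000, 5120), "exon3": (9000, 9200)}

def reads_per_target(read_starts, read_len=100):
    """Count reads that overlap each target interval."""
    counts = {name: 0 for name in targets}
    for pos in read_starts:
        for name, (start, end) in targets.items():
            if pos < end and start < pos + read_len:
                counts[name] += 1
    return counts

# Invented read start positions.
control = [950, 1020, 1100, 4980, 5010, 5050, 8990, 9100]
patient = [950, 1020, 1100, 8990, 9100]   # no DNA left at exon2

print(reads_per_target(control))
print(reads_per_target(patient))
```

With zero reads over exon2 there are no mismatches for a variant caller to report, so the deletion never shows up as a called variant.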

The historic case · Miller syndrome

  • 2010 · four affected individuals (two siblings + two unrelated cases) · rare facial disorder
  • Strategy: WES on all four · find genes with rare variants shared by all
  • Result: novel variants in DHODH
  • Took months, not years · cost thousands, not millions
The paper that launched the WES era.
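
The filtering idea behind this study is a set intersection: which genes carry rare variants in every affected case? A minimal sketch with invented data. Only DHODH is a real gene name here; the other gene names, every variant label, and the `min_variants` parameter are placeholders.

```python
# Hypothetical rare, protein-altering variants per affected case,
# stored as gene -> set of variant labels.
cases = {
    "case1": {"DHODH": {"v1", "v2"}, "GENE_A": {"v3"}},
    "case2": {"DHODH": {"v1", "v2"}, "GENE_B": {"v4"}},
    "case3": {"DHODH": {"v5", "v6"}, "GENE_A": {"v7"}},
    "case4": {"DHODH": {"v8", "v9"}, "GENE_C": {"v10"}},
}

def shared_genes(cases, min_variants=1):
    """Genes carrying at least min_variants rare variants in every case."""
    per_case = [
        {gene for gene, variants in genes.items() if len(variants) >= min_variants}
        for genes in cases.values()
    ]
    return set.intersection(*per_case)

print(shared_genes(cases))  # {'DHODH'}
```

Each case alone has several candidates; requiring a hit in all four leaves one gene.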

§ 4

From Reads
to Variants

The pipeline in one figure

Bioinformatics pipeline: reads to variants
Figure 3. QC removes bad bases, alignment maps reads to the reference, variant calling identifies differences, filtering removes artifacts, annotation adds biological context. FASTQ → BAM → VCF.

Step 1 · Quality scores

Q score | Accuracy | Error rate
Q20     | 99%      | 1 / 100
Q30     | 99.9%    | 1 / 1,000
Q40     | 99.99%   | 1 / 10,000

Tools: FastQC to inspect read quality · a trimmer (e.g., Trimmomatic, fastp) to trim low-quality ends and remove adapters.
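
The table follows from the Phred definition, Q = −10 · log10(P), where P is the probability that the base call is wrong. A short sketch of the conversion and of decoding a FASTQ quality string; the Phred+33 offset is assumed (standard for current Illumina output), and the function names are hypothetical.

```python
import math

def phred_to_error(q):
    """Error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

def error_to_phred(p):
    """Phred score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)

def decode_quality(qual_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in qual_string]

for q in (20, 30, 40):
    print(f"Q{q}: error 1 in {round(1 / phred_to_error(q)):,}")

print(decode_quality("5?I"))  # [20, 30, 40]
```

Every 10 points of Q cuts the error rate by a factor of ten.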

Step 2 · Alignment to reference

  • Each read finds its best matching position on the genome
  • BWA · gold standard for Illumina short reads
  • Minimap2 · designed for long reads
  • Output: BAM file (Binary Alignment Map)
Repetitive regions create ambiguity for short reads.
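
A toy illustration of the ambiguity problem. This brute-force scan is not how BWA or Minimap2 work (real aligners index the reference rather than scanning it); it only shows what "best matching position" means and why repeats hurt. The reference sequence is invented.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def best_positions(read, reference):
    """Slide the read along the reference; return (min mismatches, positions)."""
    scores = {
        pos: hamming(read, reference[pos:pos + len(read)])
        for pos in range(len(reference) - len(read) + 1)
    }
    best = min(scores.values())
    return best, [pos for pos, s in scores.items() if s == best]

# Invented reference in which the block ACGTACGT occurs twice.
reference = "TTGACGTACGTCCATGGAACGTACGTAAC"

print(best_positions("CCATGG", reference))    # (0, [11])    unique placement
print(best_positions("ACGTACGT", reference))  # (0, [3, 18]) two equally good placements
```

When a read ties between two positions, the aligner cannot know which copy it came from.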

Step 3 · Variant calling

  • Stack reads at each position · count bases
  • Compare to reference · compute likelihood of real variant vs error
  • Need ≥20–30 reads for confidence
chr1:12345 ref=G · 28 reads = A (Q30+) · 2 reads = G
→ likely homozygous A/A
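
The "likelihood of real variant vs error" step can be sketched with a simple binomial model. This is a teaching simplification, not what GATK or any production caller does: it assumes one error rate for all reads (0.001, i.e. Q30), ignores mapping quality and priors, and drops the binomial coefficient because it is the same for every genotype.

```python
import math

def genotype_log_likelihoods(n_ref, n_alt, error=0.001):
    """log10 likelihood of the pileup under each diploid genotype.

    P(a read shows alt) is error for ref/ref, 0.5 for het, 1 - error for alt/alt.
    """
    p_alt = {"ref/ref": error, "het": 0.5, "alt/alt": 1 - error}
    return {
        gt: n_alt * math.log10(p) + n_ref * math.log10(1 - p)
        for gt, p in p_alt.items()
    }

# Pileup from the slide: ref = G, 28 reads show A, 2 reads show G.
ll = genotype_log_likelihoods(n_ref=2, n_alt=28)
best = max(ll, key=ll.get)

for gt, value in ll.items():
    print(f"{gt}: {value:.2f}")
print("most likely:", best)  # most likely: alt/alt
```

Under these assumptions the homozygous A/A model beats the heterozygous model by about three orders of magnitude, and G/G is effectively ruled out.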

The variant-calling toolbox

  • GATK · gold standard from the Broad Institute
  • FreeBayes · Bayesian variant caller
  • DeepVariant · deep learning approach (Google)

Output: a VCF file (Variant Call Format).

Step 4 · Filter the artifacts

  • Minimum depth · ≥10–20 reads
  • Quality threshold · ≥20 or ≥30
  • Strand bias · reads from one strand only = artifact
  • Allele balance · het should be ~50/50
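
The four filters can be written as one predicate. This is a sketch under stated assumptions: the `Call` record is hypothetical (not a VCF parser), the depth and quality defaults take the lower bounds from the slide, and the 0.3–0.7 window for heterozygous allele balance is an assumed tolerance around 50/50.

```python
from dataclasses import dataclass

@dataclass
class Call:
    depth: int      # total reads at the site
    qual: float     # variant quality score
    alt_fwd: int    # alt-supporting reads on the forward strand
    alt_rev: int    # alt-supporting reads on the reverse strand
    genotype: str   # "het" or "hom"

def passes_filters(c, min_depth=10, min_qual=20, het_balance=(0.3, 0.7)):
    """Apply depth, quality, strand, and allele-balance filters."""
    alt = c.alt_fwd + c.alt_rev
    if c.depth < min_depth or c.qual < min_qual:
        return False
    if alt > 0 and (c.alt_fwd == 0 or c.alt_rev == 0):
        return False  # alt allele seen on one strand only
    if c.genotype == "het":
        low, high = het_balance
        return low <= alt / c.depth <= high
    return True

calls = [
    Call(40, 50.0, 10, 9, "het"),   # balanced het
    Call(8, 50.0, 2, 2, "het"),     # too shallow
    Call(40, 50.0, 18, 0, "het"),   # one strand only
    Call(40, 50.0, 3, 2, "het"),    # skewed allele balance
]
print([passes_filters(c) for c in calls])  # [True, False, False, False]
```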

Step 5 · Annotate

  • Location · gene? exon, intron, intergenic?
  • Effect · synonymous, missense, nonsense, frameshift, splice
  • Frequency · how common in gnomAD?
  • Clinical · is it in ClinVar? Pathogenic? VUS?
  • Tools: VEP, ANNOVAR, SnpEff
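
For single-base changes inside a codon, the "Effect" label comes straight from the standard genetic code. This minimal sketch does not use VEP, ANNOVAR, or SnpEff; `classify_snv` is a hypothetical helper, and frameshift and splice effects are outside its scope.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify_snv(ref_codon, alt_codon):
    """Label a codon change by its effect on the encoded amino acid."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-loss"
    return "missense"

print(classify_snv("GAA", "GAG"))  # synonymous (Glu -> Glu)
print(classify_snv("GAG", "GTG"))  # missense   (Glu -> Val)
print(classify_snv("CAG", "TAG"))  # nonsense   (Gln -> stop)
```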

The complete workflow · 10 days

Days | Step
1–3  | DNA extraction · library prep · capture (WES)
4–5  | Sequencing on NovaSeq 6000
6–7  | QC, alignment, dedup, variant calling
8–10 | Filter · annotate · interpret · validate

From sample to genetic diagnosis in about 10 days.

§ 5

Which Test
to Order?

The decision tree

Clinical decision tree for WES vs WGS
Figure 4. Start with WES for suspected Mendelian disorders with known coding variants. Move to WGS when WES is negative, structural variants are suspected, or the phenotype is complex.

Start with WES if...

  • Patient features fit a known Mendelian disorder
  • Need to screen many genes at once (e.g., hearing loss → 100+ genes)
  • Cost matters · large studies, resource-limited setting
  • Phenotype suggests a coding variant · loss of protein function

Move to WGS if...

  • WES negative · still suspect genetic cause
  • Suspect a structural variant · multiple anomalies, dysmorphic features
  • Phenotype is complex or atypical · regulatory variant possible
  • Want future-proof data for reanalysis
WGS solves 10–30% of WES-negative cases.
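
The two lists above amount to a simple rule: any "move to WGS" criterion switches the order from WES to WGS. A teaching sketch of that logic only; `choose_test` and its flag names are hypothetical, and it makes no attempt to capture real clinical judgment.

```python
def choose_test(wes_negative=False, sv_suspected=False,
                complex_phenotype=False, want_reanalysis=False):
    """Return (test, reasons) following the slide's decision lists."""
    reasons = {
        "WES negative, genetic cause still suspected": wes_negative,
        "structural variant suspected": sv_suspected,
        "complex or atypical phenotype": complex_phenotype,
        "future-proof data wanted for reanalysis": want_reanalysis,
    }
    hits = [reason for reason, flag in reasons.items() if flag]
    if hits:
        return "WGS", hits
    return "WES", ["suspected Mendelian disorder, coding variant likely"]

print(choose_test())
print(choose_test(wes_negative=True, sv_suspected=True))
```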

A real case · the boy who needed WGS

  • Patient · 8-year-old boy · intellectual disability (ID), autism, dysmorphic features
  • Microarray → normal
  • WES → variants of uncertain significance, nothing diagnostic
  • WGS → 150 kb deletion removing 5 exons of a neurodevelopmental gene

Too small for microarray · invisible to WES · resolved by WGS.

§ 6

Cost, Scale,
and the Future

The cost gap is closing

Year | WES       | WGS         | Ratio
2010 | ~$5,000   | ~$50,000    | 10×
2015 | ~$1,000   | ~$5,000     | 5×
2020 | ~$500     | ~$1,000     | 2×
2024 | ~$400–500 | ~$600–1,000 | 1.5×

As the gap closes, the case for WES weakens.

Storage matters at scale

Per sample

  • WES · ~6 GB
  • WGS · ~90 GB

1,000 samples

  • WES · 6 TB
  • WGS · 90 TB

UK Biobank · 500,000 genomes × 90 GB = 45 PB
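
The storage figures are straight multiplication. A small sketch using the per-sample sizes from this slide and decimal units (1 TB = 1,000 GB, 1 PB = 1,000 TB); `storage_tb` is a hypothetical helper.

```python
def storage_tb(n_samples, gb_per_sample):
    """Total storage in TB, using decimal units (1 TB = 1,000 GB)."""
    return n_samples * gb_per_sample / 1_000

GB_PER_SAMPLE = {"WES": 6, "WGS": 90}  # per-sample sizes from the slide

for n in (1_000, 500_000):
    for assay, gb in GB_PER_SAMPLE.items():
        tb = storage_tb(n, gb)
        print(f"{n:>7,} x {assay}: {tb:,.0f} TB ({tb / 1_000:g} PB)")
```

At 500,000 samples the same arithmetic gives 3 PB for WES against 45 PB for WGS.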

The interpretation bottleneck

WES

  • ~20,000 variants / person
  • Mostly coding · known effects
  • Analyzed in days

WGS

  • 4–5 million variants / person
  • Most non-coding · uncertain
  • Takes weeks
"We went from being starved for data to drowning in it."

What's pushing toward universal WGS

  • Cost convergence · WES and WGS nearly equal
  • Long reads · PacBio & Nanopore handle SVs and repeats
  • AI interpretation · machine learning for non-coding variants
  • Yield · WGS consistently 5–10 percentage points higher than WES

But WES isn't dead yet

  • Large population studies · storage / cost dominates
  • Focused disease studies · known coding genes only
  • Lower-income settings · every dollar counts
  • High-coverage needs · WES gets 100×+ for the same money

The "WGS now, exome first" strategy

Sequence the whole genome.
Analyze just the exons first.
Keep the rest for later.
  • Immediate WES-equivalent diagnosis
  • Expand to non-coding if needed
  • Reanalyze as databases improve
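
In practice "exome first" means splitting WGS variants by whether they fall inside an exon. A minimal sketch on a single chromosome with invented exon coordinates and variant positions; `in_exon` and `exome_first` are hypothetical helpers.

```python
from bisect import bisect_right

# Hypothetical exon intervals: sorted, non-overlapping, 0-based, end-exclusive.
EXONS = [(1000, 1150), (5000, 5120), (9000, 9200)]
STARTS = [start for start, _ in EXONS]

def in_exon(pos):
    """Binary-search for the last exon starting at or before pos."""
    i = bisect_right(STARTS, pos) - 1
    return i >= 0 and pos < EXONS[i][1]

def exome_first(variant_positions):
    """Split WGS variants into an analyze-now tier and a keep-for-later tier."""
    now = [p for p in variant_positions if in_exon(p)]
    later = [p for p in variant_positions if not in_exon(p)]
    return now, later

now, later = exome_first([500, 1010, 3000, 5119, 5120, 9100])
print("analyze now:", now)     # [1010, 5119, 9100]
print("keep for later:", later)  # [500, 3000, 5120]
```

The "later" tier costs nothing extra to keep and is already sequenced if the exonic tier comes back negative.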

§ 7

Summary

What to take away

  • WES · 1–2% of genome · 85% of disease variants · cheap & fast
  • Target capture · biotin baits + magnetic beads · misses SVs
  • Pipeline · FASTQ → BAM → VCF · QC, align, call, filter, annotate
  • Decision tree · WES first, WGS for negative or complex cases
  • Cost gap is closing · field is shifting toward universal WGS

Next lecture

We get millions of variants.
How do we decide
which one matters?

Chapter 7 · Variant Annotation & Genomic Databases