BSMS205 · Genetics

Genetic
Variants

Chapter 8 · Part II · Variation
A question to start with

How is your DNA
different from mine?

Where we left off · where we go now

Chapter 7

  • How we annotate variants
  • gnomAD · ClinVar · pLI
  • Frequency & pathogenicity tags

Chapter 8

  • What kinds of variants exist
  • Size · location · effect · frequency
  • From millions to one
The reality of one human genome
4–5 million
variants per person · vs the reference
  • One change every 600–800 bases
  • ~400,000–500,000 indels
  • ~1,000–2,000 structural variants
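The "one change every 600–800 bases" figure follows from simple division. A minimal sketch, assuming a haploid reference length of about 3.1 billion bases (that length is an assumption, not a number from these slides):

```python
# Assumption: haploid reference genome of ~3.1 billion bases (not stated in the slides).
GENOME_BP = 3_100_000_000


def mean_spacing(n_variants: int, genome_bp: int = GENOME_BP) -> float:
    """Average number of reference bases per variant."""
    return genome_bp / n_variants


for n in (4_000_000, 5_000_000):
    print(f"{n:,} variants -> one every {mean_spacing(n):.0f} bases")
```

With 4 and 5 million variants the spacing comes out at 775 and 620 bases, which brackets the range quoted above.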

Roadmap for today

  1. What is a variant? · terminology
  2. The classification framework · four axes
  3. Variants by size · SNV · indel · SV
  4. Variants by effect · the coding five
  5. Non-coding variants · the harder 98%
  6. From millions to one · the filtering funnel
  7. Summary · what comes next
§ 1

What Is
a Variant?

The definition

A genetic variant is a difference in DNA sequence
between an individual's genome and the reference.
  • Reference = an agreed-upon standard sequence
  • Variant = any deviation from it
  • Could be 1 base · could be 1 megabase

One base, one variant — visualized

Reference genome compared to individual genome with a single G to T change
Figure 1. Reference vs individual genome. A single G→T change at one position is one variant. Every human carries 4–5 million such differences.

Variant · mutation · polymorphism · SNP

Term         | Meaning                        | Connotation
Variant      | Any DNA difference             | Neutral · default
Mutation     | DNA change, often rare         | Implies disease
Polymorphism | Common variant (>1%)           | Usually benign
SNP          | Single Nucleotide Polymorphism | "snip" · common, single base

Examples — same word, different meaning

  • "CFTR mutation" · rare disease change · cystic fibrosis
  • "ABO polymorphism" · common blood-type variation
  • "Novel variant" · newly seen · significance unknown
  • "SNP rs334" · the sickle-cell variant in HBB
In this course we say variant by default.
§ 2

Four Axes
of Classification

Four perspectives on one variant

Four-axis classification framework: size, location, effect, frequency
Figure 2. Every variant lives on four axes: by size, by location, by effect, by frequency. The same variant can be described by all four at once.

The four axes — at a glance

Axis      | Categories                     | Why it matters
Size      | SNV · indel · SV               | Drives detection method
Location  | Coding · regulatory · intronic | Drives interpretability
Effect    | LoF · missense · synonymous    | Drives pathogenicity
Frequency | Common · rare                  | Drives clinical filtering
Worked example

One variant · four labels

  • Size: 1 base — SNV
  • Location: coding exon of HBB
  • Effect: missense (Glu→Val)
  • Frequency: common in West Africa, rare globally
That is the sickle-cell variant · rs334.
§ 3

Variants
by Size

Three size classes

Variant types by size: SNV, indel, structural variant
Figure 3. SNVs (1 bp), indels (1–50 bp), structural variants (>50 bp). Three size classes spanning six orders of magnitude.
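The three size classes can be sketched as a small function. This is an illustration only: it assumes alleles are given as explicit REF/ALT strings with a shared leading anchor base for indels (VCF-style), the function name `size_class` is hypothetical, and symbolic SV alleles such as `<DEL>` are not handled.

```python
def size_class(ref: str, alt: str) -> str:
    """Label a variant SNV, indel, or SV using the thresholds in Figure 3.

    Assumes explicit REF/ALT strings; indels share a leading anchor base,
    so the event size is the difference in allele lengths.
    """
    if ref == alt:
        raise ValueError("REF and ALT are identical: not a variant")
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    # Equal-length multi-base substitutions are sized by their length
    # (a simplification; the slides do not cover this case).
    size = abs(len(ref) - len(alt)) or len(ref)
    return "indel" if size <= 50 else "SV"


print(size_class("G", "T"))              # single base change
print(size_class("GCGA", "G"))           # 3 bp deletion
print(size_class("C", "CAAA"))           # 3 bp insertion
print(size_class("A" + "C" * 60, "A"))   # 60 bp deletion
```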

SNVs · single base changes

Reference: ...ATGCGATCG...
Your DNA:  ...ATGCTATCG...
                  ↑
                G→T
  • ~4–5 million per genome
  • One every 600–800 bases
  • The "bread and butter" of variant calling
  • Easiest to detect with WGS or WES

Indels · insertions and deletions

Reference:  ...ATGCGATCG...
Deletion:   ...ATG---TCG...    (CGA deleted, 3 bp)
Insertion:  ...ATGCAAAGATCG... (AAA inserted, 3 bp)
  • Typically 1–50 bp
  • ~400,000–500,000 per genome
  • Slightly harder to call than SNVs
  • Especially tricky in repetitive regions

Structural variants · the big stuff

  • >50 bp · often >1 kb
  • Deletions — can remove whole genes
  • Duplications — extra copies
  • Inversions — segment flipped
  • Translocations — moved between chromosomes
  • CNVs — copy number variation

SVs are rare in count, huge in impact

~1,500
SVs per person
  • Only ~1,500 events
  • But affect more total bases than all SNVs combined
  • One large deletion = millions of changed bases

Detection — different sizes need different tech

Technology                      | SNV                  | Indel           | SV
Short-read WGS (Illumina)       | Excellent            | Good            | Limited
Short-read WES                  | Excellent (in exons) | Good (in exons) | Very poor
Long-read WGS (PacBio/Nanopore) | Good                 | Excellent       | Excellent
§ 4

Coding
Consequences

The 98 / 2 rule

2%
coding · interpretable
98%
non-coding · harder

Most known disease variants live in the 2%.

The genetic code reads in 3-base codons

DNA:     ATG  CAT  GCA  TTG  AAA
          ↓    ↓    ↓    ↓    ↓
Protein: Met  His  Ala  Leu  Lys
  • Each codon = 3 bases = 1 amino acid
  • The code is redundant — many codons per amino acid
  • Reading frame matters · shift it and everything changes
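Codon-by-codon reading can be sketched in a few lines. The table below is the standard genetic code written in the usual T, C, A, G order; the names `CODON_TABLE` and `translate` are illustrative, and the function simply reads every complete codon without stopping at a stop codon (shown as `*`).

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate complete codons from position 0; '*' marks a stop codon."""
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("ATGCATGCATTGAAA"))  # Met His Ala Leu Lys in one-letter code
```

Starting the read one base later, as in `translate("ATGCATGCATTGAAA"[1:])`, gives a completely different peptide, which is the point of the reading-frame bullet.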

The five coding variant types

Type           | DNA change      | Protein effect                  | Pathogenicity
Synonymous     | GAA→GAG         | Glu→Glu (same)                  | Usually benign
Missense       | GAA→GCA         | Glu→Ala                         | Variable
Nonsense       | CAG→TAG         | Gln→STOP                        | Almost always harmful
Frameshift     | 1 or 2 bp indel | Scrambled downstream            | Almost always harmful
In-frame indel | 3, 6, 9 bp      | Gain or loss of 1–3 amino acids | Variable

Synonymous · the silent ones

Reference: GAA  →  Glu (glutamate)
Variant:   GAG  →  Glu (glutamate) — same amino acid
  • ~25% of coding changes
  • Codon redundancy makes them silent at the protein level
  • Usually benign
  • But — can affect splicing or translation efficiency

Missense · the gray zone

Reference: GAA  →  Glu (glutamate, charged)
Variant:   GCA  →  Ala (alanine, hydrophobic)
  • One amino acid → a different one
  • Effect depends on which AA, where in protein
  • Conservative (similar AAs) → often benign
  • Radical (charged ↔ hydrophobic) → often harmful
  • Most variants of uncertain significance (VUS) are missense

Nonsense · the truncators

Reference: CAG  →  Gln (glutamine)
Variant:   TAG  →  STOP
  • Premature stop codon → truncated protein
  • Often degraded by nonsense-mediated decay (NMD)
  • Effectively a knockout of one allele
  • Almost always pathogenic
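The three single-base outcomes (synonymous, missense, nonsense) can be told apart by translating the reference and variant codons and comparing. A minimal sketch, assuming the standard genetic code and a hypothetical helper name `snv_consequence`:

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def snv_consequence(ref_codon: str, alt_codon: str) -> str:
    """Compare the amino acids encoded by the reference and variant codons."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop-lost"  # not one of the types discussed on these slides
    return "missense"


for ref, alt in [("GAA", "GAG"), ("GAA", "GCA"), ("CAG", "TAG")]:
    print(ref, "->", alt, ":", snv_consequence(ref, alt))
```

This only labels the protein-level change; it says nothing about splicing effects or pathogenicity.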

Frameshift vs in-frame · the divisible-by-3 rule

Frameshift vs in-frame indel comparison showing reading frame consequences
Figure 4. Left: 1-bp deletion shifts the reading frame — wrong amino acids and an early stop. Right: 3-bp deletion preserves the frame — one amino acid removed, the rest intact.

Frameshift example · 1 bp deletion

Normal:      ATG | CAT | GCA | TTG | AAA
             Met - His - Ala - Leu - Lys

Frameshift:  ATG | CA_ | GCA | TTG | AAA   ← 1 bp deleted
re-read as:  ATG | CAG | CAT | TGA | AA...
             Met - Gln - His - STOP
  • Wrong amino acids from the deletion onward
  • Usually a premature stop within ~100 bp
  • Functionally a knockout

In-frame example · 3 bp deletion

Normal:      ATG | CAT | GCA | TTG | AAA
             Met - His - Ala - Leu - Lys

In-frame:    ATG | CAT | ___ | TTG | AAA   ← 3 bp deleted
             ATG | CAT | TTG | AAA
             Met - His - Leu - Lys ← one AA missing
  • Frame preserved · downstream reads correctly
  • Just one amino acid removed
  • Often partially functional
  • Example: EGFR in-frame del → hyperactive in lung cancer
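The divisible-by-3 rule and the two worked deletions above can be checked mechanically. A small sketch with illustrative helper names (`indel_frame_effect`, `codons`, `delete`), using the same 15-base toy sequence as the slides:

```python
def indel_frame_effect(indel_len: int) -> str:
    """Apply the divisible-by-3 rule to an indel length in bases."""
    if indel_len <= 0:
        raise ValueError("indel length must be positive")
    return "in-frame" if indel_len % 3 == 0 else "frameshift"


def codons(seq: str) -> list[str]:
    """Split a sequence into complete codons, dropping a trailing partial one."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def delete(seq: str, start: int, length: int) -> str:
    """Remove `length` bases beginning at 0-based index `start`."""
    return seq[:start] + seq[start + length:]


normal = "ATGCATGCATTGAAA"
shifted = delete(normal, 5, 1)    # drop the T of CAT (1 bp)
in_frame = delete(normal, 6, 3)   # drop the GCA codon (3 bp)

print(indel_frame_effect(1), codons(shifted))
print(indel_frame_effect(3), codons(in_frame))
```

The 1 bp deletion re-reads as ATG, CAG, CAT, TGA, ending in a stop codon; the 3 bp deletion leaves ATG, CAT, TTG, AAA with the frame intact.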

LoF · the disease-aligned shorthand

Loss of Function (LoF) =
nonsense + frameshift + canonical splice + large deletion
  • Each typically leaves no functional protein from that allele
  • Premature-stop transcripts are often degraded by nonsense-mediated decay (NMD)
  • The strongest single signal for pathogenicity
  • Combined with high pLI → very strong evidence
§ 5

The Other
98%

Splice site variants · breaking the cut

  • Genes contain introns (cut out) and exons (kept)
  • Splice donor: GT · acceptor: AG
  • Canonical splice variants (GT→AT, AG→AA) → splicing fails
  • Result: exon skipped or intron retained
  • Counted as LoF · example: BRCA2 splice → breast cancer
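The GT...AG rule can be expressed as a simple check on an intron's first and last two bases. This is a toy sketch: the function name and the example intron sequence are made up for illustration, and real splice prediction considers far more sequence context than these four bases.

```python
def has_canonical_splice_sites(intron: str) -> bool:
    """True if the intron starts with GT (donor) and ends with AG (acceptor)."""
    s = intron.upper()
    return len(s) >= 4 and s.startswith("GT") and s.endswith("AG")


# Hypothetical intron sequence, for illustration only.
intron = "GTAAGTCCTTTTCAG"
donor_hit = "AT" + intron[2:]       # GT -> AT at the donor site
acceptor_hit = intron[:-2] + "AA"   # AG -> AA at the acceptor site

print(has_canonical_splice_sites(intron))
print(has_canonical_splice_sites(donor_hit))
print(has_canonical_splice_sites(acceptor_hit))
```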

Regulatory variants · changing expression

Element  | Where              | Effect
Promoter | Near gene start    | Transcription initiation
Enhancer | Up to 100+ kb away | Boosts expression
5' UTR   | Before start codon | Translation efficiency
3' UTR   | After stop codon   | mRNA stability, miRNA

Why non-coding is harder

  • Indirect · effect on expression, not protein
  • Tissue-specific · matters in liver, not brain
  • Context-dependent · only in certain conditions
  • Distance · enhancers can be 100 kb from target
  • Sparse databases · most untested clinically
§ 6

From Millions
to One

The clinical filtering funnel

Variant filtering funnel from 4-5 million variants down to 1-3 causal candidates
Figure 5. Sequential filters take you from ~4–5 million variants to 1–3 likely causal variants. Each filter cuts the list by roughly an order of magnitude or more.

Five filters · about six orders of magnitude

Step | Filter                     | Surviving
0    | All detected variants      | ~4–5 M
1    | Frequency < 1% (rare)      | ~50–100 k
2    | Coding + canonical splice  | ~5–10 k
3    | Phenotype-relevant gene    | ~100–500
4    | LoF or pathogenic missense | ~5–20
5    | Inheritance + family data  | 1–3
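The funnel is a chain of filters applied in order. The sketch below runs the five steps on a handful of invented toy records; every field name (`af`, `location`, `gene`, `effect`, `predicted_damaging`, `de_novo`), the placeholder gene names, and the use of "de novo" as the only inheritance test are assumptions made for illustration, not a clinical pipeline.

```python
RARE_AF = 0.01  # "Frequency < 1%" from the table
CODING_LOCATIONS = {"coding", "canonical_splice"}
LOF_EFFECTS = {"nonsense", "frameshift", "canonical_splice"}


def run_funnel(variants, panel_genes):
    """Apply the five filters in order; return survivors and per-step counts."""
    steps = [
        ("frequency", lambda v: v["af"] < RARE_AF),
        ("location", lambda v: v["location"] in CODING_LOCATIONS),
        ("gene", lambda v: v["gene"] in panel_genes),
        ("effect", lambda v: v["effect"] in LOF_EFFECTS
                             or v.get("predicted_damaging", False)),
        # Simplification: inheritance is reduced to a single de novo flag.
        ("inheritance", lambda v: v.get("de_novo", False)),
    ]
    surviving = list(variants)
    counts = [("all", len(surviving))]
    for name, keep in steps:
        surviving = [v for v in surviving if keep(v)]
        counts.append((name, len(surviving)))
    return surviving, counts


# Toy records, invented for illustration (not real data).
TOY_VARIANTS = [
    {"af": 0.30, "location": "intronic", "gene": "GENE_A", "effect": "none"},
    {"af": 0.001, "location": "intronic", "gene": "GENE_B", "effect": "none"},
    {"af": 0.0005, "location": "coding", "gene": "GENE_C", "effect": "missense"},
    {"af": 0.0002, "location": "coding", "gene": "KMT2D", "effect": "synonymous"},
    {"af": 0.0001, "location": "coding", "gene": "KMT2D", "effect": "missense",
     "predicted_damaging": True, "de_novo": False},
    {"af": 0.0, "location": "coding", "gene": "KMT2D", "effect": "nonsense",
     "de_novo": True},
]

survivors, counts = run_funnel(TOY_VARIANTS, panel_genes={"KMT2D"})
for name, n in counts:
    print(f"{name:12s} {n}")
```

Each toy record is built to drop out at a different step, so the printed counts fall by one per filter until a single candidate remains.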

Worked case · the KMT2D variant

  • Patient: child with developmental delay
  • Filter 1 (frequency): absent in gnomAD ✓
  • Filter 2 (location): coding variant in KMT2D ✓
  • Filter 3 (gene): KMT2D = Kabuki syndrome, pLI = 1.0 ✓
  • Filter 4 (effect): nonsense ✓ (LoF)
  • Filter 5 (inheritance): de novo, autosomal dominant (AD) pattern ✓
  • Conclusion: pathogenic · explains phenotype

Pathogenic · uncertain · benign

Class      | Hallmarks
Pathogenic | LoF in disease gene · rare · high pLI · segregates · de novo
VUS        | Missense · moderate freq · conflicting predictions · novel
Benign     | Common (>1%) · synonymous · seen in healthy

WES vs WGS · what can you see?

Variant type         | WES  | WGS
Coding SNVs / indels | Yes  | Yes
Canonical splice     | Yes  | Yes
Deep intronic        | No   | Yes
Promoter / enhancer  | No   | Yes
Structural variants  | Poor | Limited (better with long reads)

Start with WES for Mendelian. Escalate to WGS if negative.

§ 7

Summary

What to take away

  • Variant = any difference from the reference · ~4–5 M per genome
  • Four axes: size · location · effect · frequency
  • Sizes: SNV · indel · SV — different tech for each
  • Coding 5: synonymous · missense · nonsense · frameshift · in-frame
  • The ÷ 3 rule: frameshift vs in-frame
  • From millions to one: frequency → location → gene → effect → inheritance
Next lecture

Where do these variants
come from?

Chapter 9 · Transmission of Genetic Variants