BSMS205 · Genetics

Genetic
Variants

Chapter 8 · Part II · Variation

A question to start with

How is your DNA
different from mine?

Where we left off · where we go now

Chapter 7

How we annotate variants
gnomAD · ClinVar · pLI
Frequency & pathogenicity tags

Chapter 8

What kinds of variants exist
Size · location · effect · frequency
From millions to one

The reality of one human genome

4–5 million

variants per person · vs the reference

One change every 600–800 bases
~400,000–500,000 indels
~1,000–2,000 structural variants
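
The "one change every 600–800 bases" figure can be checked with a line of arithmetic. This is a minimal sketch that assumes a reference length of about 3.1 billion bases, a round number not stated on the slide; the function name is illustrative.

```python
# Assumption: haploid reference genome of ~3.1 billion bases (not from the slide).
GENOME_BASES = 3_100_000_000


def bases_per_variant(n_variants: int, genome_bases: int = GENOME_BASES) -> float:
    """Average spacing between variants, in bases."""
    return genome_bases / n_variants


for n in (4_000_000, 5_000_000):
    print(f"{n:,} variants -> one every {bases_per_variant(n):.0f} bases")
```

Both ends of the 4–5 million range land inside the 600–800 base spacing quoted above.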

Roadmap for today

What is a variant? · terminology
The classification framework · four axes
Variants by size · SNV · indel · SV
Variants by effect · the coding five
Non-coding variants · the harder 98%
From millions to one · the filtering funnel
Summary · what comes next

§ 1

What Is
a Variant?

The definition

A genetic variant is a difference in DNA sequence
between an individual's genome and the reference.

Reference = an agreed-upon standard sequence
Variant = any deviation from it
Could be 1 base · could be 1 megabase

One base, one variant — visualized

Reference genome compared to individual genome with a single G to T change — **Figure 1.** Reference vs individual genome. A single G→T change at one position is one variant. Every human carries 4–5 million such differences.

Variant · mutation · polymorphism · SNP

Term	Meaning	Connotation
Variant	Any DNA difference	Neutral — default
Mutation	DNA change, often rare	Implies disease
Polymorphism	Common variant (>1%)	Usually benign
SNP	Single Nucleotide Polymorphism	"snip" · common, single base

Examples — same word, different meaning

"CFTR mutation" · rare disease change · cystic fibrosis
"ABO polymorphism" · common blood-type variation
"Novel variant" · newly seen · significance unknown
"SNP rs334" · the sickle-cell variant in HBB

In this course we say variant by default.

§ 2

Four Axes
of Classification

Four perspectives on one variant

Four-axis classification framework: size, location, effect, frequency — **Figure 2.** Every variant lives on four axes: by **size**, by **location**, by **effect**, by **frequency**. The same variant can be described by all four at once.

The four axes — at a glance

Axis	Categories	Why it matters
Size	SNV · indel · SV	Drives detection method
Location	Coding · regulatory · intronic	Drives interpretability
Effect	LoF · missense · synonymous	Drives pathogenicity
Frequency	Common · rare	Drives clinical filtering

Worked example

One variant · four labels

Size: 1 base — SNV
Location: coding exon of HBB
Effect: missense (Glu→Val)
Frequency: common in West Africa, rare globally

That is the sickle-cell variant · rs334.
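
One way to keep the four labels together is a small record type. The sketch below is illustrative only: the class and field names are assumptions, and the values are copied from the worked example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VariantLabels:
    """Four-axis description of one variant (names are hypothetical)."""
    size: str
    location: str
    effect: str
    frequency: str


# The sickle-cell variant rs334, labeled on all four axes.
rs334 = VariantLabels(
    size="SNV (1 base)",
    location="coding exon of HBB",
    effect="missense (Glu->Val)",
    frequency="common in West Africa, rare globally",
)
print(rs334)
```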

§ 3

Variants
by Size

Three size classes

Variant types by size: SNV, indel, structural variant — **Figure 3.** SNVs (1 bp), indels (1–50 bp), structural variants (>50 bp). Three size classes spanning six orders of magnitude.
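
The three size classes can be sketched as a function of the reference and alternate alleles. This assumes the cutoffs given in the caption (1 bp, up to 50 bp, more than 50 bp) and measures event size as the difference in allele length. Equal-length multi-base substitutions are outside these three classes and are rejected here.

```python
def size_class(ref: str, alt: str) -> str:
    """Classify a variant as SNV, indel, or SV from its two alleles."""
    if len(ref) == len(alt):
        if len(ref) == 1:
            return "SNV"
        # Assumption: multi-base substitutions are not handled in this sketch.
        raise ValueError("equal-length multi-base change is outside the three classes")
    event_bp = abs(len(ref) - len(alt))  # inserted or deleted bases
    return "indel" if event_bp <= 50 else "SV"


print(size_class("G", "T"))         # SNV
print(size_class("GCGA", "G"))      # indel (3 bp deleted)
print(size_class("A" * 2001, "A"))  # SV (2,000 bp deleted)
```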

SNVs · single base changes

Reference: ...ATGCGATCG...
Your DNA:  ...ATGCTATCG...
                  ↑
                G→T

~4–5 million per genome
One every 600–800 bases
The "bread and butter" of variant calling
Easiest to detect with WGS or WES
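
The comparison in the diagram can be written as a position-by-position scan. This is a toy sketch on two equal-length strings. Real variant callers work from aligned sequencing reads and quality scores, which this does not model.

```python
def find_snvs(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must have equal length for a base-by-base scan")
    return [
        (i + 1, r, s)
        for i, (r, s) in enumerate(zip(reference, sample))
        if r != s
    ]


print(find_snvs("ATGCGATCG", "ATGCTATCG"))  # [(5, 'G', 'T')]
```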

Indels · insertions and deletions

Reference:  ...ATGCGATCG...
Deletion:   ...ATG---TCG...    (CGA deleted, 3 bp)
Insertion:  ...ATGCAAAGATCG... (AAA inserted, 3 bp)

Typically 1–50 bp
~400,000–500,000 per genome
Slightly harder to call than SNVs
Especially tricky in repetitive regions

Structural variants · the big stuff

>50 bp · often >1 kb
Deletions — can remove whole genes
Duplications — extra copies
Inversions — segment flipped
Translocations — moved between chromosomes
CNVs — copy number variation

SVs are rare in count, huge in impact

~1,500

SVs per person

Only ~1,500 events
But affect more total bases than all SNVs combined
One large deletion = millions of changed bases

Detection — different sizes need different tech

Technology	SNV	Indel	SV
Short-read WGS (Illumina)	Excellent	Good	Limited
Short-read WES	Excellent (in exons)	Good (in exons)	Very poor
Long-read WGS (PacBio/Nanopore)	Good	Excellent	Excellent

§ 4

Coding
Consequences

The 98 / 2 rule

2%

coding · interpretable

98%

non-coding · harder

Most known disease variants live in the 2%.

The genetic code reads in 3-base codons

DNA:     ATG  CAT  GCA  TTG  AAA
          ↓    ↓    ↓    ↓    ↓
Protein: Met  His  Ala  Leu  Lys

Each codon = 3 bases = 1 amino acid
The code is redundant — many codons per amino acid
Reading frame matters · shift it and everything changes
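
Codon-by-codon reading can be sketched with the standard genetic code. The table below is built from the conventional TCAG ordering. One-letter amino acid codes are used (M = Met, H = His, A = Ala, L = Leu, K = Lys) and `*` marks a stop. The function name is illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate from the first base in steps of 3; a trailing partial codon is ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("ATGCATGCATTGAAA"))  # MHALK
```

Redundancy shows up directly in the table: 64 codons map to 20 amino acids plus stop.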

The five coding variant types

Type	DNA change	Protein effect	Pathogenicity
Synonymous	GAA→GAG	Glu→Glu (same)	Usually benign
Missense	GAA→GCA	Glu→Ala	Variable
Nonsense	CAG→TAG	Gln→STOP	Almost always harmful
Frameshift	Indel not divisible by 3 (e.g. 1 or 2 bp)	Scrambled downstream	Almost always harmful
In-frame indel	Indel divisible by 3 (3, 6, 9 bp)	One or more amino acids added or removed	Variable
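
For single-codon substitutions, the first three rows of the table reduce to a comparison of the two translated codons. A minimal sketch, assuming the reference codon is not itself a stop codon; the function name is illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def coding_consequence(ref_codon: str, alt_codon: str) -> str:
    """Label a codon substitution as synonymous, nonsense, or missense."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    return "missense"


for ref, alt in [("GAA", "GAG"), ("GAA", "GCA"), ("CAG", "TAG")]:
    print(ref, "->", alt, coding_consequence(ref, alt))
```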

Synonymous · the silent ones

Reference: GAA  →  Glu (glutamate)
Variant:   GAG  →  Glu (glutamate) — same amino acid

~25% of coding changes
Codon redundancy makes them silent at the protein level
Usually benign
But — can affect splicing or translation efficiency

Missense · the gray zone

Reference: GAA  →  Glu (glutamate, charged)
Variant:   GCA  →  Ala (alanine, hydrophobic)

One amino acid → a different one
Effect depends on which AA, where in protein
Conservative (similar AAs) → often benign
Radical (charged ↔ hydrophobic) → often harmful
Most VUS are missense

Nonsense · the truncators

Reference: CAG  →  Gln (glutamine)
Variant:   TAG  →  STOP

Premature stop codon → truncated protein
Often degraded by nonsense-mediated decay (NMD)
Effectively a knockout of one allele
Almost always pathogenic

Frameshift vs in-frame · the divisible-by-3 rule

Frameshift vs in-frame indel comparison showing reading frame consequences — **Figure 4.** Left: 1-bp deletion shifts the reading frame — wrong amino acids and an early stop. Right: 3-bp deletion preserves the frame — one amino acid removed, the rest intact.

Frameshift example · 1 bp deletion

Normal:      ATG | CAT | GCA | TTG | AAA
             Met - His - Ala - Leu - Lys

Frameshift:  ATG | CA_ | GCA | TTG | AAA   ← 1 bp deleted
re-read as:  ATG | CAG | CAT | TGA | AA...
             Met - Gln - His - STOP

Wrong amino acids from the deletion onward
Usually a premature stop within ~100 bp
Functionally a knockout

In-frame example · 3 bp deletion

Normal:      ATG | CAT | GCA | TTG | AAA
             Met - His - Ala - Leu - Lys

In-frame:    ATG | CAT | ___ | TTG | AAA   ← 3 bp deleted
             ATG | CAT | TTG | AAA
             Met - His - Leu - Lys ← one AA missing

Frame preserved · downstream reads correctly
Just one amino acid removed
Often partially functional
Example: EGFR in-frame del → hyperactive in lung cancer
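
Both deletion examples can be replayed in code: delete bases from the normal sequence, then re-translate from the start. This sketch uses the standard genetic code with one-letter amino acid codes and stops at the first stop codon (`*`). The helper names are illustrative.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate codon by codon, stopping after the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        protein.append(aa)
        if aa == "*":
            break
    return "".join(protein)


def delete(dna: str, start: int, length: int) -> str:
    """Remove `length` bases beginning at 0-based index `start`."""
    return dna[:start] + dna[start + length:]


def indel_frame_effect(length_bp: int) -> str:
    """The divisible-by-3 rule."""
    return "in-frame" if length_bp % 3 == 0 else "frameshift"


normal = "ATGCATGCATTGAAA"
shifted = delete(normal, 5, 1)   # drop the T of CAT
in_frame = delete(normal, 6, 3)  # drop the whole GCA codon

print(translate(normal))    # MHALK
print(translate(shifted))   # MQH*  (Met-Gln-His-STOP)
print(translate(in_frame))  # MHLK  (one amino acid missing)
```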

LoF · the disease-aligned shorthand

Loss of Function (LoF) =
nonsense + frameshift + canonical splice + large deletion

All leave little or no functional protein from that allele
Transcripts with a premature stop are often degraded by nonsense-mediated decay
The strongest single signal for pathogenicity
Combined with high pLI → very strong evidence
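
The shorthand amounts to a set-membership test plus a gene-constraint check. In the sketch below the consequence labels, the function names, and the pLI cutoff of 0.9 are assumptions for illustration. The slide says only "high pLI".

```python
# Consequence labels are hypothetical strings, not a fixed vocabulary.
LOF_CONSEQUENCES = {"nonsense", "frameshift", "canonical_splice", "large_deletion"}


def is_lof(consequence: str) -> bool:
    """True if the consequence falls under the LoF shorthand."""
    return consequence in LOF_CONSEQUENCES


def lof_evidence(consequence: str, pli: float, pli_cutoff: float = 0.9) -> str:
    """Combine the LoF call with gene constraint (cutoff is an assumed convention)."""
    if not is_lof(consequence):
        return "not LoF"
    if pli >= pli_cutoff:
        return "LoF in a constrained gene: very strong evidence"
    return "LoF, but gene tolerance is unclear"


print(lof_evidence("nonsense", pli=1.0))
print(lof_evidence("frameshift", pli=0.1))
print(lof_evidence("missense", pli=1.0))
```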

§ 5

The Other
98%

Splice site variants · breaking the cut

Genes contain introns (cut out) and exons (kept)
Splice donor: GT · acceptor: AG
Canonical splice variants (GT→AT, AG→AA) → splicing fails
Result: exon skipped or intron retained
Counted as LoF · example: BRCA2 splice → breast cancer
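
The GT...AG rule can be checked on an intron's first and last two bases. A minimal sketch on a made-up intron sequence. It ignores the rare introns that use other boundary dinucleotides, and the function name is illustrative.

```python
def canonical_splice_intact(intron: str) -> bool:
    """True if the intron starts with GT (donor) and ends with AG (acceptor)."""
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


intron = "GTAAGTCCTTTCAG"          # hypothetical intron, not a real gene
donor_hit = "AT" + intron[2:]      # GT -> AT at the donor site
acceptor_hit = intron[:-2] + "AA"  # AG -> AA at the acceptor site

print(canonical_splice_intact(intron))        # True
print(canonical_splice_intact(donor_hit))     # False
print(canonical_splice_intact(acceptor_hit))  # False
```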

Regulatory variants · changing expression

Element	Where	Effect
Promoter	Near gene start	Transcription initiation
Enhancer	Up to 100+ kb away	Boosts expression
5' UTR	Before start codon	Translation efficiency
3' UTR	After stop codon	mRNA stability, miRNA

Why non-coding is harder

Indirect · effect on expression, not protein
Tissue-specific · matters in liver, not brain
Context-dependent · only in certain conditions
Distance · enhancers can be 100 kb from target
Sparse databases · most untested clinically

§ 6

From Millions
to One

The clinical filtering funnel

Variant filtering funnel from 4-5 million variants down to 1-3 causal candidates — **Figure 5.** Sequential filters take you from ~4–5 million variants to 1–3 likely causal variants. Most filters remove roughly an order of magnitude.

Five filters · about six orders of magnitude

Step	Filter	Surviving
0	All detected variants	~4–5 M
1	Frequency < 1% (rare)	~50–100 k
2	Coding + canonical splice	~5–10 k
3	Phenotype-relevant gene	~100–500
4	LoF or pathogenic missense	~5–20
5	Inheritance + family data	1–3
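
The funnel can be sketched as five predicates applied in order. Everything in this example is hypothetical: the field names, the gene names, the six toy variants, and the simplification of the inheritance step to "de novo under a dominant model". It shows the mechanics only and is not a clinical pipeline.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    name: str
    allele_freq: float
    location: str            # e.g. "coding", "canonical_splice", "intronic"
    gene: str
    effect: str              # e.g. "nonsense", "missense", "synonymous"
    predicted_damaging: bool
    de_novo: bool


PHENOTYPE_GENES = {"GENE1", "GENE2", "GENE3"}  # hypothetical gene panel
LOF_EFFECTS = {"nonsense", "frameshift", "canonical_splice"}

FILTERS = [
    ("frequency < 1%", lambda v: v.allele_freq < 0.01),
    ("coding + canonical splice", lambda v: v.location in {"coding", "canonical_splice"}),
    ("phenotype-relevant gene", lambda v: v.gene in PHENOTYPE_GENES),
    ("LoF or pathogenic missense",
     lambda v: v.effect in LOF_EFFECTS or (v.effect == "missense" and v.predicted_damaging)),
    ("inheritance", lambda v: v.de_novo),  # simplified: dominant, de novo
]


def run_funnel(variants):
    """Apply each filter in turn; return survivors and the count after each step."""
    surviving = list(variants)
    counts = [len(surviving)]
    for _label, keep in FILTERS:
        surviving = [v for v in surviving if keep(v)]
        counts.append(len(surviving))
    return surviving, counts


toy = [
    Variant("v1", 0.20, "coding", "GENE1", "missense", False, False),
    Variant("v2", 0.001, "intronic", "GENE1", "none", False, False),
    Variant("v3", 0.001, "coding", "GENE9", "nonsense", True, False),
    Variant("v4", 0.001, "coding", "GENE2", "synonymous", False, False),
    Variant("v5", 0.0, "coding", "GENE2", "missense", True, False),
    Variant("v6", 0.0, "coding", "GENE3", "nonsense", True, True),
]

survivors, counts = run_funnel(toy)
print(counts)                        # [6, 5, 4, 3, 2, 1]
print([v.name for v in survivors])   # ['v6']
```

Each toy variant is built to fall out at a different step, so the counts drop by one per filter. On a real genome the early filters remove far more than the later ones, as the table shows.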

Worked case · the KMT2D variant

Patient: child with developmental delay
Filter 1 (freq): absent in gnomAD ✓
Filter 2 (effect): nonsense in KMT2D ✓ (LoF)
Filter 3 (gene): KMT2D = Kabuki syndrome, pLI = 1.0 ✓
Filter 4 (inheritance): de novo, AD pattern ✓
Conclusion: pathogenic · explains phenotype

A real case makes the funnel concrete. A child presents with developmental delay, and whole-exome sequencing finds a novel variant in KMT2D. Apply the filters in turn. Filter 1, frequency: the variant is absent from gnomAD's roughly 140,000 individuals, so it is rare. Filter 2, effect: it is a nonsense variant that creates a premature stop, so it counts as loss of function. Filter 3, gene: KMT2D is the Kabuki syndrome gene, and its pLI score of 1.0 means it is extremely intolerant to loss of function, so both the phenotype and the gene constraint match. Filter 4, inheritance: the variant is de novo, present in neither parent, and Kabuki syndrome is autosomal dominant. Conclusion: pathogenic. This variant explains the patient's phenotype. Four checks, one definitive answer.

Pathogenic · uncertain · benign

Class	Hallmarks
Pathogenic	LoF in disease gene · rare · high pLI · segregates · de novo
VUS	Missense · moderate freq · conflicting predictions · novel
Benign	Common (>1%) · synonymous · seen in healthy individuals

WES vs WGS · what can you see?

Variant type	WES	WGS
Coding SNVs / indels	Yes	Yes
Canonical splice	Yes	Yes
Deep intronic	No	Yes
Promoter / enhancer	No	Yes
Structural variants	Poor	Limited (better with long reads)

Start with WES for Mendelian. Escalate to WGS if negative.

§ 7

Summary

What to take away

Variant = any difference from the reference · ~4–5 M per genome
Four axes: size · location · effect · frequency
Sizes: SNV · indel · SV — different tech for each
Coding 5: synonymous · missense · nonsense · frameshift · in-frame
The ÷ 3 rule: frameshift vs in-frame
From millions to one: frequency → location → gene → effect → inheritance

Next lecture

Where do these variants
come from?

Chapter 9 · Transmission of Genetic Variants