BSMS205 · Genetics

Annotation &
Databases

Chapter 7 · Part I · The Human Genome

A question to start with

You sequenced a patient.
You got 5 million variants.
Now what?

The data deluge · per genome

Variant type	Count per person
SNVs · single nucleotide variants	~4 – 5 million
Small indels · insertions / deletions	~400,000 – 500,000
Structural variants · large rearrangements	thousands
Causal variant for a rare disease	typically 1

A needle-in-a-haystack problem at scale.

The core operation

Annotation
adds meaning to each variant.

Raw VCF: chr17:43,124,027 · T → G
Annotated: BRCA1 · missense · p.Cys61Gly · pathogenic

Four questions for every variant

LOCATION — gene? exon? regulatory region?
FUNCTION — protein change? splicing effect?
POPULATION — common or rare?
CLINICAL — known disease association?

Every annotation pipeline is some answer
to these four questions.

Roadmap for today

Annotation: the four core questions
The three-tier database system
HGVS nomenclature & a worked variant
Computational pathogenicity predictors
ACMG/AMP classification framework
Biobanks & population diversity
A clinical case · the filtering funnel

§ 1

The Four
Core Questions

The four-question frame

Four core questions of variant annotation: location, function, population, clinical — **Figure 1.** Every variant must be evaluated across four dimensions: genomic location, functional consequence, population frequency, and clinical relevance.

Question 1 · Location

Coding exon — most likely to disrupt protein
Intron — usually neutral, except splice sites
5' / 3' UTR — translation, stability
Promoter / enhancer — regulatory
Intergenic — often unannotated function

Question 2 · Function

Class	Effect	Severity
Stop-gain	Premature stop · truncated protein	HIGH
Frameshift	Reading frame disrupted	HIGH
Splice site	Exon skipping · intron retention	HIGH
Missense	One amino acid changed	MODERATE
Synonymous	Silent · same amino acid	LOW
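
The severity table works as a lookup. A variant can overlap several transcripts and so carry several consequences, in which case the most severe one is usually reported. A minimal Python sketch, assuming the classes and tiers from the table above; `IMPACT`, `RANK` and `most_severe` are illustrative names, not any tool's API:

```python
# Impact tiers copied from the table above; the key names are illustrative.
IMPACT = {
    "stop_gain": "HIGH",
    "frameshift": "HIGH",
    "splice_site": "HIGH",
    "missense": "MODERATE",
    "synonymous": "LOW",
}
RANK = {"HIGH": 3, "MODERATE": 2, "LOW": 1}


def most_severe(consequences):
    """Return the (consequence, tier) pair with the highest tier.

    Ties keep the first consequence listed.
    """
    pairs = ((c, IMPACT[c]) for c in consequences)
    return max(pairs, key=lambda pair: RANK[pair[1]])


print(most_severe(["synonymous", "missense"]))  # ('missense', 'MODERATE')
```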

Question 3 · Population frequency

Common variant (>1%) → not a severe rare disease cause
Rare variant (<0.1%) → candidate for rare disease
Singleton or absent → flag for follow-up

A variant common in the population
is very unlikely to cause severe childhood disease.
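
The frequency rules on this slide can be sketched as a small binning function. The cutoffs (>1% and <0.1%) are taken from the slide. The "intermediate" bin for frequencies in between is an assumption added to make the function total, and `frequency_bin` is a hypothetical name:

```python
def frequency_bin(af):
    """Bin an allele frequency (a fraction, e.g. 0.01 for 1%).

    af is None when the variant is absent from the reference database.
    """
    if af is None or af == 0:
        return "absent: flag for follow-up"
    if af > 0.01:
        return "common: not a severe rare disease cause"
    if af < 0.001:
        return "rare: candidate for rare disease"
    # Not covered by the slide; assumed here to need case-by-case review.
    return "intermediate: judge in context"


print(frequency_bin(0.05))    # common: not a severe rare disease cause
print(frequency_bin(0.0005))  # rare: candidate for rare disease
print(frequency_bin(None))    # absent: flag for follow-up
```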

Question 4 · Clinical

Has this exact variant been seen before?
If yes — was it linked to a disease?
If linked — by how many labs, on what evidence?

Question four is where ClinVar lives.

Tools that answer these questions

Tool	Strength	Output example
VEP (Ensembl)	Comprehensive location · regulatory regions	"CFTR exon 11, stop-gain, p.Gly542*"
ANNOVAR	Multi-database integration · filtering	Adds gnomAD frequencies, conservation, disease links
SnpEff	Automatic impact tier	HIGH / MODERATE / LOW / MODIFIER

In practice, researchers run multiple tools for cross-validation.

§ 2

The Three-Tier
Database System

The hierarchy

Three-tier variant database system: dbSNP, gnomAD, ClinVar — **Figure 2.** dbSNP catalogs known variants, gnomAD provides population frequencies, ClinVar offers clinical interpretations. Each tier answers a different question.

Tier 1 · dbSNP — the catalog

NCBI · maintained since 1998
Over 1.1 billion variant sites
Each variant gets a unique rs number — e.g. rs429358
Question answered: "Has this variant been documented?"

A common variant with an old rs ID is usually benign, but the rs number itself carries no clinical meaning.

Tier 2 · gnomAD — the frequency reference

Aggregates sequencing data from 140,000+ individuals (the v2.1 release; later releases are larger)
Allele frequencies by population · African, East Asian, European, etc.
Question answered: "How common is this variant?"
Provides constraint scores — pLI, LOEUF, missense Z

gnomAD logic in practice

Allele frequency > 1%

Probably not the cause
Too common for severe disease
De-prioritize

Absent from 140,000+

Suspicious — investigate
Either ultra-rare or novel
High candidate priority

Why population diversity matters

A variant common in East Asians
but absent from European data
can be wrongly labeled pathogenic.

Reference data must match patient ancestry
Old gnomAD versions were European-skewed
Korean Variant Archive (KOVA): 22.8% better filtering for Korean patients
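
One way to guard against this is to filter on the highest frequency seen in any population, not on a single reference group. A minimal sketch with made-up frequencies; the population labels, the 0.1% threshold and both function names are assumptions for illustration:

```python
def max_population_af(pop_afs):
    """Highest allele frequency across the populations we have data for."""
    return max(pop_afs.values(), default=0.0)


def passes_rarity_filter(pop_afs, threshold=0.001):
    """Keep a variant only if it is rare in every population available."""
    return max_population_af(pop_afs) < threshold


# Made-up example: absent in European data, 2% in East Asian data.
european_only = {"european": 0.0}
with_east_asian = {"european": 0.0, "east_asian": 0.02}

print(passes_rarity_filter(european_only))    # True  (looks like a candidate)
print(passes_rarity_filter(with_east_asian))  # False (too common to be causal)
```

The same variant flips from "candidate" to "filtered out" once matching reference data are included.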

Tier 3 · ClinVar — clinical interpretation

NCBI clinical variant database
Links variants → diseases with evidence-based classifications
Submitted by labs, hospitals, expert panels worldwide
Question answered: "Is this variant disease-causing?"

The five-tier classification

Class	Meaning
Pathogenic	Causes disease · strong evidence
Likely pathogenic	Probably causes disease · > 90% certainty
VUS	Variant of Uncertain Significance
Likely benign	Probably harmless · > 90% certainty
Benign	Harmless · strong evidence

The VUS challenge

Many variants lack sufficient evidence
Cannot return as pathogenic or benign
Patients often recheck annually as data accrues
VUS reclassification is the slow grind of clinical genomics

Today's VUS is tomorrow's diagnosis
— or tomorrow's reassurance.

Three databases · three questions

Tier	Database	Asks	Scale
1	dbSNP	Has it been seen?	1.1 billion
2	gnomAD	How common is it?	140,000+ samples
3	ClinVar	Does it cause disease?	3 million entries

§ 3

HGVS &
A Worked Variant

HGVS · the variant naming standard

Human Genome Variation Society nomenclature
One unambiguous name for any variant
Three coordinate systems: g. genomic · c. coding · p. protein

NM_007294.3:c.181T>G → p.Cys61Gly
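
Because the name is structured, it can be split into its parts mechanically. The sketch below handles only the simplest case: a single-base coding substitution on a versioned transcript, written without spaces. Duplications, deletions, intronic offsets and protein-level names are out of scope, and the pattern and function name are illustrative, not a full HGVS parser:

```python
import re

# Matches e.g. "NM_007294.3:c.181T>G" (simple c. substitutions only).
HGVS_SUB = re.compile(
    r"^(?P<transcript>[A-Z]{2}_\d+\.\d+):c\."
    r"(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_coding_substitution(name):
    """Split a simple c. substitution into transcript, position, ref, alt."""
    match = HGVS_SUB.match(name)
    if match is None:
        raise ValueError(f"not a simple c. substitution: {name!r}")
    parts = match.groupdict()
    parts["pos"] = int(parts["pos"])
    return parts


print(parse_coding_substitution("NM_007294.3:c.181T>G"))
# {'transcript': 'NM_007294.3', 'pos': 181, 'ref': 'T', 'alt': 'G'}
```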

Worked example · BRCA1 c.181T>G

Step 1 · Location

Gene: BRCA1
Exon 5
RING domain

Step 2 · Function

Missense
Cys61 → Gly
Loses zinc-binding cysteine

Worked example · continued

Step 3 · Population

gnomAD: absent
0 / 140,000+ individuals

Step 4 · Clinical

ClinVar: Pathogenic
30+ submissions
Founder mutation in some populations

Verdict: Clinically actionable pathogenic variant — informs cancer screening & family testing.

§ 4

When ClinVar
Says Nothing

The novel variant problem

Most variants found in patients are not in ClinVar
Especially missense — many possible amino acid swaps
Need to predict functional effect in silico

For an absent or single-submission variant,
computational scores are the next step.

Three modern predictors

Tool	Score	Threshold	Best for
CADD	Phred-scaled	>20 = top 1% · >30 = top 0.1%	General variant impact
REVEL	0 – 1	>0.5 likely pathogenic	Missense in Mendelian disease
AlphaMissense	Benign / Ambiguous / Pathogenic	3-class output	Structure-based · 71M variants
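
The thresholds in this table can be applied as simple votes. A minimal sketch, assuming the cutoffs shown above (CADD > 20, REVEL > 0.5, AlphaMissense class "Pathogenic"); an "Ambiguous" AlphaMissense call is counted as not damaging here, which is a simplification, and the function names are hypothetical:

```python
def predictor_votes(cadd=None, revel=None, alphamissense=None):
    """Return {tool: True/False} for each score supplied; True = damaging."""
    votes = {}
    if cadd is not None:
        votes["CADD"] = cadd > 20
    if revel is not None:
        votes["REVEL"] = revel > 0.5
    if alphamissense is not None:
        votes["AlphaMissense"] = alphamissense.lower() == "pathogenic"
    return votes


def all_agree_damaging(votes):
    """True only if at least one tool voted and every vote is damaging."""
    return len(votes) > 0 and all(votes.values())


votes = predictor_votes(cadd=32, revel=0.89)
print(votes)                      # {'CADD': True, 'REVEL': True}
print(all_agree_damaging(votes))  # True
```

Agreement between tools is supporting evidence only; it does not classify a variant on its own.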

An important limitation

These are predictions, not proof.
Functional studies or clinical evidence
are required for confirmation.

Even AlphaMissense gets some known pathogenic variants wrong
Use as one piece of evidence — never alone
ACMG framework explicitly down-weights computational evidence

Integrating evidence

Integration of multiple evidence sources for variant interpretation — **Figure 3.** Clinical interpretation requires convergence of population data, databases, computational predictions, and patient phenotype.

§ 5

The ACMG/AMP
Classification Framework

The ACMG idea

Define standardized evidence codes · ~28 of them
Weight each code: very strong · strong · moderate · supporting
Combine codes → final 5-tier classification
Two axes: pathogenic evidence vs benign evidence

Key pathogenic evidence codes

Code	Weight	Meaning
PVS1	Very strong	Null variant in a LoF-disease gene
PS1	Strong	Same amino-acid change as an established pathogenic variant, via a different nucleotide change
PS2	Strong	De novo · confirmed parents
PM1	Moderate	Critical functional domain · hotspot
PM2	Moderate	Absent / ultra-rare in population data
PP3	Supporting	Multiple computational lines agree

Key benign evidence codes

Code	Weight	Meaning
BA1	Stand-alone	Allele frequency > 5% in any population
BS1	Strong	Higher frequency than expected for the disease
BS3	Strong	Functional studies show no damaging effect
BP4	Supporting	Computational lines agree: benign

Combining codes · example

BRCA1 c.181T>G accumulates:

PS1 · same amino acid change as known pathogenic
PM1 · RING domain · critical for function
PM2 · absent in gnomAD
PP3 · CADD, REVEL, AlphaMissense all agree

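1 strong + 2 moderate + 1 supporting → Likely pathogenic under the 2015 ACMG/AMP combining rules; one more supporting code, or a second strong code such as functional data, would reach Pathogenic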
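
The combining step is rule-based, so it can be written out. The sketch below encodes the pathogenic-side combinations from the 2015 ACMG/AMP guideline. It takes counts of codes at each weight, ignores benign evidence and conflicting-evidence handling, and uses an illustrative function name:

```python
def classify_pathogenic(very_strong=0, strong=0, moderate=0, supporting=0):
    """Apply the 2015 ACMG/AMP pathogenic-side combining rules to code counts."""
    vs, s, m, p = very_strong, strong, moderate, supporting

    pathogenic = (
        (vs >= 1 and (s >= 1 or m >= 2 or (m >= 1 and p >= 1) or p >= 2))
        or s >= 2
        or (s >= 1 and (m >= 3 or (m >= 2 and p >= 2) or (m >= 1 and p >= 4)))
    )
    if pathogenic:
        return "Pathogenic"

    likely_pathogenic = (
        (vs >= 1 and m >= 1)
        or (s >= 1 and m >= 1)
        or (s >= 1 and p >= 2)
        or m >= 3
        or (m >= 2 and p >= 2)
        or (m >= 1 and p >= 4)
    )
    if likely_pathogenic:
        return "Likely pathogenic"

    return "Uncertain significance"


# PS1 + PM1 + PM2 + PP3
print(classify_pathogenic(strong=1, moderate=2, supporting=1))  # Likely pathogenic
# The same codes plus one more supporting code
print(classify_pathogenic(strong=1, moderate=2, supporting=2))  # Pathogenic
```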

§ 6

Biobanks &
Genome Annotation

GENCODE vs RefSeq

Feature	GENCODE	RefSeq
Coverage	~45,000 genes (coding + non-coding)	~20,000 protein-coding
Curation	Manual + automated	Primarily manual
IDs	`ENSG / ENST`	`NM_ / NP_`
Best for	Research · non-coding RNA	Clinical diagnostics

Clinical reports use RefSeq: NM_007294.3:c.5266dup

Biobanks · genotype + phenotype at scale

Large collections of DNA + health records
Question: "Do carriers of variant X have higher disease rates?"
Power: link variation to outcomes at population scale
The substrate of GWAS, drug discovery, drug response

Major biobanks

Biobank	Scale	Population	Strength
UK Biobank	500,000	UK adults	NHS records + imaging
All of Us	245,000+	US · diversity focus	Health disparities
KOVA	5,305	Korean	East Asian reference
FinnGen	500,000+	Finnish	Founder-effect rare variants
ToMMo	150,000	Japanese · 3 generations	Gene × environment longitudinal

Why diversity is methodology

Population-specific data
prevents misclassification.

Old GWAS literature: ~80% European participants
Polygenic risk scores trained in Europeans lose much of their accuracy in people of African ancestry
Korean rare-disease screening needs Korean reference data

§ 7

A Clinical Case
from Millions to One

The patient

4-year-old · developmental delay
Recurrent seizures from infancy
Abnormal brain MRI
Three years of diagnostic odyssey · no answer
Whole-exome sequencing ordered

The filtering funnel

The variant filtering funnel from millions of variants to one causal variant — **Figure 4.** Sequential filtering through annotation, population frequency, clinical databases, computational prediction, and phenotype matching narrows millions of variants to one causal candidate.

The funnel in numbers

Step	Filter	Variants left
0	WES total	25,000
1	Annotation: protein-affecting	~200
2	gnomAD < 0.5%	~40
3	Genes matching phenotype	3
4	CADD/REVEL/AlphaMissense agreement	1
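
The funnel is a chain of filters applied in order, with a count kept after each step. A minimal sketch on toy data: the dictionary keys, the placeholder gene names other than SCN1A, and the `run_funnel` helper are all made up for illustration, and the 0.5% cutoff is the one from step 2 above.

```python
def run_funnel(variants, phenotype_genes, max_af=0.005):
    """Apply the four filters in order; return survivors and per-step counts."""
    steps = [
        ("protein-affecting", lambda v: v["impact"] in {"HIGH", "MODERATE"}),
        ("gnomAD AF below cutoff", lambda v: v["af"] < max_af),
        ("gene matches phenotype", lambda v: v["gene"] in phenotype_genes),
        ("predictors agree", lambda v: v["predictors_agree"]),
    ]
    counts = [("start", len(variants))]
    for label, keep in steps:
        variants = [v for v in variants if keep(v)]
        counts.append((label, len(variants)))
    return variants, counts


# Toy input: four made-up variants, not real patient data.
toy = [
    {"gene": "SCN1A", "impact": "MODERATE", "af": 0.0, "predictors_agree": True},
    {"gene": "GENE_A", "impact": "LOW", "af": 0.0, "predictors_agree": False},
    {"gene": "GENE_B", "impact": "HIGH", "af": 0.2, "predictors_agree": True},
    {"gene": "GENE_C", "impact": "MODERATE", "af": 0.0001, "predictors_agree": True},
]

survivors, counts = run_funnel(toy, phenotype_genes={"SCN1A"})
for label, n in counts:
    print(f"{label}: {n}")
```

Ordering the cheap, high-yield filters first keeps the expensive manual review for the last few candidates.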

The diagnosis

Variant: SCN1A · NM_001165963.4:c.4970G>A · p.Arg1657His
CADD: 32 (top 0.1%)
REVEL: 0.89 · likely pathogenic
gnomAD: absent in 140,000+
De novo · confirmed parents

Diagnosis: Dravet syndrome
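
The de novo call rests on a trio comparison: the child carries the variant and neither parent does. A minimal sketch, assuming genotypes are encoded as alternate-allele counts and that parentage and parental sequencing coverage have already been confirmed; `is_de_novo` is a hypothetical helper:

```python
def is_de_novo(child, mother, father):
    """Genotypes are alt-allele counts: 0 = hom-ref, 1 = het, 2 = hom-alt."""
    return child >= 1 and mother == 0 and father == 0


print(is_de_novo(child=1, mother=0, father=0))  # True
print(is_de_novo(child=1, mother=1, father=0))  # False (inherited)
```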

What the diagnosis enables

Treatment optimized — specific anti-seizure medications
Avoid contraindicated drugs (sodium channel blockers worsen Dravet)
Low recurrence risk for future children (de novo)
Family connected to Dravet support resources
Potential gene therapy / antisense trial enrollment

§ 8

Summary

What to take away

Annotation answers 4 questions: location · function · frequency · clinical
Three-tier database: dbSNP → gnomAD → ClinVar
HGVS gives every variant one canonical name
Predictors (CADD, REVEL, AlphaMissense) — vote, never veto
ACMG framework: structured evidence → 5-tier classification
Population diversity is methodology, not optics

End of Part I · The Human Genome

Ch 1 · Human Genome Project — the reference
Ch 2 · T2T — closing the gaps
Ch 3 · Genome organization
Ch 4 · Pangenome
Ch 5 · Sequencing technology
Ch 6 · NGS applications
Ch 7 · Annotation & databases — making sense of it

Next · Part II

We have the tools.
Now let's look at the
variants themselves.

Part II · Genetic Variation — types · mechanisms · consequences