BSMS205 · Genetics

Annotation &
Databases

Chapter 7 · Part I · The Human Genome
A question to start with

You sequenced a patient.
You got 5 million variants.
Now what?

The data deluge · per genome

Variant type | Count per person
SNVs · single nucleotide variants | ~4 – 5 million
Small indels · insertions / deletions | ~400,000 – 500,000
Structural variants · large rearrangements | thousands
Causal variant for a rare disease | typically 1

A needle-in-a-haystack problem at scale.

The core operation

Annotation
adds meaning to each variant.

  • Raw VCF: chr17:43,106,487 · A → C (GRCh38 · BRCA1 lies on the minus strand, so coding T → G reads as genomic A → C)
  • Annotated: BRCA1 · missense · p.Cys61Gly · pathogenic

Four questions for every variant

  1. LOCATION — gene? exon? regulatory region?
  2. FUNCTION — protein change? splicing effect?
  3. POPULATION — common or rare?
  4. CLINICAL — known disease association?
Every annotation pipeline is some answer
to these four questions.
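
One way to picture the four questions is as fields on a record that the pipeline fills in one by one. The sketch below is illustrative only: the class name `AnnotatedVariant`, its field names, the gene label `GENE1`, and the placeholder coordinates are all assumptions, not the schema of any particular tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AnnotatedVariant:
    # Raw call, as it would appear in a VCF row
    chrom: str
    pos: int
    ref: str
    alt: str
    # Q1 LOCATION
    gene: Optional[str] = None
    # Q2 FUNCTION
    consequence: Optional[str] = None
    # Q3 POPULATION
    allele_frequency: Optional[float] = None
    # Q4 CLINICAL
    clinical_significance: Optional[str] = None

    def unanswered(self) -> List[str]:
        """Return the annotation fields that are still empty."""
        questions = ["gene", "consequence", "allele_frequency",
                     "clinical_significance"]
        return [q for q in questions if getattr(self, q) is None]


# Placeholder coordinates and gene name (hypothetical, for illustration only)
raw = AnnotatedVariant(chrom="chr1", pos=12345, ref="T", alt="G")
annotated = AnnotatedVariant(
    "chr1", 12345, "T", "G",
    gene="GENE1",
    consequence="missense",
    allele_frequency=0.0,          # 0.0 means "looked up, not observed"
    clinical_significance="pathogenic",
)

print(raw.unanswered())        # every question is still open
print(annotated.unanswered())  # nothing left to answer
```

Keeping "not yet looked up" (`None`) distinct from "looked up and absent" (`0.0`) matters, because absence from a population database is itself evidence.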

Roadmap for today

  1. Annotation: the four core questions
  2. The three-tier database system
  3. HGVS nomenclature & a worked variant
  4. Computational pathogenicity predictors
  5. ACMG/AMP classification framework
  6. Biobanks & population diversity
  7. A clinical case · the filtering funnel
§ 1

The Four
Core Questions

The four-question frame

Four core questions of variant annotation: location, function, population, clinical
Figure 1. Every variant must be evaluated across four dimensions: genomic location, functional consequence, population frequency, and clinical relevance.

Question 1 · Location

  • Coding exon — most likely to disrupt protein
  • Intron — usually neutral, except splice sites
  • 5' / 3' UTR — translation, stability
  • Promoter / enhancer — regulatory
  • Intergenic — often unannotated function

Question 2 · Function

Class | Effect | Severity
Stop-gain | Premature stop · truncated protein | HIGH
Frameshift | Reading frame disrupted | HIGH
Splice site | Exon skipping · intron retention | HIGH
Missense | One amino acid changed | MODERATE
Synonymous | Silent · same amino acid | LOW
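
The severity column can be read as a lookup table that lets a pipeline sort variants so the most disruptive ones are reviewed first. A minimal sketch, assuming the class labels are exactly the ones in the table above; the names `IMPACT`, `impact`, and `rank_by_impact`, and the `UNKNOWN` fallback, are hypothetical and not taken from any annotation tool.

```python
from typing import Iterable, List

# Severity tiers copied from the table; keys use the table's own wording.
IMPACT = {
    "stop-gain": "HIGH",
    "frameshift": "HIGH",
    "splice site": "HIGH",
    "missense": "MODERATE",
    "synonymous": "LOW",
}
# Sort order: lower number means reviewed earlier.
TIER_ORDER = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "UNKNOWN": 3}


def impact(consequence: str) -> str:
    """Map a consequence class to its severity tier (UNKNOWN if unlisted)."""
    return IMPACT.get(consequence.strip().lower(), "UNKNOWN")


def rank_by_impact(consequences: Iterable[str]) -> List[str]:
    """Order consequence classes from most to least severe."""
    return sorted(consequences, key=lambda c: TIER_ORDER[impact(c)])


ranked = rank_by_impact(["synonymous", "missense", "stop-gain"])
print(ranked)
```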

Question 3 · Population frequency

  • Common variant (>1%) → not a severe rare disease cause
  • Rare variant (<0.1%) → candidate for rare disease
  • Singleton or absent → flag for follow-up
A variant common in the population
is very unlikely to cause severe, highly penetrant childhood disease.

Question 4 · Clinical

  • Has this exact variant been seen before?
  • If yes — was it linked to a disease?
  • If linked — by how many labs, on what evidence?

Question four is where ClinVar lives.

Tools that answer these questions

Tool | Strength | Output example
VEP (Ensembl) | Comprehensive location · regulatory regions | "CFTR exon 11, stop-gain, p.Gly542*"
ANNOVAR | Multi-database integration · filtering | Adds gnomAD frequencies, conservation, disease links
SnpEff | Automatic impact tier | HIGH / MODERATE / LOW / MODIFIER

In practice, researchers run multiple tools for cross-validation.

§ 2

The Three-Tier
Database System

The hierarchy

Three-tier variant database system: dbSNP, gnomAD, ClinVar
Figure 2. dbSNP catalogs known variants, gnomAD provides population frequencies, ClinVar offers clinical interpretations. Each tier answers a different question.

Tier 1 · dbSNP — the catalog

  • NCBI · maintained since 1998
  • Over 1.1 billion variant sites
  • Each variant gets a unique rs number — e.g. rs429358
  • Question answered: "Has this variant been documented?"

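A common variant with an old rs ID is usually benign for severe disease, but the rs number itself is only a catalog entry: dbSNP lists pathogenic variants too.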

Tier 2 · gnomAD — the frequency reference

  • Aggregates sequencing data from 140,000+ individuals
  • Allele frequencies by population · African, East Asian, European, etc.
  • Question answered: "How common is this variant?"
  • Provides constraint scores — pLI, LOEUF, missense Z

gnomAD logic in practice

Allele frequency > 1%

  • Probably not the cause
  • Too common for severe disease
  • De-prioritize

Absent from 140,000+

  • Suspicious — investigate
  • Either ultra-rare or novel
  • High candidate priority
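
This triage logic is simple enough to write down directly. A minimal sketch, assuming the cutoffs quoted in this chapter (common above 1%, rare below 0.1%); the function name, the returned labels, and the "intermediate" bucket for frequencies between the two cutoffs are assumptions added for illustration.

```python
from typing import Optional

COMMON_CUTOFF = 0.01   # > 1%: too common for severe rare disease
RARE_CUTOFF = 0.001    # < 0.1%: rare-disease candidate


def frequency_triage(af: Optional[float]) -> str:
    """Turn a population allele frequency into a review priority."""
    if af is None or af == 0:
        # Not observed in the reference cohort: ultra-rare or novel
        return "absent: high priority"
    if af > COMMON_CUTOFF:
        return "common: de-prioritize"
    if af < RARE_CUTOFF:
        return "rare: candidate"
    return "intermediate: needs judgment"


for af in (None, 0.05, 0.0002, 0.004):
    print(af, "->", frequency_triage(af))
```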

Why population diversity matters

A variant common in East Asians
but absent from European data
can be wrongly labeled pathogenic.
  • Reference data must match patient ancestry
  • Old gnomAD versions were European-skewed
  • Korean Variant Archive (KOVA): 22.8% better filtering for Korean patients
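
One common way to guard against this mistake is to filter on the highest frequency seen in any population rather than on a single group. The sketch below uses made-up frequencies and a hypothetical helper name to show how the verdict changes once East Asian data is included.

```python
from typing import Dict


def max_population_af(af_by_population: Dict[str, float]) -> float:
    """Highest allele frequency observed in any population (0.0 if none)."""
    return max(af_by_population.values(), default=0.0)


# Hypothetical frequencies for one variant
european_only = {"European": 0.0}
with_east_asian = {"European": 0.0, "East Asian": 0.03}

print(max_population_af(european_only))    # looks absent, so looks suspicious
print(max_population_af(with_east_asian))  # 3% in East Asians: too common
```

With only European reference data the variant passes a rarity filter; with matched data it is filtered out as common.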

Tier 3 · ClinVar — clinical interpretation

  • NCBI clinical variant database
  • Links variants → diseases with evidence-based classifications
  • Submitted by labs, hospitals, expert panels worldwide
  • Question answered: "Is this variant disease-causing?"

The five-tier classification

Class | Meaning
Pathogenic | Causes disease · strong evidence
Likely pathogenic | Probably causes disease · > 90%
VUS | Variant of Uncertain Significance
Likely benign | Probably harmless · > 90%
Benign | Harmless · strong evidence

The VUS challenge

  • Many variants lack sufficient evidence
  • Cannot return as pathogenic or benign
  • Patients often recheck annually as data accrues
  • VUS reclassification is the slow grind of clinical genomics
Today's VUS is tomorrow's diagnosis
— or tomorrow's reassurance.

Three databases · three questions

Tier | Database | Asks | Scale
1 | dbSNP | Has it been seen? | 1.1 billion
2 | gnomAD | How common is it? | 140,000+ samples
3 | ClinVar | Does it cause disease? | 3 million entries
§ 3

HGVS &
A Worked Variant

HGVS · the variant naming standard

  • Human Genome Variation Society nomenclature
  • One unambiguous name for any variant
  • Three coordinate systems: g. genomic · c. coding · p. protein
NM_007294.3:c.181T>G → p.Cys61Gly
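
Because an HGVS name is structured, it can be taken apart mechanically. The sketch below is a deliberately narrow parser, assuming only simple coding substitutions at positive exonic positions (no intronic offsets, UTR positions, indels, or duplications); the function name and the returned keys are hypothetical. It also derives the codon number from the coding position, which is how c.181 lines up with residue 61.

```python
import re
from typing import Dict, Union

# transcript accession with version, then c.<position><ref>><alt>
_SUBSTITUTION = re.compile(
    r"^(?P<transcript>[A-Z]{2}_\d+\.\d+):c\.(?P<position>\d+)"
    r"(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_coding_substitution(hgvs: str) -> Dict[str, Union[str, int]]:
    """Split a simple c. substitution into its parts and compute the codon."""
    compact = "".join(hgvs.split())  # tolerate stray spaces around the colon
    match = _SUBSTITUTION.match(compact)
    if match is None:
        raise ValueError(f"not a simple coding substitution: {hgvs!r}")
    position = int(match["position"])
    return {
        "transcript": match["transcript"],
        "position": position,
        "ref": match["ref"],
        "alt": match["alt"],
        # Coding positions are 1-based; three bases per codon
        "codon": (position - 1) // 3 + 1,
    }


parsed = parse_coding_substitution("NM_007294.3:c.181T>G")
print(parsed["codon"])  # coding base 181 is the first base of codon 61
```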

Worked example · BRCA1 c.181T>G

Step 1 · Location

  • Gene: BRCA1
  • Exon 5
  • RING domain

Step 2 · Function

  • Missense
  • Cys61 → Gly
  • Loses zinc-binding cysteine

Worked example · continued

Step 3 · Population

  • gnomAD: absent
  • 0 carriers in 140,000+ individuals

Step 4 · Clinical

  • ClinVar: Pathogenic
  • 30+ submissions
  • Founder mutation in some populations

Verdict: Clinically actionable pathogenic variant — informs cancer screening & family testing.

§ 4

When ClinVar
Says Nothing

The novel variant problem

  • Most variants found in patients are not in ClinVar
  • Especially missense — many possible amino acid swaps
  • Need to predict functional effect in silico
For an absent or single-submission variant,
computational scores are the next step.

Three modern predictors

Tool | Score | Threshold | Best for
CADD | Phred-scaled | >20 = top 1% · >30 = top 0.1% | General variant impact
REVEL | 0 – 1 | >0.5 likely pathogenic | Missense in Mendelian disease
AlphaMissense | Benign / Ambiguous / Pathogenic | 3-class output | Structure-based · 71M variants
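
The thresholds in this table can be combined into a simple agreement check. This is a sketch under stated assumptions: it uses only the cutoffs shown above (CADD > 20, REVEL > 0.5, AlphaMissense class "Pathogenic"), treats each predictor as a yes/no vote, and the function names and summary labels are hypothetical. Real pipelines typically use calibrated, gene-specific thresholds.

```python
from typing import Dict, Optional


def predictor_votes(cadd: Optional[float] = None,
                    revel: Optional[float] = None,
                    alphamissense: Optional[str] = None) -> Dict[str, bool]:
    """One True/False vote per available predictor; missing scores are skipped."""
    votes: Dict[str, bool] = {}
    if cadd is not None:
        votes["CADD"] = cadd > 20
    if revel is not None:
        votes["REVEL"] = revel > 0.5
    if alphamissense is not None:
        votes["AlphaMissense"] = alphamissense.strip().lower() == "pathogenic"
    return votes


def summarize(votes: Dict[str, bool]) -> str:
    """Report whether the available predictors agree."""
    if not votes:
        return "no computational evidence"
    if all(votes.values()):
        return "concordant: predicted damaging"
    if not any(votes.values()):
        return "concordant: not predicted damaging"
    return "discordant"


print(summarize(predictor_votes(cadd=32, revel=0.89)))
print(summarize(predictor_votes(cadd=25, revel=0.2, alphamissense="Benign")))
```

A concordant result is still only one line of evidence; the summary is an input to classification, not a classification.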

An important limitation

These are predictions, not proof.
Functional studies or clinical evidence
are required for confirmation.
  • Even AlphaMissense gets some known pathogenic variants wrong
  • Use as one piece of evidence — never alone
  • ACMG framework explicitly down-weights computational evidence

Integrating evidence

Integration of multiple evidence sources for variant interpretation
Figure 3. Clinical interpretation requires convergence of population data, databases, computational predictions, and patient phenotype.
§ 5

The ACMG/AMP
Classification Framework

The ACMG idea

  • Define standardized evidence codes · ~28 of them
  • Weight each code: very strong · strong · moderate · supporting
  • Combine codes → final 5-tier classification
  • Two axes: pathogenic evidence vs benign evidence

Key pathogenic evidence codes

Code | Weight | Meaning
PVS1 | Very strong | Null variant in a LoF-disease gene
PS1 | Strong | Same amino-acid change as known pathogenic
PS2 | Strong | De novo · confirmed parents
PM1 | Moderate | Critical functional domain · hotspot
PM2 | Moderate | Absent / ultra-rare in population data
PP3 | Supporting | Multiple computational lines agree

Key benign evidence codes

Code | Weight | Meaning
BA1 | Stand-alone | Allele frequency > 5% in any population
BS1 | Strong | Higher frequency than expected for the disease
BS3 | Strong | Functional studies show no damaging effect
BP4 | Supporting | Computational lines agree: benign

Combining codes · example

BRCA1 c.181T>G accumulates:

  • PS1 · same amino acid change as known pathogenic
  • PM1 · RING domain · critical for function
  • PM2 · absent in gnomAD
  • PP3 · CADD, REVEL, AlphaMissense all agree
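1 strong + 2 moderate + 1 supporting → Likely pathogenic under the 2015 ACMG/AMP combining rules
Pathogenic needs more: a second strong code (e.g. PS3 · functional data), a third moderate code, or a second supporting code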
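
The combining step is rule-based, so it can be checked in code. A minimal sketch, assuming the 2015 ACMG/AMP combining rules for the pathogenic side only: benign codes, conflicting evidence, and later point-based refinements are left out, and the function name `classify` is hypothetical.

```python
from collections import Counter
from typing import Iterable


def classify(codes: Iterable[str]) -> str:
    """Combine pathogenic evidence codes into a classification."""
    # Strip the trailing number to get the weight class: "PVS1" -> "PVS"
    n = Counter(code.rstrip("0123456789") for code in codes)
    pvs, ps, pm, pp = n["PVS"], n["PS"], n["PM"], n["PP"]

    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm >= 1 and pp >= 1) or pp >= 2))
        or ps >= 2
        or (ps == 1 and (pm >= 3 or (pm == 2 and pp >= 2) or (pm == 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs >= 1 and pm == 1)
        or (ps == 1 and 1 <= pm <= 2)
        or (ps == 1 and pp >= 2)
        or pm >= 3
        or (pm == 2 and pp >= 2)
        or (pm == 1 and pp >= 4)
    )
    if pathogenic:
        return "Pathogenic"
    if likely_pathogenic:
        return "Likely pathogenic"
    return "Uncertain significance"


print(classify(["PS1", "PM1", "PM2", "PP3"]))
print(classify(["PS1", "PS2", "PM1", "PM2", "PP3"]))
```

Under these rules the four codes listed for BRCA1 c.181T>G (one strong, two moderate, one supporting) evaluate to Likely pathogenic; adding a second strong code moves the result to Pathogenic.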
§ 6

Biobanks &
Genome Annotation

GENCODE vs RefSeq

Feature | GENCODE | RefSeq
Coverage | ~45,000 genes (coding + non-coding) | ~20,000 protein-coding
Curation | Manual + automated | Primarily manual
IDs | ENSG / ENST | NM_ / NP_
Best for | Research · non-coding RNA | Clinical diagnostics

Clinical reports use RefSeq: NM_007294.3:c.5266dup
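
The ID prefixes in the table are enough to tell which annotation set an accession comes from. A small sketch, assuming only the four prefixes listed above; the function name `identify` and the returned labels are hypothetical, and both resources use further prefixes that are not covered here.

```python
from typing import Tuple

# prefix -> (annotation source, what the accession names)
PREFIXES = {
    "ENSG": ("Ensembl/GENCODE", "gene"),
    "ENST": ("Ensembl/GENCODE", "transcript"),
    "NM_": ("RefSeq", "mRNA transcript"),
    "NP_": ("RefSeq", "protein"),
}


def identify(accession: str) -> Tuple[str, str]:
    """Guess the source and type of an accession from its prefix."""
    for prefix, info in PREFIXES.items():
        if accession.startswith(prefix):
            return info
    return ("unknown", "unknown")


# The reference sequence is the part of an HGVS name before the colon
reference = "NM_007294.3:c.5266dup".split(":")[0]
print(reference, identify(reference))
print(identify("ENSG00000012048"))  # an Ensembl-style gene ID
```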

Biobanks · genotype + phenotype at scale

  • Large collections of DNA + health records
  • Question: "Do carriers of variant X have higher disease rates?"
  • Power: link variation to outcomes at population scale
  • The substrate of GWAS, drug discovery, drug response
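
The carrier question reduces to a 2 × 2 table of carriers vs non-carriers and cases vs controls, summarized by an odds ratio. A minimal sketch with made-up counts; the function name is hypothetical, and a real analysis would add a confidence interval and adjust for ancestry, age, and sex.

```python
def odds_ratio(carrier_cases: int, carrier_controls: int,
               noncarrier_cases: int, noncarrier_controls: int) -> float:
    """Odds of disease in carriers divided by odds of disease in non-carriers."""
    if carrier_controls == 0 or noncarrier_cases == 0:
        raise ValueError("odds ratio is undefined when a denominator cell is zero")
    return (carrier_cases * noncarrier_controls) / (carrier_controls * noncarrier_cases)


# Hypothetical biobank counts for variant X
result = odds_ratio(carrier_cases=30, carrier_controls=70,
                    noncarrier_cases=100, noncarrier_controls=900)
print(round(result, 2))  # above 1 means carriers have higher odds of disease
```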

Major biobanks

Biobank | Scale | Population | Strength
UK Biobank | 500,000 | UK adults | NHS records + imaging
All of Us | 245,000+ | US · diversity focus | Health disparities
KOVA | 5,305 | Korean | East Asian reference
FinnGen | 500,000+ | Finnish | Founder-effect rare variants
ToMMo | 150,000 | Japanese · 3 generations | Gene × environment longitudinal

Why diversity is methodology

Population-specific data
prevents misclassification.
  • Old GWAS literature: ~80% European participants
  • Polygenic risk scores trained in Europeans lose much of their accuracy in people of African ancestry
  • Korean rare-disease screening needs Korean reference data
§ 7

A Clinical Case
from Millions to One

The patient

  • 4-year-old · developmental delay
  • Recurrent seizures from infancy
  • Abnormal brain MRI
  • Three years of diagnostic odyssey · no answer
  • Whole-exome sequencing ordered

The filtering funnel

The variant filtering funnel from millions of variants to one causal variant
Figure 4. Sequential filtering through annotation, population frequency, clinical databases, computational prediction, and phenotype matching narrows millions of variants to one causal candidate.

The funnel in numbers

Step | Filter | Variants left
0 | WES total | 25,000
1 | Annotation: protein-affecting | ~200
2 | gnomAD < 0.5% | ~40
3 | Genes matching phenotype | 3
4 | CADD/REVEL/AlphaMissense agreement | 1
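
The funnel is a chain of filters applied in order, with a count kept after each step. The sketch below runs that chain on five toy variants. Everything in the data is hypothetical (the placeholder gene names, frequencies, and scores), the cutoffs are the ones quoted in this chapter (gnomAD < 0.5%, CADD > 20, REVEL > 0.5), and the AlphaMissense vote is omitted for brevity.

```python
from typing import Callable, Dict, List, Tuple

Variant = Dict[str, object]

PROTEIN_AFFECTING = {"stop-gain", "frameshift", "splice site", "missense"}
PHENOTYPE_GENES = {"SCN1A", "GENE_D"}  # hypothetical epilepsy gene panel

# Toy input: all values are made up for illustration
variants: List[Variant] = [
    {"gene": "GENE_A", "consequence": "synonymous", "af": 0.2, "cadd": 5, "revel": 0.0},
    {"gene": "GENE_B", "consequence": "missense", "af": 0.03, "cadd": 22, "revel": 0.6},
    {"gene": "GENE_C", "consequence": "missense", "af": 0.0001, "cadd": 24, "revel": 0.7},
    {"gene": "GENE_D", "consequence": "missense", "af": 0.0, "cadd": 12, "revel": 0.2},
    {"gene": "SCN1A", "consequence": "missense", "af": 0.0, "cadd": 32, "revel": 0.89},
]

steps: List[Tuple[str, Callable[[Variant], bool]]] = [
    ("protein-affecting", lambda v: v["consequence"] in PROTEIN_AFFECTING),
    ("gnomAD AF < 0.5%", lambda v: v["af"] < 0.005),
    ("gene matches phenotype", lambda v: v["gene"] in PHENOTYPE_GENES),
    ("predictors agree", lambda v: v["cadd"] > 20 and v["revel"] > 0.5),
]


def run_funnel(items: List[Variant], filters) -> Tuple[List[Variant], List[int]]:
    """Apply each filter in turn and record how many variants survive."""
    remaining = list(items)
    counts = [len(remaining)]
    for _name, keep in filters:
        remaining = [v for v in remaining if keep(v)]
        counts.append(len(remaining))
    return remaining, counts


survivors, counts = run_funnel(variants, steps)
print(counts)
print([v["gene"] for v in survivors])
```

The order of the filters does not change the final set, but putting cheap, high-yield filters first keeps the later, more expensive review steps small.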

The diagnosis

  • Variant: SCN1A · NM_001165963.4:c.4970G>A · p.Arg1657His
  • CADD: 32 (top 0.1%)
  • REVEL: 0.89 · likely pathogenic
  • gnomAD: absent in 140,000+
  • De novo · confirmed parents
Diagnosis: Dravet syndrome

What the diagnosis enables

  • Treatment optimized — specific anti-seizure medications
  • Avoid contraindicated drugs (sodium channel blockers worsen Dravet)
  • Low recurrence risk for future children (de novo)
  • Family connected to Dravet support resources
  • Potential gene therapy / antisense trial enrollment
§ 8

Summary

What to take away

  • Annotation answers 4 questions: location · function · frequency · clinical
  • Three-tier database: dbSNP → gnomAD → ClinVar
  • HGVS gives every variant one canonical name
  • Predictors (CADD, REVEL, AlphaMissense) — vote, never veto
  • ACMG framework: structured evidence → 5-tier classification
  • Population diversity is methodology, not optics

End of Part I · The Human Genome

  • Ch 1 · Human Genome Project — the reference
  • Ch 2 · T2T — closing the gaps
  • Ch 3 · Genome organization
  • Ch 4 · Pangenome
  • Ch 5 · Sequencing technology
  • Ch 6 · NGS applications
  • Ch 7 · Annotation & databases — making sense of it
Next · Part II

We have the tools.
Now let's look at the
variants themselves.

Part II · Genetic Variation — types · mechanisms · consequences