Note: October 9, 2024
Description
This track collection shows Combined Annotation Dependent Depletion scores.
CADD is a tool for scoring the deleteriousness of single nucleotide variants as
well as insertion/deletion variants in the human genome.
Some mutation annotations
tend to exploit a single information type (e.g., phastCons or phyloP for
conservation) and/or are restricted in scope (e.g., to missense changes). Thus,
a broadly applicable metric that objectively weights and integrates diverse
information is needed. Combined Annotation Dependent Depletion (CADD) is a
framework that integrates multiple annotations into one metric by contrasting
variants that survived natural selection with simulated mutations.
CADD scores strongly correlate with allelic diversity, pathogenicity of both
coding and non-coding variants, and experimentally measured regulatory effects.
They also rank causal variants within individual genome sequences higher than
non-causal variants.
Finally, CADD scores of complex trait-associated variants from genome-wide
association studies (GWAS) are significantly higher than matched controls and
correlate with study sample size, likely reflecting the increased accuracy of
larger GWAS.
A CADD score represents a ranking, not a prediction, and no threshold is defined
for a specific purpose. Higher scores indicate variants that are more likely to
be deleterious. Scores are PHRED-scaled, i.e.,
-10 * log10 of the variant's rank divided by the total number of possible substitutions,
so that variants with scores above 20 are
predicted to be among the 1.0% most deleterious possible substitutions in
the human genome. We recommend thinking carefully about what threshold is
appropriate for your application.
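The scaling described above can be sketched in a few lines of Python. This is only an illustration, assuming the usual PHRED convention of -10 * log10(rank / total), where rank 1 is the most deleterious substitution; the function names are hypothetical and are not part of CADD or the Genome Browser.

```python
import math


def phred_from_rank(rank: int, total: int) -> float:
    """PHRED-style score for a variant at position `rank` (1 = most
    deleterious) among `total` possible substitutions.

    Assumption: score = -10 * log10(rank / total).
    """
    if not 1 <= rank <= total:
        raise ValueError("rank must be between 1 and total")
    return -10.0 * math.log10(rank / total)


def top_fraction(score: float) -> float:
    """Fraction of all substitutions expected to score at least `score`."""
    return 10.0 ** (-score / 10.0)


if __name__ == "__main__":
    # A variant ranked in the top 1% of a (hypothetical) set of 1000 variants.
    print(phred_from_rank(10, 1000))
    # Fraction of substitutions at or above a score of 30.
    print(top_fraction(30))
```

Under this convention, each increase of 10 in the score corresponds to a tenfold smaller fraction of top-ranked substitutions.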
Display Conventions and Configuration
There are six subtracks of this track: four for single-nucleotide mutations
(one for each base, together showing all possible substitutions),
one for insertions, and one for deletions. All subtracks show the CADD Phred
score on mouseover. When zoomed in to base level, the mouseover shows the exact
score; where the mutated base is the same as the reference base, the score is 0.0.
PHRED-scaled scores are normalized to all potential ~9 billion SNVs, and
thereby provide an externally comparable unit for analysis. For example, a
scaled score of 10 or greater indicates a raw score in the top 10% of all
possible reference genome SNVs, and a score of 20 or greater indicates a raw
score in the top 1%, regardless of the details of the annotation set, model
parameters, etc.
The four single-nucleotide mutation tracks have a default viewing range of
score 10 to 50. As explained in the paragraph above, that results in
slightly less than 10% of the data displayed. The
deletion and insertion tracks have a default filter of 10-100, because they
display discrete items and not graphical data.
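The "slightly less than 10%" figure for the default viewing range can be checked with a short calculation. The sketch below assumes the scores follow the PHRED definition exactly (a fraction 10^(-s/10) of substitutions scores at least s) and ignores the zero placeholder values for reference alleles; the function name is hypothetical.

```python
def fraction_in_score_range(low: float, high: float) -> float:
    """Expected fraction of substitutions with a score between `low` and
    `high`, assuming a fraction 10 ** (-s / 10) scores at least s."""
    if low > high:
        raise ValueError("low must not exceed high")
    return 10.0 ** (-low / 10.0) - 10.0 ** (-high / 10.0)


if __name__ == "__main__":
    # Default viewing range of the single-nucleotide subtracks: 10 to 50.
    # 10% of substitutions score at least 10; 0.001% score at least 50.
    print(f"{fraction_in_score_range(10, 50):.5%}")
```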
Single nucleotide variants (SNV): For SNVs, there are three values at every
genome position, one for each possible
nucleotide mutation. The fourth value, "no mutation", representing
the reference allele (e.g., A to A), is always set to zero.
When using this track, zoom in until you can see every basepair at the
top of the display. Otherwise, there are several nucleotides per pixel under
your mouse cursor and instead of an actual score, the tooltip text will show
the average score of all nucleotides under the cursor. This is indicated by
the prefix "~" in the mouseover. Averages of scores are not useful for any
application of CADD.
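As an illustration of the "three values plus a zero" layout, the following sketch picks out the three real substitution scores at a single position once the values from the four subtracks have been read. It assumes that each subtrack holds the score for a mutation to that base; the function name and the example numbers are made up for illustration.

```python
BASES = ("A", "C", "G", "T")


def substitution_scores(ref_base: str, track_values: dict) -> dict:
    """Return the scores of the three possible substitutions at one position.

    `track_values` maps each base to the value read from that base's
    subtrack at this position (assumed layout). The entry for the
    reference base is the "no mutation" placeholder and is dropped.
    """
    ref = ref_base.upper()
    if ref not in BASES:
        raise ValueError(f"unexpected reference base: {ref_base!r}")
    return {alt: track_values[alt] for alt in BASES if alt != ref}


if __name__ == "__main__":
    # Made-up values for a position whose reference base is A.
    values = {"A": 0.0, "C": 12.3, "G": 8.1, "T": 15.0}
    print(substitution_scores("A", values))
```

Working with per-base values like this, rather than values averaged over several bases, matches the advice above to zoom in until individual basepairs are visible.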
Insertions and deletions: Scores are also shown on mouseover for a
set of insertions and deletions. On hg38, the set has been obtained from
gnomAD3. On hg19, the set of indels has been obtained from various sources
(gnomAD2, ExAC, 1000 Genomes, ESP). If your insertion or deletion of interest
is not in the track, you will need to use CADD's
online scoring tool
to obtain its score.
Methods
In CADD version 1.7, new features have been added to improve CADD scores for certain variant
effects, boosting the overall performance of CADD and bringing new developments to the community.
CADD v1.7 integrates annotations from recent efforts to assess variant effects, along with new
conservation and mutation scores.
CADD v1.7 supports only the major chromosomes of the hg38/GRCh38 reference genome (chromosomes 1-22,
X, and Y) and may be the last version to support the hg19/GRCh37 human reference genome.
This version includes scores derived from Evolutionary Scale Modeling (ESM) for assessing variants
in protein-coding regions, along with scores from a convolutional neural network (CNN) trained on
open chromatin sequences, used as a proxy for regulatory regions in the genome. The previously
included conservation scores have been updated with data from the Zoonomia project. New annotations
have also been added for 3' Untranslated Regions (3' UTRs), along with models of genome-wide
mutational rates. The gene and transcript models have been updated by advancing from Ensembl version
95 to version 110, and the Ensembl Variant Effect Predictor (VEP) has been upgraded accordingly.
The models in CADD v1.7 have been trained similarly to the version 1.6 release. The logistic
regression uses an L2 penalty with C = 1, and training was completed after thirteen L-BFGS
iterations using the sklearn library. The new models exhibit a high degree of similarity to the
previous release, with a Spearman correlation of 0.946 for CADD scores calculated for 100,000
randomly selected variants between CADD GRCh38-v1.6 and CADD GRCh38-v1.7. The v1.7 models perform
comparably to earlier versions in distinguishing known pathogenic variants (ClinVar) from common
variants (gnomAD) across the genome. Improvements in CADD v1.7 are particularly evident when
focusing on specific variant categories, such as missense or 3' UTR variants, where the latest
release includes updated annotations.
More information can be found at the
CADD site
and the Schubach et al., Nucleic Acids Res, 2024 publication.
Data were converted from the files provided on
the CADD Downloads website,
provided by the Kircher lab, using
custom Python scripts,
documented in our
makeDoc files.
Data access
CADD scores are freely available for all non-commercial applications from
the CADD website.
For commercial applications, see
the license instructions there.
The CADD data on the UCSC Genome Browser can be explored interactively with the
Table Browser or the
Data Integrator.
For automated download and analysis, the genome annotation is stored at UCSC in bigWig and bigBed
files that can be downloaded from
our download server.
The files for this track are called a.bw, c.bw, g.bw, t.bw, ins.bb and del.bb. Individual
regions or the whole genome annotation can be obtained using our tools bigWigToWig
or bigBedToBed, which can be compiled from the source code or downloaded as a precompiled
binary for your system. Instructions for downloading source code and binaries can be found
here.
The tools can also be used to obtain features confined to a given range, e.g.,
bigWigToBedGraph -chrom=chr1 -start=100000 -end=100500 http://hgdownload.soe.ucsc.edu/gbdb/hg19/cadd1.7/a.bw stdout
or
bigBedToBed -chrom=chr1 -start=100000 -end=100500 http://hgdownload.soe.ucsc.edu/gbdb/hg19/cadd1.7/ins.bb stdout
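For scripted access, the bigWigToBedGraph call shown above can be wrapped in a small Python helper. This is a minimal sketch, assuming the bigWigToBedGraph binary is installed and on the PATH and that its output is standard four-column bedGraph (chrom, start, end, value); the helper names are hypothetical and not part of the UCSC tools.

```python
import subprocess


def build_command(url: str, chrom: str, start: int, end: int) -> list:
    """Assemble the bigWigToBedGraph command line used in the example above."""
    return [
        "bigWigToBedGraph",
        f"-chrom={chrom}",
        f"-start={start}",
        f"-end={end}",
        url,
        "stdout",
    ]


def parse_bedgraph(text: str) -> list:
    """Parse bedGraph text into (chrom, start, end, value) tuples."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and optional header/comment lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, value = line.split()[:4]
        rows.append((chrom, int(start), int(end), float(value)))
    return rows


def fetch_scores(url: str, chrom: str, start: int, end: int) -> list:
    """Run bigWigToBedGraph and return the parsed intervals.

    Requires the UCSC binary and network access to the given URL.
    """
    result = subprocess.run(
        build_command(url, chrom, start, end),
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_bedgraph(result.stdout)
```

A call such as fetch_scores(url, "chr1", 100000, 100500), with url set to one of the .bw files above, would return one tuple per run of bases sharing the same score. Note that bedGraph coordinates are zero-based and half-open.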
Credits
Thanks to the CADD development team for providing precomputed data as simple tab-separated files.
References
Kircher M, Witten DM, Jain P, O'Roak BJ, Cooper GM, Shendure J.
A general framework for estimating the relative pathogenicity of human genetic variants.
Nat Genet. 2014 Mar;46(3):310-5.
PMID: 24487276;
PMC: PMC3992975
Rentzsch P, Witten D, Cooper GM, Shendure J, Kircher M.
CADD: predicting the deleteriousness of variants throughout the human genome.
Nucleic Acids Res. 2019 Jan 8;47(D1):D886-D894.
PMID: 30371827;
PMC: PMC6323892
Schubach M, Maass T, Nazaretyan L, Röner S, Kircher M.
CADD v1.7: using protein language models, regulatory CNNs and other nucleotide-level scores to
improve genome-wide variant predictions.
Nucleic Acids Res. 2024 Jan 5;52(D1):D1143-D1154.
PMID: 38183205; PMC: PMC10767851