Description
Genomic Evolutionary Rate Profiling (GERP) is a method for producing
position-specific estimates of evolutionary constraint using maximum likelihood evolutionary rate
estimation. It also discovers "constrained elements" where multiple positions combine to give a
signal that is indicative of a putative functional element; this track shows the position-specific
scores only, not the element predictions.
Constraint intensity at each individual alignment
position is quantified in terms of a "rejected substitutions" (RS) score, defined as the number of
substitutions expected under neutrality minus the number of substitutions "observed" at the
position. This concept, together with a first implementation of GERP, was presented in Cooper et
al. (2005). GERP++, as described in Davydov et al. (2010), uses a more rigorous set of algorithms to
calculate site-specific RS scores and to discover evolutionarily constrained elements.
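As a minimal illustration of the definition above, the RS score for a single position is the difference between two substitution counts. The function name and the example numbers below are hypothetical and are not part of the GERP++ package.

```python
def rejected_substitutions(expected_neutral, observed):
    """Return the RS score for one alignment position.

    expected_neutral: substitutions expected under neutrality
    observed: substitutions estimated ("observed") at the position
    """
    return expected_neutral - observed


# Hypothetical positions, each with 6.18 expected neutral substitutions.
constrained = rejected_substitutions(6.18, 0.5)    # substitution deficit, RS > 0
unconstrained = rejected_substitutions(6.18, 7.0)  # substitution excess, RS < 0
```

A positive result means fewer substitutions were observed than expected; a negative result means more were observed than expected.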
Sites are scored independently. Positive scores represent a substitution deficit (i.e., fewer
substitutions than the average neutral site) and thus indicate that a site may be under evolutionary
constraint. Negative scores indicate that a site is probably evolving neutrally. They should not
be interpreted as evidence of accelerated evolution, because there are too many strong
confounders, such as alignment uncertainty or rate variance. Positive scores scale with the level of
constraint, such that the greater the score, the greater the level of evolutionary constraint
inferred to be acting on that site.
We applied GERP, as implemented in the GERP++ software
package, to quantify the level of evolutionary constraint acting on each site in hg19, based on an
alignment of 35 mammals to hg19 with a maximum phylogenetic scope of 6.18 substitutions per neutral
site. Gaps in the alignment are treated as missing data, which means that the number of
substitutions per neutral site will be less than 6.18 in sites where one or more species has a gap.
Thus, RS scores range from a maximum of 6.18 down to a below-zero minimum, which we cap at -12.36.
RS scores will vary with alignment depth and level of sequence conservation. A score of 0 indicates
that the alignment was too shallow at that position to get a meaningful estimate of constraint.
Should classification into "constrained" and "unconstrained" sites be desired, a threshold may be
chosen above which sites are considered "constrained". In practice, we find that an RS score
threshold of 2 provides high sensitivity while still strongly enriching for truly constrained sites.
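The score range and the suggested threshold can be sketched as follows. The constants come from the text above; the helper names are hypothetical, and treating the threshold as exclusive (a score must be strictly greater than 2) is an assumption based on the wording "above which".

```python
MAX_NEUTRAL_RATE = 6.18      # maximum phylogenetic scope (substitutions per neutral site)
MIN_RS = -12.36              # lower cap applied to RS scores
CONSTRAINED_THRESHOLD = 2.0  # suggested RS threshold for "constrained" sites


def cap_rs(rs):
    """Apply the lower cap; the upper bound is set by the neutral rate itself."""
    return max(rs, MIN_RS)


def is_constrained(rs, threshold=CONSTRAINED_THRESHOLD):
    """Classify a site as constrained if its RS score exceeds the threshold.

    Assumption: the comparison is strict, so a score equal to the
    threshold is not called constrained.
    """
    return rs > threshold
```

Note that a score of exactly 0 is classified as unconstrained by this sketch even though, as stated above, it signals an alignment too shallow to score rather than a neutral site.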
Methods
Given a multiple sequence alignment and a phylogenetic tree with branch
lengths representing the neutral rate between the species within that alignment, GERP++ quantifies
constraint intensity at each individual position in terms of rejected substitutions, the difference
between the neutral rate and the estimated evolutionary rate at the position. GERP++ begins with a
pre-defined neutral tree relating the genomes present within the alignment that supplies both the
total neutral rate across the entire tree and the relative length of each individual branch. For
each alignment column, we estimate a scaling factor, applied uniformly to all branches of the tree,
that maximizes the probability of the observed nucleotides in the alignment column. The product of
the scaling factor and the neutral rate defines the 'observed' rate of evolution at each position.
GERP++ uses the HKY85 model of evolution with the transition/transversion ratio set to 2.0 and
nucleotide frequencies estimated from the multiple alignment.
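The relationship between the per-column scaling factor and the RS score can be written out directly. This is a sketch of the arithmetic only: it assumes the scaling factor has already been estimated by maximum likelihood, and the function name is hypothetical rather than taken from GERP++.

```python
def rs_from_scaling_factor(neutral_rate, scaling_factor):
    """Convert a fitted scaling factor into a rejected-substitutions score.

    neutral_rate: total neutral rate of the tree for this column
    scaling_factor: multiplier applied uniformly to all branches,
        assumed here to be the maximum likelihood estimate
    """
    observed_rate = scaling_factor * neutral_rate  # the 'observed' rate
    return neutral_rate - observed_rate            # rejected substitutions


# With the full neutral rate of 6.18 substitutions per site:
fully_conserved = rs_from_scaling_factor(6.18, 0.0)  # no substitutions observed
neutral = rs_from_scaling_factor(6.18, 1.0)          # evolving at the neutral rate
fast = rs_from_scaling_factor(6.18, 3.0)             # three times the neutral rate
```

A scaling factor of 0 yields the maximum score for the column (its neutral rate), a factor of 1 yields 0, and factors above 1 yield negative scores.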
To generate RS scores for
each position in the human genome, we used GERP++ to analyze the TBA alignment of hg19 to 35 other
mammalian species (listed here:
http://hgdownload.soe.ucsc.edu/goldenPath/hg19/multiz46way/), spanning over 3 billion
positions (see the description for the 'Conservation' track for details of this alignment). The
alignment was compressed to remove gaps in the human sequence, and GERP++ scores were computed for
every position with at least 3 ungapped species present. Importantly, the human sequence was
removed from the alignment and not included in either the neutral rate estimation or the
site-specific "observed" estimates, and therefore is not included in the RS score. This is
consistent with the published work on GERP, and is done to eliminate the confounding influence of
deleterious derived alleles segregating in the human population that are present in the reference
sequence. The phylogenetic tree follows the generally accepted topology for these species, and
neutral branch lengths were estimated from 4-fold degenerate sites in the alignment.
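The column filtering described above can be sketched as follows. Several details are assumptions made for illustration: a column is represented as a mapping from assembly name to base, gaps are written as "-", and the minimum of 3 ungapped species is counted after the human sequence has been removed. The function name is hypothetical.

```python
REFERENCE = "hg19"        # human sequence, excluded from scoring
GAP = "-"                 # assumed gap character
MIN_UNGAPPED_SPECIES = 3  # minimum depth required to score a column


def scorable_bases(column):
    """Return the non-human, ungapped bases of a column, or None if too shallow.

    column: dict mapping assembly name -> aligned base at one position
    """
    bases = {
        species: base
        for species, base in column.items()
        if species != REFERENCE and base != GAP
    }
    if len(bases) < MIN_UNGAPPED_SPECIES:
        return None  # too few species to estimate constraint
    return bases


# Illustrative columns (assembly names chosen for the example only).
deep = {"hg19": "A", "panTro2": "A", "mm9": "A", "canFam2": "G"}
shallow = {"hg19": "A", "panTro2": "A", "mm9": "-", "canFam2": "A"}
```

Here `scorable_bases(deep)` keeps the three non-human bases, while `scorable_bases(shallow)` returns `None` because only two non-human species are ungapped.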
Credits
The RS
scores were generated by David Goode, Dept. of Genetics, Stanford University. GERP++ was developed
by Eugene Davydov and Serafim Batzoglou, Dept. of Computer Science, Stanford University; Arend
Sidow, Depts. of Pathology and Genetics, Stanford University; and Gregory Cooper, HudsonAlpha
Institute for Biotechnology, Huntsville, AL.
References
Davydov EV, Goode DL, Sirota M, Cooper GM, Sidow A, Batzoglou S. Identifying a high fraction of the human genome to be under selective constraint using GERP++. PLoS Comput Biol. 2010 Dec 2;6(12):e1001025.
Cooper GM, Stone EA, Asimenos G; NISC Comparative Sequencing Program, Green ED, Batzoglou S, Sidow A. Distribution and intensity of constraint in mammalian genomic sequence. Genome Res. 2005 Jul;15(7):901-13.
For more information on using GERP to detect putatively functional genetic variation:
Cooper GM, Goode DL, Ng SB, Sidow A, Bamshad MJ, Shendure J, Nickerson DA. Single-nucleotide evolutionary constraint scores highlight disease-causing mutations. Nature Methods. 2010 Apr;7(4):250-1.
Goode DL, Cooper GM, Schmutz J, Dickson M, Gonzales E, Tsai M, Karra K, Davydov E, Batzoglou S, Myers RM, Sidow A. Evolutionary constraint facilitates interpretation of genetic variation in resequenced human genomes. Genome Res. 2010 Mar;20(3):301-10.