Description
The ENCODE project has revealed the functional elements of segments
of the human genome in unprecedented detail. However, the ability to
distinguish between transcripts designated for translation into protein
and those that serve purely regulatory roles remains elusive. A
standard means to determine if translation is occurring is to measure
protein produced by transcripts via mass spectrometry-based proteogenomic
mapping. In this process, proteins are digested to peptides using a protease
such as trypsin, and these peptides are chromatographically fractionated and
fed into a tandem mass spectrometer (MS/MS). This process creates a signature
series of fragment masses which can be scanned against the theoretical
translation and proteolytic digest of an entire genome to identify the genomic
origins of sample proteins (Giddings et al., 2003).
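The "theoretical proteolytic digest" step can be illustrated with a minimal sketch. It assumes a simplified trypsin rule (cleave after every lysine, K, or arginine, R) and is not the implementation used by the mapping software; the function name `tryptic_digest` and the parameter `max_missed` are hypothetical.

```python
def tryptic_digest(protein, max_missed=1):
    """Return peptides from cleaving after K or R, allowing missed cleavages.

    Simplified rule (assumption): cleave after every K or R. Real trypsin
    models often suppress cleavage before proline; that is ignored here.
    """
    # Positions just past each K or R, plus the two sequence ends.
    cuts = [0] + [i + 1 for i, aa in enumerate(protein) if aa in "KR"]
    if cuts[-1] != len(protein):
        cuts.append(len(protein))

    peptides = []
    for i in range(len(cuts) - 1):
        # Joining j + 1 consecutive fragments corresponds to j missed cleavages.
        for j in range(max_missed + 1):
            if i + j + 1 < len(cuts):
                peptides.append(protein[cuts[i]:cuts[i + j + 1]])
    return peptides
```

For example, `tryptic_digest("AKCRD", max_missed=0)` yields the fully cleaved peptides `AK`, `CR` and `D`, while allowing one missed cleavage additionally yields `AKCR` and `CRD`.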
This proteogenomic track displays mass spectrometry data that have been
matched to genomic sequences for selected cell lines, using a workflow and
software specifically designed for this purpose. The track can be used to
identify which parts of the genome are translated into proteins, to verify
which transcripts discovered by other ENCODE experiments are protein-coding,
and to reveal new genes, splice variants, and proteins with post-translational
modifications (PTMs). Of particular interest is the possibility
of uncovering the translation of small open reading frames (ORFs), antisense
transcripts, or protein-coding regions that were previously annotated as
introns.
Display Conventions and Configuration
This track is a multi-view composite track that contains multiple data types
(views). For each view, there are multiple subtracks that
display individually on the browser. Instructions for configuring multi-view
tracks are
here.
To show only selected subtracks, uncheck the boxes next to the tracks that
you wish to hide. Color differences among the views are arbitrary. They provide a
visual cue for distinguishing between the different cell types and compartments.
Metadata for a particular subtrack can be found by clicking the down arrow in the list of subtracks.
This track shows peptide mappings as contiguous rectangular
items rendered in grayscale according to their
score, with darker items representing higher-confidence peptide mappings.
The name of each item is the amino acid sequence of the peptide where a period (.)
at the end of a name signifies a stop codon.
- Peptide Genome and GENCODE Mapping (Filtered)
- Peptide mapping results based on hg19 and GENCODE annotation for mass-spectrometry-based proteomics experiments, filtered for a false discovery rate (FDR) below 5%. Specific field descriptions can be found below.
- Modified Peptide Genome and GENCODE Mapping (Filtered)
- Modified peptide mapping results based on hg19 and GENCODE annotation for mass-spectrometry-based proteomics experiments, filtered for a false discovery rate (FDR) below 5%.
Unfiltered views are available on the Downloads page.
Fields specific to Proteogenomic tracks include:
- The Item name is the peptide sequence. For peptides with post-translational modifications (PTMs),
a number representing the integer portion of the PTM mass is appended to the name.
The peptide sequence appears as a short label beside
the main Genome Browser display window, depending on the view configuration.
- The Score determines the shade of the
displayed rectangular items and is derived from the rawScore (see below) given by the proteomics
peptide mapping software Peppy. It is computed as
[(rawScore - rawScore at 10% FDR cutoff) /
(rawScore at near 0% FDR cutoff - rawScore at 10% FDR cutoff)] * 1000,
and is then converted to an integer. Raw scores above the near-0% FDR threshold have
a score of 1000 (best), while those below the 10% FDR threshold have a score of 0 (worst).
- The rawScore is given by Peppy
and is expressed as the negative log10 of the p-value, which reflects
the confidence of the match between the peptides and the spectra.
On the item details pages, rawScore is labeled: Raw score for a peptide/spectrum match.
- The spectrumId is an identifier of the
spectrum associated with the peptide mapping and can be used to track the original spectrum.
On the item details pages, spectrumId is labeled: An identifier of the spectrum
associated with the peptide mapping.
- The peptideRank is the rank of the
peptide/spectrum match among the peptides matched to the same spectrum. A spectrum
can be chimeric (containing more than one peptide) and can therefore be mapped
to two or more distinct peptides. Here, only the top-scoring match is reported. If more
than one peptide tied for the top score, all of them are included and all
matches have a peptideRank of 1. On the
item details pages, peptideRank is labeled: Rank of the peptide/spectrum match,
for spectrum matching to different peptides.
- The peptideRepeatCount indicates the
number of places in the genome where the peptide is encoded for a peptide/spectrum
match. It reflects the prevalence or uniqueness of the peptide mapping in the genome.
Those peptides mapped to only a few genomic locations will have a low
peptideRepeatCount, whereas those peptides mapped
to highly duplicated regions will have a high peptideRepeatCount. Peptides with
a peptideRepeatCount greater than 10 were
removed from the track (this field is for regular peptides only). On the
item details pages, peptideRepeatCount is labeled: Indicates the number
of places in the genome where the peptide is encoded for a peptide/spectrum match.
- The modificationMass reflects the
additional molecular weight for each modified peptide matched to a spectrum (this
field is for PTM peptides only). On the item details pages, modificationMass is labeled:
Reflects the additional molecular weight for each modified peptide matched to a spectrum.
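The Score rescaling described above can be sketched as follows. This is an illustration of the stated formula, not Peppy's code: the function and parameter names are hypothetical, the two cutoff values must be supplied by the caller, and the use of truncation for the integer conversion is an assumption (the description only says the value is "converted to an integer").

```python
def display_score(raw_score, raw_at_fdr10, raw_at_fdr0):
    """Rescale a rawScore to the 0-1000 range used for shading.

    raw_at_fdr10: rawScore at the 10% FDR cutoff (maps to 0).
    raw_at_fdr0:  rawScore at the near-0% FDR cutoff (maps to 1000).
    """
    if raw_at_fdr0 <= raw_at_fdr10:
        raise ValueError("near-0% FDR cutoff must exceed the 10% FDR cutoff")
    scaled = (raw_score - raw_at_fdr10) / (raw_at_fdr0 - raw_at_fdr10) * 1000
    # Truncation is an assumption; values outside the cutoffs are clamped
    # so that they score 0 (worst) or 1000 (best), as described above.
    return max(0, min(1000, int(scaled)))
```

With hypothetical cutoffs of 10 (10% FDR) and 20 (near 0% FDR), a rawScore of 15 maps to 500, anything at or above 20 maps to 1000, and anything at or below 10 maps to 0.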
Methods
ENCODE cell lines K562, GM12878, H1-hESC and H1-neurons were used for this
large scale proteomic analysis. Cell lines were cultured according to standard
ENCODE cell culture protocols
and tryptic peptides were prepared using in-gel digestion (Shevchenko et al.,
2006), FASP (Wisniewski et al., 2009; Manza et al., 2005)
or MudPIT (Washburn et al., 2001) protocols as indicated for each sample.
Tandem mass spectrometry (RPLC-MS/MS) analysis was then performed on an Eksigent
Ultra-LTQ Orbitrap system or a Q Exactive system (Thermo Scientific) as indicated.*
The number of arginine or lysine sites missed by the trypsin enzyme is indicated by
the metadata parameter miscleavages.
We performed proteogenomic mapping (Jaffe et al., 2004) on an in silico
translation and proteolytic digestion of the whole human genome (UCSC Hg19), and
the GENCODE translation of protein-coding transcripts database with up to
one missed cleavage using
Peppy software. GENCODE V11 was used for the H1-hESC (FASP protocol), K562, and
GM12878 datasets, and GENCODE V10 for the H1-hESC (MudPIT protocol) and H1-neurons
datasets. GENCODE V11 was used for the initial database searches; GENCODE V10 was
later found to be the preferred version and replaced V11 in the analyses of the
later datasets. Peppy's embedded algorithm matches
the MS/MS spectra to peptides and outputs a matching score, and the peptides are
then mapped back to their corresponding genomic sequences. The peptide/spectrum
matches (PSMs) found in the Hg19 genome and GENCODE searches were compared, and the
higher-scoring PSM from either search was reported. If the scores from both searches
were equal, both PSMs were reported. The GENCODE search found additional peptide
matches that were not found in the Hg19 genome search, some of which span splice
junctions. Overall, cross-comparing and combining the results from both database
searches resulted in greater coverage.
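The comparison of the two searches can be sketched as a per-spectrum merge. This is a minimal illustration under stated assumptions, not the pipeline's code: the `PSM` record, its field names, and `merge_psms` are hypothetical, and ties are detected by exact score equality.

```python
from typing import NamedTuple


class PSM(NamedTuple):
    """Hypothetical peptide/spectrum match record."""
    spectrum_id: str
    peptide: str
    raw_score: float
    source: str  # "genome" or "gencode"


def merge_psms(genome_psms, gencode_psms):
    """Keep the higher-scoring PSM per spectrum; keep both on a tie."""
    best = {}
    for psm in list(genome_psms) + list(gencode_psms):
        kept = best.get(psm.spectrum_id)
        if kept is None or psm.raw_score > kept[0].raw_score:
            best[psm.spectrum_id] = [psm]   # new best match for this spectrum
        elif psm.raw_score == kept[0].raw_score:
            kept.append(psm)                # tie: report both matches
    return best
```

The result maps each spectrum identifier to a list that holds one PSM in the usual case and several when the scores tie. Spectra matched in only one of the two searches are carried through unchanged, which is how the combined result gains coverage.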
For both the Hg19 genome and GENCODE database searches, a blind search for
post-translational modifications (PTMs) was performed using Peppy software.
In a blind PTM search, when Peppy matches an MS/MS spectrum to a peptide, if the
matching score increases after the addition of the molecular weight (MW) of a
potential PTM, the peptide is reported as carrying a PTM. In the output of both the
Hg19 genome and GENCODE searches, some spectra were matched to peptides
with PTMs and others were matched to regular peptides, i.e., peptides
without PTMs. Once the best-ranking PSMs were identified from either search, the
regular peptides and peptides with PTMs were displayed in separate tracks.
For each data set, a reverse database search was also performed using all spectra
to calculate the false discovery rate (FDR) (Elias and Gygi, 2007). Only
those matches with an FDR below 5% were included in this track. The
unfiltered results, comprising peptide matches with an FDR below 10%, are available
for download.
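A reverse (decoy) database search supports a simple FDR estimate, sketched below. The exact estimator used for this track is not stated here, so the ratio of decoy hits to target hits at a given score threshold is an assumption, as are the function names and the choice of the lowest qualifying threshold.

```python
def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimate FDR as decoy hits / target hits at or above the threshold."""
    targets = sum(s >= threshold for s in target_scores)
    decoys = sum(s >= threshold for s in decoy_scores)
    return decoys / targets if targets else 0.0


def score_cutoff(target_scores, decoy_scores, max_fdr=0.05):
    """Return the lowest observed target score with estimated FDR below max_fdr.

    Returns None if no threshold qualifies. The estimate is not guaranteed to
    be monotonic in the threshold; taking the first qualifying value is a
    simplification.
    """
    for t in sorted(set(target_scores)):
        if estimate_fdr(target_scores, decoy_scores, t) < max_fdr:
            return t
    return None
```

Matches from the forward search scoring at or above `score_cutoff(..., max_fdr=0.05)` would correspond to the filtered view, and the same function with `max_fdr=0.10` to the unfiltered download, under the assumptions above.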
*H1-hESC (FASP protocol), K562 and GM12878 samples were analyzed on the Eksigent
Ultra LTQ Orbitrap system (Thermo Scientific), whereas H1-hESC (MudPIT protocol) and
H1-neurons samples were analyzed on the Q Exactive system (Thermo Scientific).
Release Notes
This is Release 1 of this track (Sept 2012). Unlike other ENCODE data, these data are archived at Proteome Commons rather than at GEO. The first 32 digits of the Tranche Hash for each data set are stored as the labExpId.
Credits
Proteogenomic mapping: Dr. John Wrobel, Dr. Jainab Khatun, Mr. Brian Risk,
and Mr. David Thomas (Giddings Lab).
Proteomic analysis: Dr. Yanbao Yu, Dr. Harsha Gunawardena, Dr. Ling Xie and
Ms. Li Wang (Chen Lab).
Main Contact:
John Wrobel
References
Giddings MC, Shah AA, Gesteland R, Moore B.
Genome-based peptide fingerprint scanning.
Proc Natl Acad Sci U S A. 2003 Jan 7;100(1):20-5.
Shevchenko A, Tomas H, Havlis J, Olsen JV, Mann M.
In-gel digestion for mass spectrometric characterization of proteins and proteomes.
Nat Protoc. 2006;1(6):2856-60.
Wisniewski JR, Zougman A, Nagaraj N, Mann M.
Universal sample preparation method for proteome analysis.
Nat Methods. 2009 May;6(5):359-62.
Manza LL, Stamer SL, Ham AJ, Codreanu SG, Liebler DC.
Sample preparation and digestion for proteomic analyses using spin filters.
Proteomics. 2005 May;5(7):1742-5.
Washburn MP, Wolters D, Yates JR 3rd.
Large-scale analysis of the yeast proteome by multidimensional protein identification
technology.
Nat Biotechnol. 2001 Mar;19(3):242-7.
Jaffe JD, Berg HC, Church GM.
Proteogenomic mapping as a complementary method to perform genome annotation.
Proteomics. 2004 Jan;4(1):59-77.
Elias JE, Gygi SP.
Target-decoy search strategy for increased confidence in large-scale protein identifications by mass
spectrometry.
Nat Methods. 2007 Mar;4(3):207-14.
Data Release Policy
Data users may freely use ENCODE data, but may not, without prior
consent, submit publications that use an unpublished ENCODE dataset
until nine months following the release of the dataset. This date is listed
in the Restricted Until column above. The full data release policy for ENCODE is available
here.