Description
The UCSC Genes track is a set of gene predictions based on data from RefSeq, GenBank, CCDS,
Rfam, and the tRNA Genes track. The track includes both protein-coding genes and
non-coding RNA genes. Both types of genes can produce non-coding transcripts, but non-coding
RNA genes do not produce protein-coding transcripts. This is a moderately conservative set of
predictions. Transcripts of protein-coding genes require the support of one RefSeq RNA, or one
GenBank RNA sequence plus at least one additional line of evidence. Transcripts of non-coding RNA
genes require the support of one Rfam or tRNA prediction. Compared to RefSeq, this gene set
generally has about 10% more protein-coding genes, approximately four times as many putative non-coding
genes, and about twice as many splice variants.
For more information on the different gene tracks, see our Genes FAQ.
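The support requirements described above can be written as a small predicate. This is only a minimal sketch of the stated rule, not the pipeline's actual code; the `TranscriptEvidence` class and its field names are hypothetical and were introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class TranscriptEvidence:
    """Hypothetical evidence summary for one candidate transcript."""
    protein_coding_gene: bool    # True if the transcript belongs to a protein-coding gene
    refseq_rnas: int = 0         # number of supporting RefSeq RNAs
    genbank_rnas: int = 0        # number of supporting GenBank RNA sequences
    other_evidence: int = 0      # additional lines of evidence beyond the RNA
    rfam_or_trna: int = 0        # number of supporting Rfam or tRNA predictions


def is_supported(t: TranscriptEvidence) -> bool:
    """Return True if the transcript meets the support rule stated in the text."""
    if t.protein_coding_gene:
        # One RefSeq RNA is sufficient on its own.
        if t.refseq_rnas >= 1:
            return True
        # Otherwise: one GenBank RNA plus at least one additional line of evidence.
        return t.genbank_rnas >= 1 and t.other_evidence >= 1
    # Non-coding RNA genes: one Rfam or tRNA prediction.
    return t.rfam_or_trna >= 1
```

For example, a protein-coding transcript backed by a single GenBank RNA and nothing else would not pass this check, while the same transcript with one extra line of evidence would.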
Display Conventions and Configuration
This track generally follows the display conventions for gene prediction tracks. The exons
for putative non-coding genes and untranslated regions are represented by relatively thin blocks,
while those for coding open reading frames are thicker. The following color key is used:
- Black -- feature has a corresponding entry in the Protein Data Bank (PDB)
- Dark blue -- transcript has been reviewed or validated by the RefSeq, SwissProt, or CCDS staff
- Medium blue -- other RefSeq transcripts
- Light blue -- non-RefSeq transcripts
This track contains an optional codon coloring feature that allows users to quickly validate and
compare gene predictions.
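The color key above can be read as a simple decision rule. The sketch below assumes that the colors are applied in the order listed (a PDB entry takes precedence over review status, and so on); the text does not state the precedence explicitly, and the function and parameter names are hypothetical.

```python
def transcript_color(has_pdb_entry: bool, reviewed: bool, is_refseq: bool) -> str:
    """Map transcript properties to a display color.

    Assumption: the key is evaluated top to bottom, so the first matching
    condition wins. This ordering is not confirmed by the track description.
    """
    if has_pdb_entry:
        return "black"        # corresponding entry in the Protein Data Bank
    if reviewed:
        return "dark blue"    # reviewed or validated by RefSeq, SwissProt, or CCDS staff
    if is_refseq:
        return "medium blue"  # other RefSeq transcripts
    return "light blue"       # non-RefSeq transcripts
```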
Methods
The UCSC Genes are built using a multi-step pipeline:
- RefSeq and GenBank RNAs are aligned to the genome with BLAT, keeping only the best alignments
for each RNA. Alignments are discarded if they do not meet certain sequence identity and coverage
filters. All sequences must align with high (98%) identity. The sequence coverage must be at least
90% for shorter sequences (those with 2500 or fewer bases), with the coverage threshold
progressively relaxed for longer sequences.
- Alignments are broken up at non-intronic gaps, with small isolated fragments thrown out.
- A splicing graph is created for each set of overlapping alignments. This graph has an edge
for each exon or intron, and a vertex for each splice site, start, and end. Each RNA that
contributes to an edge is kept as evidence for that edge. Gene models from the Consensus CDS project
(CCDS) are also added to the graph.
- A similar splicing graph is created in the mouse, based on mouse RNA and ESTs. If the mouse
graph has an edge that is orthologous to an edge in the human graph, that is added to the evidence
for the human edge.
- If an edge in the splicing graph is supported by two or more human ESTs, those ESTs are added as
evidence for the edge.
- If there is an Exoniphy prediction for an exon, that is added as evidence.
- The graph is traversed to generate all unique transcripts. The traversal is guided by the
initial RNAs to avoid a combinatorial explosion in alternative splicing. All RefSeq transcripts are
output. For other multi-exon transcripts to be output, an edge supported by at least one additional
line of evidence beyond the RNA is required. Single-exon genes require either two RNAs or two
additional lines of evidence beyond the single RNA.
- Alignments are merged in from the hg19 tRNA Genes track and from Rfam in regions that are
syntenic with the mm9 mouse genome.
- Protein predictions are generated. For non-RefSeq transcripts we use the txCdsPredict program to
determine if the transcript is protein-coding, and if so, the locations of the start and stop codons.
The program weighs as positive evidence the length of the protein, the presence of a Kozak consensus
sequence at the start codon, and the length of the orthologous predicted protein in other species.
As negative evidence it considers nonsense-mediated decay and start codons in any frame upstream of
the predicted start codon. For RefSeq transcripts the RefSeq protein prediction is used directly
instead of this procedure. For CCDS proteins the CCDS protein is used directly.
- The corresponding UniProt protein is found, if any.
- The transcript is assigned a permanent "uc" accession. If the transcript was not in
the previous release of UCSC Genes, the accession ends with the suffix ".1" indicating
that this is the first version of this transcript. If the transcript is identical to some transcript in the
previous release of UCSC Genes, the accession is re-used with the same version number. If the
transcript is not identical to any transcript in the previous release but it overlaps a similar
transcript with a compatible structure, the previous accession is re-used with the version number
incremented.
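The accession versioning rule in the final step can be sketched as follows. This is an illustration of the three cases described above, not the pipeline's implementation; the function name, its parameters, and the example accessions are assumptions, and how a "compatible structure" is detected is left to the caller.

```python
from typing import Optional


def assign_accession(new_base: str,
                     previous_accession: Optional[str],
                     identical: bool) -> str:
    """Return the versioned accession for a transcript.

    new_base           -- unversioned accession to use if the transcript is new
    previous_accession -- versioned accession (e.g. "uc001aaa.3") of the matching
                          transcript in the previous release, or None if there is
                          neither an identical nor a compatible match
    identical          -- True if the transcript is identical to that match
    """
    if previous_accession is None:
        # Not in the previous release: first version.
        return f"{new_base}.1"
    base, _, version_text = previous_accession.rpartition(".")
    version = int(version_text)
    if identical:
        # Identical transcript: re-use accession and version unchanged.
        return f"{base}.{version}"
    # Similar, compatible transcript: re-use accession, increment version.
    return f"{base}.{version + 1}"
```

With these illustrative inputs, `assign_accession("uc010zzz", None, False)` gives `"uc010zzz.1"`, while a changed but compatible transcript matched to `"uc001aaa.3"` gives `"uc001aaa.4"`.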
Related Data
The UCSC Genes transcripts are annotated in numerous tables, each of which is also available as a
downloadable file. These include tables that link UCSC Genes transcripts to external datasets (such as
knownToLocusLink, which maps UCSC Genes transcripts to Entrez identifiers, previously known
as Locus Link identifiers), and tables that detail some property of UCSC Genes transcript sequences
(such as knownToPfam, which identifies any Pfam domains found in the UCSC Genes
protein-coding transcripts). One can see a full list of the associated tables in the
Table Browser by selecting UCSC Genes in the track menu;
this list is then available in the table menu. Note that some of these tables refer to UCSC
Genes by its former name of Known Genes, sometimes abbreviated as known or kg.
While the complete set of annotation tables is too long to describe, some of the more important
tables are described below.
- kgXref identifies the RefSeq, SwissProt, Rfam, or tRNA sequences (if any) on which each
transcript was based.
- knownToRefSeq identifies the RefSeq transcript that each UCSC Genes transcript is most
closely associated with. That RefSeq transcript is either the RefSeq on which the UCSC Genes
transcript was based, if there is one, or the RefSeq transcript that the UCSC Genes transcript
overlaps at the most bases.
- knownGeneMrna contains the mRNA sequence that represents each UCSC Genes transcript. If
the transcript is based on a RefSeq transcript, then this table contains the RefSeq transcript,
including any portions that do not align to the genome.
- knownGeneTxMrna contains mRNA sequences for each UCSC Genes transcript. In contrast to
the sequences in knownGeneMrna, these sequences are derived by obtaining the sequences for each exon
from the reference genome and concatenating these exonic sequences.
- knownGenePep contains the protein sequences derived from the knownGeneMrna transcript
sequences. Any protein-level annotations, such as the contents of the knownToPfam table, are based
on these sequences.
- knownGeneTxPep contains the protein translation (if any) of each mRNA sequence in
knownGeneTxMrna.
- knownIsoforms maps each transcript to a cluster ID, which identifies a cluster of isoforms of
the same gene.
- knownCanonical identifies the canonical isoform of each cluster ID, or gene. Generally,
this is the longest isoform.
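The relationship between knownIsoforms and knownCanonical can be illustrated with a short sketch. It assumes the simplest reading of "generally, this is the longest isoform": group transcripts by cluster ID and keep the longest one. The real knownCanonical table may depart from this rule, and the input dictionaries, the tie-breaking choice, and the example accessions are hypothetical.

```python
from typing import Dict


def pick_canonical(isoform_cluster: Dict[str, int],
                   transcript_length: Dict[str, int]) -> Dict[int, str]:
    """Choose one transcript per cluster ID, preferring the longest.

    isoform_cluster   -- transcript accession -> cluster ID (as in knownIsoforms)
    transcript_length -- transcript accession -> length in bases
    Ties keep the transcript seen first (an arbitrary choice for this sketch).
    """
    canonical: Dict[int, str] = {}
    for transcript, cluster_id in isoform_cluster.items():
        current = canonical.get(cluster_id)
        if current is None or transcript_length[transcript] > transcript_length[current]:
            canonical[cluster_id] = transcript
    return canonical


# Illustrative data only; these accessions and lengths are made up.
isoforms = {"uc001aaa.3": 1, "uc010nxq.1": 1, "uc009vis.2": 2}
lengths = {"uc001aaa.3": 1652, "uc010nxq.1": 1488, "uc009vis.2": 2071}
canonical = pick_canonical(isoforms, lengths)
```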
Data Access
UCSC Genes (knownGene for hg19) can be explored interactively using the REST API, the
Table Browser, or the Data Integrator.
The genePred files for hg19 are available in our downloads directory, or in GTF format in our
genes downloads directory.
All the tables can also be queried directly from our public MySQL servers. Information on
accessing this data through MySQL can be found on our help page as well as on our blog.
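As a rough illustration of REST API access, the sketch below builds a request URL for knownGene items in a region of hg19. The host name, the `/getData/track` endpoint, and the parameter names are not given on this page; they are assumptions based on the public UCSC REST API documentation and should be checked there before use. The function name and the example coordinates are hypothetical, and the code only constructs the URL without contacting the server.

```python
from urllib.parse import quote

# Assumed base address of the UCSC REST API (not stated in this document).
API_BASE = "https://api.genome.ucsc.edu"


def known_gene_url(chrom: str, start: int, end: int,
                   genome: str = "hg19", track: str = "knownGene") -> str:
    """Build a track-data request URL for one genomic region."""
    params = [
        ("genome", genome),
        ("track", track),
        ("chrom", chrom),
        ("start", start),
        ("end", end),
    ]
    # Parameters are joined with semicolons, following the API's documented examples.
    query = ";".join(f"{key}={quote(str(value), safe='')}" for key, value in params)
    return f"{API_BASE}/getData/track?{query}"


url = known_gene_url("chr1", 11868, 14409)
```

The resulting URL can be opened in a browser or fetched with any HTTP client; the response is expected to be JSON.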
Credits
The UCSC Genes track was produced at UCSC using a computational pipeline developed by Jim Kent,
Chuck Sugnet, Melissa Cline and Mark Diekhans. It is based on data from NCBI RefSeq, UniProt
(including TrEMBL and TrEMBL-NEW), CCDS, and GenBank, as well as data from Rfam and
the Todd Lowe lab.
Our thanks to the people running these databases and to the scientists worldwide who have made
contributions to them.
References
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL.
GenBank: update.
Nucleic Acids Res. 2004 Jan 1;32(Database issue):D23-6.
PMID: 14681350; PMC: PMC308779
Chan PP, Lowe TM.
GtRNAdb: a database of transfer RNA genes detected in genomic sequence.
Nucleic Acids Res. 2009 Jan;37(Database issue):D93-7.
PMID: 18984615; PMC: PMC2686519
Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, Finn RD, Nawrocki EP, Kolbe DL,
Eddy SR et al.
Rfam: Wikipedia, clans and the "decimal" release.
Nucleic Acids Res. 2011 Jan;39(Database issue):D141-5.
PMID: 21062808; PMC: PMC3013711
Hsu F, Kent WJ, Clawson H, Kuhn RM, Diekhans M, Haussler D.
The UCSC Known Genes.
Bioinformatics. 2006 May 1;22(9):1036-46.
PMID: 16500937
Kent WJ.
BLAT - the BLAST-like alignment tool.
Genome Res. 2002 Apr;12(4):656-64.
PMID: 11932250; PMC: PMC187518
Lowe TM, Eddy SR.
tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence.
Nucleic Acids Res. 1997 Mar 1;25(5):955-64.
PMID: 9023104; PMC: PMC146525
UniProt Consortium.
Reorganizing the protein space at the Universal Protein Resource (UniProt).
Nucleic Acids Res. 2012 Jan;40(Database issue):D71-5.
PMID: 22102590; PMC: PMC3245120