The UCSC Genome Browser aims to support researchers operating in the cloud. Topics such as data access on the cloud, including details about the API and Amazon s3://genome-browser bucket, software installations for cloud computing, and references to helpful tools are discussed on this page.
S3 stands for Simple Storage Service, and it is the name for cloud storage in Amazon Web Services (AWS). The data available through S3 is essentially stored in a folder called a bucket, and files are called objects. The s3://genome-browser bucket is a copy of the main data available on our UCSC Genome Browser Download website: https://hgdownload.soe.ucsc.edu/downloads.html
By placing our Download server files in an S3 bucket, developers working in the cloud can more easily integrate with UCSC data. To learn more about how S3 object-based storage works, and about its advantages of worldwide accessibility with low latency and high durability, see Amazon's S3 documentation.
The data mirrors our UCSC Genome Browser Download website's main rsync directories:
Directory description | S3 location |
---|---|
UCSC Human Golden Path Downloads | s3://genome-browser/goldenPath |
UCSC Human Genome Browser Gbdb Data Files | s3://genome-browser/gbdb |
UCSC Human Genome Raw Mysql Tables | s3://genome-browser/mysql |
UCSC Human Genome Web Site CGI Binaries | s3://genome-browser/cgi-bin |
UCSC Human Genome Web Site Htdocs | s3://genome-browser/htdocs |
Some example files in the bucket:
- goldenPath/hg38/bigZips/README.txt: the README.txt, also available on the Download website, notes that the most recent patch-inclusive sequence is found in goldenPath/hg38/bigZips/latest/.
- gbdb/hg38/hg38.2bit: this file matches the one in the goldenPath/hg38/bigZips/latest/ directory, reflecting how these files are operated on by the UCSC Genome Browser software in order to display assembly sequence when browsing.
- htdocs/goldenPath/pubs.html: this page lists our publications.
Amazon provides the AWS Command Line Interface (AWS CLI), which includes commands such as sync. Here is an example of downloading an AWS bucket to the current directory with the CLI:
aws s3 sync s3://bucket-name .
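As a rough sketch of how the generic command above might be applied to the s3://genome-browser bucket, the helper functions below wrap aws s3 cp and aws s3 sync. The function names are hypothetical, and the --no-sign-request option assumes the bucket allows anonymous reads; drop it if you would rather use your own AWS credentials.

```shell
# Bucket name taken from this page. Anonymous access is an assumption.
BUCKET="s3://genome-browser"

# Print the S3 URI for a path relative to the bucket root.
# A single leading slash is tolerated and removed.
s3_uri() {
    printf '%s/%s\n' "$BUCKET" "${1#/}"
}

# Copy one object into the current directory.
fetch_object() {
    aws s3 cp "$(s3_uri "$1")" . --no-sign-request
}

# Show what syncing one directory would transfer, without downloading.
# Remove --dryrun to perform the actual sync.
preview_sync() {
    aws s3 sync "$(s3_uri "$1")" "./$(basename "$1")" --dryrun --no-sign-request
}
```

For example, fetch_object goldenPath/hg38/bigZips/README.txt would copy the hg38 README, and preview_sync goldenPath/hg38/bigZips would list the files a sync of that directory would download. Syncing a single subdirectory is usually preferable to syncing the whole bucket, which is very large.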
The data is also available over HTTP at genome-browser.s3-website-us-east-1.amazonaws.com, where individual files can be accessed:
- goldenPath/ Downloads directory: http://genome-browser.s3-website-us-east-1.amazonaws.com/goldenPath/hg38/bigZips/README.txt
- gbdb/ binary data directory, for the human hg38 assembly 2bit file: http://genome-browser.s3-website-us-east-1.amazonaws.com/gbdb/hg38/hg38.2bit
- htdocs/ hypertext document directory: http://genome-browser.s3-website-us-east-1.amazonaws.com/htdocs/goldenPath/pubs.html
The UCSC Genome Browser has a REST API for the programmatic extraction of data. REST is an acronym for REpresentational State Transfer and API stands for Application Programming Interface. Read more on the help page: http://genome.ucsc.edu/goldenPath/help/api.html
The REST API returns data in JavaScript Object Notation (JSON) format, which can easily be sent between computers, and used by many different programming languages.
Data can be accessed with this URL: https://api.genome.ucsc.edu/. By adding different endpoint functions such as /list/ or /getData/, specific results can be obtained.
wget -O- 'https://api.genome.ucsc.edu/list/publicHubs'
wget -O- 'https://api.genome.ucsc.edu/getData/sequence?genome=hg38;chrom=chrM;start=4321;end=5678'
With different endpoint functions such as /list/ or /getData/, URLs can be constructed to pull specific results.
Endpoint function | Required | Optional |
---|---|---|
/list/publicHubs | (none) | (none) |
/list/ucscGenomes | (none) | (none) |
/list/hubGenomes | hubUrl | (none) |
/list/tracks | genome or (hubUrl and genome) | trackLeavesOnly=1 |
/list/chromosomes | genome or (hubUrl and genome) | track |
/list/schema | (genome or (hubUrl and genome)) and track | (none) |
/getData/sequence | (genome or (hubUrl and genome)) and chrom | start and end |
/getData/track | (genome or (hubUrl and genome)) and track | chrom, (start and end), maxItemsOutput, jsonOutputArrays |
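To illustrate how the table translates into request URLs, here is a minimal sketch of a shell function that joins an endpoint function with key=value parameters. The function name ucsc_api_url is hypothetical; the semicolon separator follows the wget example on this page.

```shell
API_BASE="https://api.genome.ucsc.edu"

# ucsc_api_url ENDPOINT [key=value ...]
# Prints a request URL. The first parameter is introduced with '?',
# later ones are joined with ';' as in the example on this page.
ucsc_api_url() {
    endpoint=$1
    shift
    url="$API_BASE/${endpoint#/}"
    sep='?'
    for param in "$@"; do
        url="$url$sep$param"
        sep=';'
    done
    printf '%s\n' "$url"
}
```

For instance, wget -O- "$(ucsc_api_url /getData/track genome=hg38 track=gold chrom=chrM maxItemsOutput=5)" would request at most five items of a track on chrM. The track name gold is only an assumed example; use /list/tracks to see which tracks a genome actually offers. The function does no URL-encoding, so values such as a hubUrl with special characters may need encoding first.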
You can learn more about using the API to extract data by reviewing the example data access URLs, which demonstrate the list and getData functions, and the further practical example URLs for extracting specific track data items.
The UCSC Genome Browser Download website, hgdownload.soe.ucsc.edu, is the source of the data hosted in the Amazon s3://genome-browser bucket. It can be viewed in a web browser to access specific download files, or the data can be copied with rsync commands.
For instance, the following rsync command will show you the various rsync directories available on our Download server:
$ rsync -a -P rsync://hgdownload.soe.ucsc.edu/
genome          UCSC Human Genome Downloads
sars            UCSC Human Genome SARS Downloads
htdocs          UCSC Human Genome Web Site Htdocs
goldenPath      UCSC Human Golden Path Downloads
cgi-bin         UCSC Human Genome Web Site CGI Binaries x86_64
cgi-bin-i386    UCSC Human Genome Web Site CGI Binaries i386
gbdb            UCSC Human Genome Browser Gbdb Config Files
archives        UCSC Human Genome Browser Archived Config Files
mysql           UCSC Human Genome Raw Mysql Tables
gbib            UCSC Genome Browser in a Box
hubs            UCSC Genome Browser Public Hubs
goldenPath/ Downloads directory:
rsync -a -P rsync://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/README.txt ./
gbdb/ binary data directory, for the human hg38 assembly 2bit file:
rsync -a -P rsync://hgdownload.soe.ucsc.edu/gbdb/hg38/hg38.2bit ./
htdocs/ hypertext document directory:
rsync -a -P rsync://hgdownload.soe.ucsc.edu/htdocs/goldenPath/pubs.html ./
Many of these rsync directories exist to support the Genome Browser in a Cloud (GBiC) and the Genome Browser in a Box (GBiB) software products discussed below.
Also note that there is a mirror of the download server available in Europe, so the above rsync commands can also be pointed to the hgdownload-euro location.
rsync -a -P rsync://hgdownload-euro.soe.ucsc.edu/gbdb/hg38/hg38.2bit ./
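Switching between the two download servers can be wrapped in a small helper. The sketch below uses a hypothetical function name, rsync_url, and treats any mirror argument other than euro as the main server.

```shell
# rsync_url PATH [us|euro]
# Prints the rsync URL for PATH on the chosen download server.
# The host names are the two given on this page.
rsync_url() {
    case "${2:-us}" in
        euro) host="hgdownload-euro.soe.ucsc.edu" ;;
        *)    host="hgdownload.soe.ucsc.edu" ;;
    esac
    printf 'rsync://%s/%s\n' "$host" "${1#/}"
}
```

With this in place, rsync -a -P "$(rsync_url gbdb/hg38/hg38.2bit euro)" ./ is equivalent to the European command above. Adding rsync's --dry-run option previews a transfer without copying anything, which can be useful before pulling a large directory.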
The UCSC Genome Browser uses MariaDB (a fork of MySQL) as the backend database server and maintains a public server at genome-mysql.soe.ucsc.edu to allow direct queries. Some example queries follow.
Selecting from the trackDb table of the hg38 database all the entries in the group (grp) "genes" and ordering those entries by tableName:
mysql -h genome-mysql.soe.ucsc.edu -u genome -NBe 'select tableName from trackDb where grp = "genes" order by tableName' hg38
Querying the wgEncodeRegTfbsClusteredV3 table on the human hg19 assembly and selecting entries from a 500 base pair region on chr1:
mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -Ne 'select chrom,chromStart,chromEnd,name,score
from wgEncodeRegTfbsClusteredV3 where chrom = "chr1" and chromStart > 10000 and chromEnd < 10500;' hg19
Selecting lncRNA transcripts from the wgEncodeGencodeBasicV39 table on the hg38 genome:
mysql -u genome -h genome-mysql.soe.ucsc.edu hg38 -e 'select g.name,a.transcriptType from wgEncodeGencodeBasicV39 g,
wgEncodeGencodeAttrsV39 a where (g.name = a.transcriptId) and (a.transcriptType = "lncRNA");'
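The region query above returns only items that lie entirely inside the window. As an illustrative alternative, the sketch below builds a query for items that overlap a window, using the usual test for zero-based, half-open intervals (chromStart < end and chromEnd > start). The function name region_sql is hypothetical, and the column list assumes a BED-like table with name and score fields, which not every table has.

```shell
# region_sql TABLE CHROM START END
# Prints a SELECT statement for items overlapping [START, END).
region_sql() {
    printf 'select chrom,chromStart,chromEnd,name,score from %s where chrom = "%s" and chromStart < %d and chromEnd > %d\n' \
        "$1" "$2" "$4" "$3"
}
```

It could then be used with the same client options as above: mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -Ne "$(region_sql wgEncodeRegTfbsClusteredV3 chr1 10000 10500)" hg19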
See the Downloading Data using MariaDB (MySQL) help page for more information. Also, there is a mirror of the MariaDB server available in Europe, so commands can also be pointed to the genome-euro-mysql location.
mysql -h genome-euro-mysql.soe.ucsc.edu -u genome -NBe 'show tables' hg38
To replicate, or mirror, the software of the UCSC Genome Browser in another location we offer the Genome Browser in a Cloud (GBiC) and the Genome Browser in a Box (GBiB) software products.
The GBiC is an installation script that automates the setup of a UCSC Genome Browser mirror including setting up MariaDB and Apache servers. The program downloads and configures MySQL and Apache, and then downloads the UCSC Genome Browser software to /usr/local/apache to make a local instance of the Browser.
The GBiB is a small virtual machine version of the UCSC Genome Browser that can be run on a laptop or desktop computer. It requires an installation of a compatible version of the VirtualBox Software, and will then access annotation data on demand through the Internet from UCSC as used, or selective data can be downloaded for faster access.
The GBiB and GBiC software tools draw on the Download server to rsync data and, in certain circumstances, on the MySQL server to extract coordinate-specific table data.
See the individual support pages for the GBiC and the GBiB for detailed information about how to install and operate both. You can get either the GBiC or the GBiB from the UCSC Genome Browser store free for non-commercial use.
We also support a Dockerfile that, in essence, points to the GBiC installation script. While we recommend our GBiC script, we understand many people are more familiar with working through Docker, so we provide Docker installation instructions.
Please note that, as with the GBiB and GBiC available in the UCSC Genome Browser store, usage of our mirror software is free for non-commercial use. Any commercial usage, including through the Docker image, requires a license.
A lot of our data is stored in a binary indexed version called bigBed. This format saves space and also allows the extraction of information based on the first three fields (chrom, chromStart, chromEnd), which define annotation coordinate location.
To pull information out of bigBed files there is a tool called bigBedToBed. By running the command by itself you can see the command options.
bigBedToBed v1 - Convert from bigBed to ascii bed format.
usage:
   bigBedToBed input.bb output.bed
options:
   -chrom=chr1 - if set restrict output to given chromosome
   -start=N - if set, restrict output to only that over start
   -end=N - if set, restict output to only that under end
   -maxItems=N - if set, restrict output to first N items
   -udcDir=/dir/to/cache - place to put cache for remote bigBed/bigWigs
   -header - output a autoSql-style header (starts with '#').
Another similar tool is available to extract data from the binary indexed 2bit sequence storage format. The tool twoBitToFa can be given coordinate ranges, and the DNA can be extracted from the file.
twoBitToFa -seq=chr1 -start=1234500 -end=1234600 http://genome-browser.s3-website-us-east-1.amazonaws.com/gbdb/hg38/hg38.2bit stdout
>chr1:1234500-1234600
GCGTCCCTAGGTCAGGCCGTTGAGTTCGAGCTCCGATGGGCCACCTTGAA
TCCAGGACTGACCGCCCGTGTGTGCACAGTTTGTTCTTGGACGAGGACTC
bigBedToBed -chrom=chr1 -start=190000 -end=200000 http://genome-browser.s3-website-us-east-1.amazonaws.com/gbdb/hg38/encode3/ccre/encodeCcreCombined.bb stdout | head
chr1 190865 191071 EH38E1310154 179 . 190865 191071 255,205,0 dELS,CTCF-bound dELS 1.79282201562 enhD E1310154 EH38E1310154 distal enhancer-like signature
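One detail worth illustrating is the coordinate convention. The -start value of twoBitToFa is zero-based and -end is non-inclusive, whereas positions shown in the Genome Browser are one-based and inclusive. The sketch below converts a browser-style position into twoBitToFa options. The function name pos_to_twobit_args is hypothetical, and the input is assumed to contain no thousands separators.

```shell
# pos_to_twobit_args CHROM:START-END
# Converts a one-based, inclusive position (as displayed in the browser)
# into zero-based, half-open twoBitToFa options.
pos_to_twobit_args() {
    chrom=${1%%:*}      # text before the first ':'
    range=${1#*:}       # text after the first ':'
    start=${range%-*}   # number before the '-'
    end=${range#*-}     # number after the '-'
    printf '%s\n' "-seq=$chrom -start=$((start - 1)) -end=$end"
}
```

For example, twoBitToFa $(pos_to_twobit_args chr1:1234501-1234600) http://genome-browser.s3-website-us-east-1.amazonaws.com/gbdb/hg38/hg38.2bit stdout requests the same 100 bases as the twoBitToFa command above. The command substitution is left unquoted on purpose so that the three options are passed as separate arguments.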
The Amazon ecosystem comes integrated with a collection of systems such as CloudFront, CloudWatch, Relational Database Service (RDS), Elastic Block Store (EBS), Lambda, and Aurora. Amazon Aurora is a MySQL and PostgreSQL-compatible relational database built for the cloud. The UCSC Genome Browser's tableName.MYD and tableName.MYI files can be used with Aurora instead of installing MariaDB. However, there may be some service costs in Amazon for using Aurora.