Problematic Regions GIAB Problematic Regions Track Settings
 
Difficult regions from GIAB via NCBI

Track collection: Problematic/special genomic regions for sequencing or very variable regions

Subtracks:

  • All difficult regions (Genome In a Bottle: all difficult regions)
  • LowMap+SegDup (Genome In a Bottle: lowMap+SegDup regions)
  • Not difficult regions (Genome In a Bottle: not difficult regions)
  • Not lowMap+SegDup (Genome In a Bottle: not lowMap+SegDup mapping regions)

Assembly: Human Dec. 2013 (GRCh38/hg38)


Note: November 4, 2024

Description

This container track helps call out sections of the genome that often cause problems or confusion when working with the genome. The hg19 genome has a track with the same name, but with many more subtracks, as the GeT-RM and Genome-in-a-Bottle artifact variants do not exist yet for hg38, to our knowledge. If you are missing a track here that you know from hg19 and have an idea how to add it to hg38, do not hesitate to contact us.

Problematic Regions

The Problematic Regions track contains the following subtracks:

  • The UCSC Unusual Regions subtrack contains annotations collected at UCSC, put together from other tracks, our experiences and support email list requests over the years. For example, it contains the most well-known gene clusters (IGH, IGL, PAR1/2, TCRA, TCRB, etc) and annotations for the GRC fixed sequences, alternate haplotypes, unplaced contigs, pseudo-autosomal regions, and mitochondria. These loci can yield alignments with low-quality mapping scores and discordant read pairs, especially for short-read sequencing data. This data set was manually curated, based on the Genome Browser's assembly description, the FAQs about assembly, and the NCBI RefSeq "other" annotations track data.
  • The ENCODE Blacklist subtrack contains a comprehensive set of regions which are troublesome for high-throughput Next-Generation Sequencing (NGS) aligners. These regions tend to have a very high ratio of multi-mapping to unique mapping reads and high variance in mappability due to repetitive elements such as satellite, centromeric and telomeric repeats.
  • The GRC Exclusions subtrack contains a set of regions that have been flagged by the GRC as containing false duplications or contamination sequences. The GRC has now removed these sequences from the files that it uses to generate the reference assembly; however, removing the sequences from the GRCh38/hg38 assembly would trigger the next major release of the human assembly. To help users recognize these regions and avoid them in their analyses, the GRC has produced a masking file to be used as a companion to GRCh38, and the BED file is available from the GenBank FTP site.

Highly Reproducible Regions

The Highly Reproducible Regions track highlights regions and variants from eight samples that can be used to assess variant detection pipelines. The "Highly Reproducible Regions" subtrack comprises the intersection of the reproducible regions across all eight samples, while the "Variants" subtracks contain the reproducible variants from each assayed sample. Both tracks contain data from the following samples:

  • a Chinese Quartet, samples CQ-5, CQ-6, CQ-7, CQ-8
  • a HapMap Trio, samples NA10385, NA12248, NA12249
  • a Genome in a Bottle sample, NA12878

Please refer to the Pan et al. reference for more information on how these regions were defined.
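
The idea of intersecting reproducible regions across samples can be illustrated with a small interval intersection folded over all samples. This is only a minimal sketch, not the procedure from Pan et al.: the function names and the toy coordinates are hypothetical, and each sample's intervals are assumed to be half-open (BED-style), sorted, non-overlapping, and on a single chromosome.

```python
from functools import reduce

def intersect_two(a, b):
    """Intersect two sorted, non-overlapping lists of half-open (start, end) intervals."""
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval ends first; the other may still overlap the next one.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def intersect_all(per_sample):
    """Fold the pairwise intersection over every sample's interval list."""
    return reduce(intersect_two, per_sample)

# Toy data for three hypothetical samples (not real track coordinates).
samples = [
    [(100, 500), (800, 1200)],
    [(150, 600), (900, 1000)],
    [(0, 450), (950, 2000)],
]
shared = intersect_all(samples)
print(shared)
```

Only positions covered in every sample survive, so adding samples can shrink the shared set but never grow it.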

GIAB Problematic Regions

The Genome in a Bottle (GIAB) Problematic Regions tracks provide stratifications of the genome to evaluate variant calls in complex regions. They are designed for use with Global Alliance for Genomics and Health (GA4GH) benchmarking tools like hap.py and include regions with low complexity, segmental duplications, functional regions, and difficult-to-sequence areas. Developed in collaboration with GA4GH, the Genome in a Bottle (GIAB) consortium, and the Telomere-to-Telomere Consortium (T2T), the dataset aims to standardize the analysis of genetic variation by offering pre-defined BED files for stratifying true and false positives in genomic studies, facilitating accurate assessments in complex areas of the genome.
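
To make the notion of stratifying true and false positives concrete, the following minimal Python sketch splits labeled variant calls by whether they fall inside a set of BED regions and reports precision per stratum. It is not how hap.py works internally; the function names, the "TP"/"FP" labels, and the coordinates are hypothetical, positions are assumed to be 0-based, and the regions are assumed to be non-overlapping within a chromosome.

```python
from bisect import bisect_right
from collections import Counter

def build_index(regions):
    """Group half-open BED intervals (chrom, start, end) by chromosome, sorted by start."""
    index = {}
    for chrom, start, end in regions:
        index.setdefault(chrom, []).append((start, end))
    for intervals in index.values():
        intervals.sort()
    return index

def in_regions(index, chrom, pos):
    """True if 0-based pos lies in an interval (assumes non-overlapping intervals)."""
    intervals = index.get(chrom, [])
    # Last interval whose start is <= pos is the only possible match.
    k = bisect_right(intervals, (pos, float("inf"))) - 1
    return k >= 0 and intervals[k][0] <= pos < intervals[k][1]

def stratified_precision(calls, index):
    """calls: iterable of (chrom, pos, label) with label 'TP' or 'FP'."""
    counts = {"difficult": Counter(), "other": Counter()}
    for chrom, pos, label in calls:
        stratum = "difficult" if in_regions(index, chrom, pos) else "other"
        counts[stratum][label] += 1
    result = {}
    for stratum, c in counts.items():
        total = c["TP"] + c["FP"]
        result[stratum] = c["TP"] / total if total else None
    return result

# Toy regions and calls (made up for illustration).
difficult = build_index([("chr1", 1000, 2000), ("chr1", 5000, 5500)])
calls = [
    ("chr1", 1500, "TP"),
    ("chr1", 1999, "FP"),
    ("chr1", 2000, "TP"),  # BED end is exclusive, so this call is outside the region
    ("chr1", 3000, "TP"),
    ("chr2", 1500, "FP"),
]
precision = stratified_precision(calls, difficult)
print(precision)
```

Reporting metrics per stratum rather than genome-wide keeps errors concentrated in difficult regions from being averaged away by the easier bulk of the genome.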

The GIAB Problematic Regions tracks are built with a pipeline and configuration that generate stratification BED files. These files categorize genomic regions by specific challenges, such as low complexity or difficult mapping, to facilitate accurate benchmarking of variant calls. For more information on the pipeline and configuration used, please visit the following webpage: https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/release/genome-stratifications/v3.5/README.md. If you have questions or comments, please write to Justin Zook ([email protected]).

Display Conventions and Configuration

Each track contains a set of regions of varying length with no special configuration options. The UCSC Unusual Regions track has a mouse-over description; all other tracks have at most a name field, which can be shown in pack mode. The tracks are usually kept in dense mode.

The Hide empty subtracks control hides subtracks with no data in the browser window. Changing the browser window by zooming or scrolling may result in the display of a different selection of tracks.

Data access

The raw data can be explored interactively with the Table Browser or the Data Integrator.

For automated download and analysis, the genome annotation is stored in bigBed files that can be downloaded from our download server. Individual regions or the whole genome annotation can be obtained using our tool bigBedToBed, which can be compiled from the source code or downloaded as a precompiled binary for your system. Instructions for downloading source code and binaries can be found here. The tool can also be used to obtain only features within a given range, e.g.:
bigBedToBed http://hgdownload.soe.ucsc.edu/gbdb/hg38/problematic/comments.bb -chrom=chr21 -start=0 -end=100000000 stdout
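
The command above writes tab-separated BED lines to stdout. As a minimal sketch of downstream processing, the following Python snippet parses such text and sums the bases covered. The function names and the sample lines are hypothetical (the sample is made up, not real track data), and only the first three BED columns plus an optional name column are assumed.

```python
def parse_bed(text):
    """Parse tab-separated BED text into (chrom, start, end, name) tuples."""
    records = []
    for line in text.splitlines():
        # Skip blank lines and common BED header lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split("\t")
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        name = fields[3] if len(fields) > 3 else ""
        records.append((chrom, start, end, name))
    return records

def total_bases(records):
    """Sum interval lengths; BED coordinates are 0-based and half-open."""
    return sum(end - start for _, start, end, _ in records)

# Made-up example lines standing in for captured bigBedToBed output.
sample = "chr21\t100\t250\texampleA\nchr21\t1000\t1600\texampleB\n"
records = parse_bed(sample)
print(len(records), total_bases(records))
```

In practice, the text could be captured by running the command with Python's subprocess module instead of using a hard-coded string.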

Methods

Files were downloaded from the respective databases and converted to bigBed format. The procedure is documented in our hg38 makeDoc file.

Credits

Thanks to Anna Benet-Pagès, Max Haeussler, Angie Hinrichs, Daniel Schmelter, and Jairo Navarro at the UCSC Genome Browser for planning, building, and testing these tracks. The underlying data comes from the ENCODE Blacklist and some parts were copied manually from the HGNC and NCBI RefSeq tracks.

References

Amemiya HM, Kundaje A, Boyle AP. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep. 2019 Jun 27;9(1):9354. PMID: 31249361; PMC: PMC6597582

Dwarshuis N, Kalra D, McDaniel J, Sanio P, Alvarez Jerez P, Jadhav B, Huang WE, Mondal R, Busby B, Olson ND et al. The GIAB genomic stratifications resource for human reference genomes. Nat Commun. 2024 Oct 19;15(1):9029. PMID: 39424793; PMC: PMC11489684

Krusche P, Trigg L, Boutros PC, Mason CE, De La Vega FM, Moore BL, Gonzalez-Porta M, Eberle MA, Tezak Z, Lababidi S et al. Best practices for benchmarking germline small-variant calls in human genomes. Nat Biotechnol. 2019 May;37(5):555-560. PMID: 30858580; PMC: PMC6699627

Pan B, Ren L, Onuchic V, Guan M, Kusko R, Bruinsma S, Trigg L, Scherer A, Ning B, Zhang C et al. Assessing reproducibility of inherited variants detected with short-read whole genome sequencing. Genome Biol. 2022 Jan 3;23(1):2. PMID: 34980216; PMC: PMC8722114