Description
This track shows various rearrangements in the HPRC assemblies with respect to hg38. The types include indels, duplications, inversions, and other more complicated
rearrangements. There are five tracks in the Rearrangement composite track:
- Insertions in hg38 with respect to the HPRC genomes
- Deletions in hg38 with respect to the HPRC genomes
- Inversions in hg38 with respect to the HPRC genomes
- Duplications in the HPRC genomes with respect to hg38
- Other Rearrangements: Unalignable sequences in both genomes (inversions, partial transpositions)
Display Conventions
All items are labeled with the number of HPRC assemblies that have the rearrangement. Items in the Insertions, Deletions, and
Other Rearrangements tracks have one or two additional fields that give the size of the event in base pairs.
In the Insertions and Deletions tracks there is a single number followed by "bp".
For insertions, it is the size of the insertion in hg38.
For deletions, it is the size of the sequence deleted in hg38.
For the Other Rearrangements track, there are two numbers given: the number of unaligned
bases in hg38 and the number of unaligned bases in the HPRC assemblies.
Methods
All these tracks are built from the HPRC chains and nets.
The actual instructions used to create these tracks are in the files hprcRearrange.txt and hprcInDel.txt.
The first step for all the tracks is to find the orthologous sequences in each HPRC assembly for each chromosome in hg38.
These sequences are called the query sequences. For each query sequence, we select the
longest chain to the hg38 sequence. This is called the orthologous chain.
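The selection of an orthologous chain can be sketched as follows. This is a minimal illustration, not the code from hprcRearrange.txt or hprcInDel.txt. It assumes chain header lines in the standard UCSC chain format and assumes that "longest" means the largest span on the hg38 sequence; the function and class names are hypothetical.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class ChainHeader(NamedTuple):
    score: int
    t_name: str    # hg38 (target/reference) sequence name
    t_start: int
    t_end: int
    q_name: str    # HPRC assembly (query) sequence name
    q_strand: str
    q_start: int
    q_end: int
    chain_id: str


def parse_chain_header(line: str) -> ChainHeader:
    """Parse one header line of a UCSC chain file:
    chain score tName tSize tStrand tStart tEnd qName qSize qStrand qStart qEnd id
    """
    f = line.split()
    if len(f) != 13 or f[0] != "chain":
        raise ValueError(f"not a chain header line: {line!r}")
    return ChainHeader(int(f[1]), f[2], int(f[5]), int(f[6]),
                       f[7], f[9], int(f[10]), int(f[11]), f[12])


def longest_chain_per_query(
    headers: Iterable[ChainHeader],
) -> Dict[Tuple[str, str], ChainHeader]:
    """Keep one chain per (hg38 sequence, query sequence) pair.

    Assumption: chain length is measured as the span covered on hg38.
    """
    best: Dict[Tuple[str, str], ChainHeader] = {}
    for h in headers:
        key = (h.t_name, h.q_name)
        current = best.get(key)
        if current is None or (h.t_end - h.t_start) > (current.t_end - current.t_start):
            best[key] = h
    return best
```

Keying on the pair of sequence names is a design choice for the sketch; the actual pipeline may first restrict which query sequences are considered orthologous to each chromosome.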
Following are the specific methods for each track.
Insertions, Deletions, and Others
In each orthologous chain we look for any gaps in either the reference or the query sequence. There are two basic types of gaps.
One type is when the gap contains no bases in one of the two sequences, but one or more unaligned bases in the other.
These indicate a standard insertion in one sequence or a deletion in the other. There are also gaps where there are
unaligned bases in both sequences. These may be alignment errors or sites where more than one rearrangement occurred between the two sequences.
This type of gap is in the "Other Rearrangements" track.
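The two gap types described above can be illustrated with a short sketch that walks the alignment blocks of one chain. It assumes the standard UCSC chain block lines ("size dt dq", with a final line holding only "size"), where dt is the number of unaligned hg38 bases and dq the number of unaligned query bases. The names Gap and classify_gaps are hypothetical, and the labels follow the track naming on this page (an insertion is extra sequence in hg38).

```python
from typing import Iterable, List, NamedTuple


class Gap(NamedTuple):
    kind: str     # "insertion", "deletion", or "other"
    t_start: int  # gap start in hg38 (0-based, half-open)
    t_end: int
    q_start: int  # gap start in the query, in the chain's query-strand coordinates
    q_end: int


def classify_gaps(t_start: int, q_start: int, block_lines: Iterable[str]) -> List[Gap]:
    """Classify the gaps between the ungapped blocks of a single chain."""
    t_pos, q_pos = t_start, q_start
    gaps: List[Gap] = []
    for line in block_lines:
        fields = [int(x) for x in line.split()]
        if not fields:
            continue
        # Every block line starts with the size of an ungapped aligned block.
        t_pos += fields[0]
        q_pos += fields[0]
        if len(fields) == 1:
            break  # the last block has no trailing gap
        _, dt, dq = fields
        if dt > 0 and dq == 0:
            kind = "insertion"  # bases present in hg38 only
        elif dt == 0 and dq > 0:
            kind = "deletion"   # bases present in the query only
        elif dt > 0 and dq > 0:
            kind = "other"      # unaligned bases on both sides
        else:
            continue            # dt == dq == 0 should not occur in a valid chain
        gaps.append(Gap(kind, t_pos, t_pos + dt, q_pos, q_pos + dq))
        t_pos += dt
        q_pos += dq
    return gaps
```

For chains on the query minus strand, the query coordinates produced here are relative to the reverse-complemented sequence, as in the chain format itself.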
This gap identification is done for each of the HPRC assemblies, resulting in a set of indels that are clustered based on the exact boundaries of the gap in both sequences.
Because the boundaries must match exactly, indels at the same hg38 position that differ in the number of inserted or deleted bases fall into separate clusters and appear to "pile up" in the display.
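A minimal sketch of exact-boundary clustering is shown here to make the "pile up" effect concrete. It is an illustration only: it assumes that matching "in both sequences" can be approximated by the hg38 interval plus the length of the gap in the query, since query coordinates are not directly comparable across assemblies. The function name and the assembly names in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# (assembly, hg38 chromosome, hg38 gap start, hg38 gap end, gap length in query)
GapRecord = Tuple[str, str, int, int, int]
ClusterKey = Tuple[str, int, int, int]


def cluster_exact(gaps: Iterable[GapRecord]) -> Dict[ClusterKey, int]:
    """Group gaps whose boundaries match exactly and count distinct assemblies."""
    members: Dict[ClusterKey, Set[str]] = defaultdict(set)
    for assembly, chrom, t_start, t_end, q_len in gaps:
        members[(chrom, t_start, t_end, q_len)].add(assembly)
    return {key: len(assemblies) for key, assemblies in members.items()}
```

With this keying, two assemblies that both lack 7 bases at chr1:1155 fall into one cluster, while a third assembly lacking 9 bases at the same position forms its own cluster at the same location.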
Inversions and Duplications
For each orthologous chain, we look for any other chain between the same query sequence and the sequence in hg38 that overlaps the orthologous chain.
Each of those overlaps is determined to be either an inversion or a local duplication in the HPRC genome by
the chainArrange utility.
This is done for each of the HPRC assemblies, resulting in a set of
inversions and duplications that are then clustered over all the assemblies.
The clustering is by simple overlap, such that no cluster overlaps any other, and is done
by the chainArrangeCollect utility.
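Clustering by simple overlap can be sketched as a sorted interval merge. This is not the chainArrangeCollect implementation, only one common way to obtain clusters that do not overlap each other; it assumes half-open hg38 intervals, and the function name and record layout are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# (hg38 chromosome, start, end, assembly), with half-open coordinates
Item = Tuple[str, int, int, str]


def cluster_by_overlap(items: Iterable[Item]) -> List[Dict]:
    """Merge overlapping intervals per chromosome and collect contributing assemblies."""
    clusters: List[Dict] = []
    for chrom, start, end, assembly in sorted(items):
        last = clusters[-1] if clusters else None
        if last is not None and last["chrom"] == chrom and start < last["end"]:
            # The item overlaps the current cluster: extend it.
            last["end"] = max(last["end"], end)
            last["assemblies"].add(assembly)
        else:
            # No overlap: start a new cluster.
            clusters.append(
                {"chrom": chrom, "start": start, "end": end, "assemblies": {assembly}}
            )
    return clusters
```

The size of each cluster's assemblies set corresponds to the item label described under Display Conventions. Intervals that merely touch (one ends where the next begins) are kept separate in this sketch.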
References
Wen-Wei Liao, Mobin Asri, Jana Ebler, ... et al., Heng Li,
Benedict Paten
A draft human pangenome reference.
Nature. 2023 May;617(7960):312-324.
PMID: 37165242;
PMC: PMC10172123;
DOI: 10.1038/s41586-023-05896-x
Glenn Hickey, Jean Monlong, Jana Ebler, Adam M Novak, Jordan M Eizenga,
Yan Gao; Human Pangenome Reference Consortium; Tobias Marschall, Heng Li,
Benedict Paten
Pangenome graph construction from genome alignments with Minigraph-Cactus.
Nature Biotechnology. 2023 May 10.
PMID: 37165083;
DOI: 10.1038/s41587-023-01793-w
Armstrong J, Hickey G, Diekhans M, Fiddes IT, Novak AM, Deran A, Fang Q,
Xie D, Feng S, Stiller J
et al.
Progressive Cactus is a multiple-genome aligner for the thousand-genome era.
Nature. 2020 Nov;587(7833):246-251.
PMID: 33177663;
PMC: PMC7673649;
DOI: 10.1038/s41586-020-2871-y
Paten B, Earl D, Nguyen N, Diekhans M, Zerbino D, Haussler D.
Cactus: Algorithms for genome multiple sequence alignment.
Genome Res. 2011 Sep;21(9):1512-28.
PMID: 21665927;
PMC: PMC3166836;
DOI: 10.1101/gr.123356.111