ChromosomeMappings

Contigs missing

Open mikecormier opened this issue 6 years ago • 8 comments

The GRCh38_UCSC2ensembl.txt file is missing mappings for some hg38 UCSC contigs. When I use this file to remap UCSC contig names to Ensembl names, the remapping fails because of the missing contigs.

For example, chr10_KN196480v1_fix, chr10_KQ090021v1_fix, chr11_KN196481v1_fix, etc. are all within the file being remapped, but these contigs are not in GRCh38_UCSC2ensembl.txt.
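A quick way to list every affected contig is to compare the contig names in the file being remapped against the first column of the mapping file. The following is a minimal sketch, assuming the mapping file is tab-separated with the UCSC name in the first column and the Ensembl name (possibly empty) in the second; `load_mapping` and `missing_contigs` are hypothetical helper names, not part of this repository.

```python
import csv


def load_mapping(handle):
    """Read a two-column, tab-separated mapping: source name -> target name.

    Assumes column 1 is the UCSC name and column 2 the Ensembl name.
    Rows with no second column are stored with an empty target.
    """
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comments
        target = row[1].strip() if len(row) > 1 else ""
        mapping[row[0].strip()] = target
    return mapping


def missing_contigs(contigs, mapping):
    """Return, sorted, the contigs that have no entry or an empty target."""
    return sorted(c for c in set(contigs) if not mapping.get(c))
```

Calling `missing_contigs(names, load_mapping(open("GRCh38_UCSC2ensembl.txt")))` with the contig names taken from the file being remapped would then report names such as those listed above.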

I am unaware of other files that may be missing updated contigs, but there may be a few.

Could you update the GRCh38_UCSC2ensembl.txt file, and potentially other files that are missing updated contigs?

mikecormier avatar Aug 13 '19 23:08 mikecormier

UCSC is really annoying since it doesn't have coherent releases (e.g., with a release number). If you know what should be matched together then please submit a PR.

dpryan79 avatar Aug 19 '19 14:08 dpryan79

Hey @dpryan79, what is your approach to identifying which UCSC contig matches which Ensembl contig?

mikecormier avatar Dec 20 '19 18:12 mikecormier

NCBI hosts a file that lists each sequence under several chromosome naming systems, so I use that and compare the chromosome lengths to ensure they match.
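The approach described here can be sketched roughly as follows. This is only an illustration, assuming the NCBI file is an assembly report with ten tab-separated columns in the order given in `COLUMNS`, and assuming the other side's contigs are keyed by GenBank accession in a `target_lengths` dict; which column actually corresponds to an Ensembl name differs by sequence type and is not settled by this sketch. The function names are hypothetical.

```python
# Column order of the report's data rows (assumed from its "# Sequence-Name ..." header line).
COLUMNS = [
    "Sequence-Name", "Sequence-Role", "Assigned-Molecule",
    "Assigned-Molecule-Location/Type", "GenBank-Accn", "Relationship",
    "RefSeq-Accn", "Assembly-Unit", "Sequence-Length", "UCSC-style-name",
]


def parse_report(lines):
    """Turn the non-comment rows of an assembly report into dicts."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # header and metadata lines start with '#'
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            continue  # skip rows that do not match the assumed layout
        records.append(dict(zip(COLUMNS, fields)))
    return records


def ucsc_to_genbank(records, target_lengths):
    """Pair UCSC names with GenBank accessions, keeping only length-verified pairs.

    target_lengths: accession -> sequence length on the other side (assumed input).
    Returns (mapping, mismatches), where mismatches lists UCSC names whose
    length is absent from or different in target_lengths.
    """
    mapping, mismatches = {}, []
    for rec in records:
        ucsc, accn = rec["UCSC-style-name"], rec["GenBank-Accn"]
        if ucsc == "na" or accn == "na":
            continue  # the report uses "na" where a name is not available
        if target_lengths.get(accn) == int(rec["Sequence-Length"]):
            mapping[ucsc] = accn
        else:
            mismatches.append(ucsc)
    return mapping, mismatches
```

Anything that lands in `mismatches` would need a manual look before being added to a mapping file.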

dpryan79 avatar Jan 02 '20 13:01 dpryan79

Note that there are patch contigs added over time, so these have to be updated every year or two.

dpryan79 avatar Jan 02 '20 13:01 dpryan79

The latest UCSC patch is patch 12. Does the NCBI file contain the contigs from these patches?

mikecormier avatar Jan 14 '20 23:01 mikecormier

Quite likely, yes.

dpryan79 avatar Jan 15 '20 07:01 dpryan79

You may want to check out the CollectAlternateContigNames tool that parses the NCBI assembly reports: https://github.com/fulcrumgenomics/fgbio/blob/master/src/main/scala/com/fulcrumgenomics/fasta/CollectAlternateContigNames.scala#L108-L137. It stores the mappings in a valid SAM format (a SAM header with sequence aliases), which supports multiple aliases per contig, for example GenBank, RefSeq, Ensembl, UCSC-style, and "assigned-molecule" names. It gives a little more control over which molecules to remap and which names and aliases to use. There are a few UpdateContigName tools as well in the latest master that can update various file formats. I hope that helps.
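To illustrate what a SAM header with sequence aliases makes possible, here is a minimal sketch that reads `@SQ` lines and builds an alias lookup. It assumes aliases are stored in the `AN` tag as a comma-separated list, as the SAM specification defines it; `read_aliases` is a hypothetical helper and not part of fgbio.

```python
def read_aliases(lines):
    """Map every contig name and alias found on @SQ lines to the primary SN name."""
    alias_to_primary = {}
    for line in lines:
        if not line.startswith("@SQ"):
            continue  # only sequence dictionary lines carry SN/AN
        fields = line.rstrip("\n").split("\t")[1:]
        tags = dict(f.split(":", 1) for f in fields if ":" in f)
        primary = tags.get("SN")
        if primary is None:
            continue
        alias_to_primary[primary] = primary
        # AN holds zero or more alternative names separated by commas
        for alias in filter(None, tags.get("AN", "").split(",")):
            alias_to_primary[alias] = primary
    return alias_to_primary
```

With such a lookup, a name in any of the listed naming systems resolves to the same primary contig name, which is the advantage over a two-column mapping file.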

As an aside, @dpryan79 any interest in a second repo with .dict files, or side-by-side .dict files? I'd be happy to start adding them if you have the source NCBI assembly reports.

nh13 avatar May 19 '20 23:05 nh13

@nh13 Either side-by-side or a subdirectory would work IMO.

dpryan79 avatar May 20 '20 21:05 dpryan79