gosling.js

getRelativeGenomicPosition returning 'unknown chromosome' for large intervals because of unaccounted missing sequences

Open thomcsmits opened this issue 1 year ago • 1 comments

Problem

getRelativeGenomicPosition returns 'unknown' as the chromosome for a large interval past the end of the last chromosome, because CHROM_SIZES does not account for some sequences.

Example used

{
  "title": "Visual Encoding",
  "subtitle": "Gosling provides diverse visual encoding methods",
  "layout": "linear",
  "assembly": "hg16",
  // "xDomain": {"chromosome": "chr1", "interval": [1, 3000500]},
  "views": [
    {
      "tracks": [
        {
          "id": "track-1",
          "data": {
            "url": "https://server.gosling-lang.org/api/v1/tileset_info/?d=cistrome-multivec",
            "type": "multivec",
            "row": "sample",
            "column": "position",
            "value": "peak",
            "categories": ["sample 1", "sample 2", "sample 3", "sample 4"],
            "binSize": 4
          },
          "mark": "rect",
          "x": {"field": "start", "type": "genomic", "axis": "top"},
          "xe": {"field": "end", "type": "genomic"},
          "row": {"field": "sample", "type": "nominal", "legend": true},
          "color": {"field": "peak", "type": "quantitative", "legend": true},
          "tooltip": [
            {"field": "start", "type": "genomic", "alt": "Start Position"},
            {"field": "end", "type": "genomic", "alt": "End Position"},
            {
              "field": "peak",
              "type": "quantitative",
              "alt": "Value",
              "format": ".2"
            },
            {"field": "sample", "type": "nominal", "alt": "Sample"}
          ],
          "width": 600,
          "height": 130
        }
      ]
    }
  ]
}

Examining the data and the chromosome mapping, the entire genomic interval from about 3,088,000,000 to 3,260,000,000 (172,000,000 positions in total) maps to 'unknown'.

[Image: Gosling visualization of heatmap showing position after ChrY]

A similarly large 'unknown' interval is observed for the basic bar example.

Why I think this is happening

Note: this behavior depends largely on the data and how it was assembled in the first place.

Gosling's chromosome sizes don't include unlocalized/unplaced sequences, whereas IGV's do.

Summing the lengths of Gosling (hg38) gives: 3088269832
Summing the lengths of IGV (hg38) gives: 3209286105
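To make the gap concrete, here is a minimal sketch of the comparison. The two totals are the ones quoted above; `totalLength` and the toy table are hypothetical helpers for illustration, not part of Gosling or IGV.

```typescript
interface ChromSize {
  name: string;
  length: number;
}

// Total number of positions covered by a chromosome-size table.
function totalLength(sizes: ChromSize[]): number {
  return sizes.reduce((sum, s) => sum + s.length, 0);
}

// Totals quoted in this issue (hg38).
const goslingTotal = 3088269832;
const igvTotal = 3209286105;

// Positions covered by the IGV table but not by Gosling's.
const missingPositions = igvTotal - goslingTotal;
console.log(missingPositions); // 121016273

// Toy table with made-up names and lengths, only to show how totalLength is used.
const toySizes: ChromSize[] = [
  { name: "chrA", length: 100 },
  { name: "chrA_random", length: 20 },
  { name: "chrB", length: 50 },
];
console.log(totalLength(toySizes)); // 170
```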

Why is this important?

These unknown sequences are not all placed after the last chromosome. Leaving out, e.g., chr11_KI270721v1_random after chr11 causes all subsequent chromosomes to be mapped incorrectly (and also leaves this large unknown area at the end).
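The shift can be reproduced with a toy example. This is a sketch under assumptions: `toRelative` is a hypothetical stand-in that walks cumulative offsets, not Gosling's actual getRelativeGenomicPosition, and the sequence names and lengths are made up.

```typescript
interface Seq {
  name: string;
  length: number;
}

interface RelativePosition {
  chromosome: string;
  position: number;
}

// Map a 1-based absolute position to a chromosome-relative one
// by walking the cumulative offsets of the size table in order.
function toRelative(absPos: number, sizes: Seq[]): RelativePosition {
  let offset = 0;
  for (const { name, length } of sizes) {
    if (absPos > offset && absPos <= offset + length) {
      return { chromosome: name, position: absPos - offset };
    }
    offset += length;
  }
  // Past the end of the table: nothing to map to.
  return { chromosome: "unknown", position: absPos };
}

// Table the data was assembled with (includes an unlocalized contig).
const fullTable: Seq[] = [
  { name: "chrA", length: 100 },
  { name: "chrA_random", length: 20 },
  { name: "chrB", length: 50 },
];

// Same table with the unlocalized contig left out.
const reducedTable: Seq[] = fullTable.filter((s) => s.name !== "chrA_random");

console.log(toRelative(130, fullTable));    // { chromosome: 'chrB', position: 10 }
console.log(toRelative(130, reducedTable)); // { chromosome: 'chrB', position: 30 } (shifted by 20)
console.log(toRelative(160, reducedTable)); // { chromosome: 'unknown', position: 160 }
```

With the reduced table, every position after the omitted contig is off by that contig's length, and the tail of the data falls outside the table entirely.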

The real problem

Not including these positions causes all chromosomes after chr1 to be mapped incorrectly!

Suggested changes

Include unlocalized/unplaced sequences in the chromosome sizes and account for them in getRelativeGenomicPosition.
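One possible shape for this, as a rough sketch only: build the size table so that each unlocalized contig follows its parent chromosome and the remaining unplaced contigs go last. The ordering is an assumption (the correct order depends on how the data was assembled), `parentOf` and `mergeSequences` are hypothetical helpers, the lengths are made up, and `chrUn_example` is a placeholder name.

```typescript
interface NamedLength {
  name: string;
  length: number;
}

// Parent chromosome of an unlocalized contig, following the UCSC-style naming
// "chr11_KI270721v1_random" -> "chr11". Returns null for anything else.
function parentOf(name: string): string | null {
  const idx = name.indexOf("_");
  if (idx > 0 && name.endsWith("_random")) {
    return name.slice(0, idx);
  }
  return null;
}

// Insert each unlocalized contig right after its parent chromosome;
// contigs without a known parent are appended at the end.
function mergeSequences(placed: NamedLength[], extra: NamedLength[]): NamedLength[] {
  const merged: NamedLength[] = [];
  for (const chrom of placed) {
    merged.push(chrom);
    merged.push(...extra.filter((c) => parentOf(c.name) === chrom.name));
  }
  const orphans = extra.filter((c) => {
    const parent = parentOf(c.name);
    return parent === null || !placed.some((chrom) => chrom.name === parent);
  });
  return merged.concat(orphans);
}

// Made-up lengths; only the names follow the convention discussed in this issue.
const placedChroms: NamedLength[] = [
  { name: "chr11", length: 100 },
  { name: "chr12", length: 50 },
];
const extraContigs: NamedLength[] = [
  { name: "chrUn_example", length: 30 },
  { name: "chr11_KI270721v1_random", length: 20 },
];

console.log(mergeSequences(placedChroms, extraContigs).map((s) => s.name));
// [ 'chr11', 'chr11_KI270721v1_random', 'chr12', 'chrUn_example' ]
```

A table built this way could then feed whatever cumulative-offset lookup getRelativeGenomicPosition uses, so the offsets of later chromosomes include the contigs that precede them.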

thomcsmits, Dec 22 '23 14:12