
Optimize SNPCoverage rendering on large regions

Open cmdcolin opened this issue 2 years ago • 6 comments

I mentioned recently that the arc view would put pressure on us to optimize other parts of the system, since the arc track is often most useful when zoomed out. One of those parts is snpcoverage, which is useful to display alongside the arc track to see e.g. a drop in coverage

here is a 600kb region with ~100x illumina reads from a BAM file. not small data, but it is useful to render everything, as we can see: [Screenshot from 2022-12-12 15-48-18]

the above screenshot takes about a minute to render, with ~1/2 of the time spent in generateCoverageBins (figure from speedscope with the chrome profiler): [Screenshot from 2022-12-12 15-48-54]

indeed, the code would need to allocate an array of 600,000 objects for snpcoverage because we don't perform binning. we could try to figure out some binning though and it may improve performance
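as a rough illustration of what binning could look like, here is a minimal sketch that accumulates mean depth into fixed-width bins instead of one object per base. the `Read` shape, the `binnedCoverage` name, and the `binSize` parameter are all hypothetical and not taken from the jbrowse codebase, and the sketch ignores SNP/mismatch tallies and CIGAR gaps, which the real snpcoverage code would also have to handle.

```typescript
// Hypothetical read shape (an assumption, not the jbrowse feature API).
// Coordinates are 0-based, half-open: [start, end).
interface Read {
  start: number
  end: number
}

// Mean depth per fixed-width bin over [regionStart, regionEnd).
// Uses a single typed array of numBins entries rather than one object per base.
function binnedCoverage(
  reads: Read[],
  regionStart: number,
  regionEnd: number,
  binSize: number,
): Float64Array {
  const numBins = Math.ceil((regionEnd - regionStart) / binSize)
  const bins = new Float64Array(numBins)
  for (const read of reads) {
    // clip the read to the region
    const s = Math.max(read.start, regionStart)
    const e = Math.min(read.end, regionEnd)
    if (s >= e) {
      continue
    }
    const firstBin = Math.floor((s - regionStart) / binSize)
    const lastBin = Math.floor((e - 1 - regionStart) / binSize)
    for (let b = firstBin; b <= lastBin; b++) {
      const binStart = regionStart + b * binSize
      const binEnd = Math.min(binStart + binSize, regionEnd)
      // add the number of bases of this read that fall inside bin b
      bins[b] += Math.min(e, binEnd) - Math.max(s, binStart)
    }
  }
  // convert summed bases to mean depth (the last bin may be narrower)
  for (let b = 0; b < numBins; b++) {
    const binStart = regionStart + b * binSize
    const width = Math.min(binStart + binSize, regionEnd) - binStart
    bins[b] /= width
  }
  return bins
}
```

with a 600kb region and e.g. `binSize = 100`, this would allocate 6,000 numbers rather than 600,000 objects. the per-read cost drops to roughly the number of bins the read spans.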

cmdcolin, Dec 12 '22 22:12

possibly related: https://github.com/GMOD/jbrowse-components/issues/2814. at the least, it is affected, because scrolling even a little can trigger a big, minute-long re-render

cmdcolin, Dec 12 '22 22:12

Tried looking at a ~1Mbp region of RNA-seq data w/ @carolinebridge-oicr and it was very slow. The code likely needs to 'bin' the data instead of computing per-base coverage statistics, since 1Mbp means a million small javascript objects, each with more nested javascript objects, which gets slow

cmdcolin, May 01 '23 18:05

The reason for zooming out to this large area was that the RNA-seq data was very sparse, and we were trying to find 'interesting data' in the region. We could also consider semantically switching to e.g. a bigwig or indexcov at that scale, but i think the js could still be optimized

cmdcolin, May 01 '23 18:05

see https://github.com/brentp/mosdepth#how-it-works as a possible method to optimize our snpcoverage.

for example, in pseudocode, our code currently works like this:

for (const read of reads) {
  for (let position = read.start; position < read.end; position++) {
    bins[position]++ // one increment per covered base
  }
}

therefore, if the average read length is 100bp and the number of reads is N, the runtime is proportional to 100*N

the mosdepth algorithm only increments at the start of each read and decrements at the end, then takes a cumulative sum over the region, so it's more like 2*N operations plus one pass over the region, which is probably a substantial speedup
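a minimal sketch of that start/end approach, assuming a hypothetical `Read` shape with 0-based half-open coordinates (the names `Read` and `coverageByDeltas` are illustrative, not from jbrowse or mosdepth). it treats each read as one contiguous block, so spliced RNA-seq reads or deletions would need one increment/decrement pair per aligned block instead.

```typescript
// Hypothetical read shape (an assumption for illustration).
// Coordinates are 0-based, half-open: [start, end).
interface Read {
  start: number
  end: number
}

// Per-base depth over [regionStart, regionEnd) using a difference array:
// +1 where a read starts, -1 where it ends, then one cumulative-sum pass.
function coverageByDeltas(
  reads: Read[],
  regionStart: number,
  regionEnd: number,
): Int32Array {
  const len = regionEnd - regionStart
  // one extra slot so a read ending exactly at regionEnd has a place to decrement
  const depth = new Int32Array(len + 1)
  for (const read of reads) {
    // clip the read to the region
    const s = Math.max(read.start, regionStart)
    const e = Math.min(read.end, regionEnd)
    if (s >= e) {
      continue
    }
    depth[s - regionStart] += 1
    depth[e - regionStart] -= 1
  }
  // the running sum turns the +1/-1 markers into depth at each position
  for (let i = 1; i < len; i++) {
    depth[i] += depth[i - 1]
  }
  return depth.subarray(0, len)
}
```

for example, reads covering [2, 5) and [4, 8) in a region [0, 10) give depths `0,0,1,1,2,1,1,1,0,0`. the work per read is constant regardless of read length, and the cumulative sum costs one pass over the region.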

cmdcolin, Nov 24 '23 20:11

see http://jbrowse.org/code/jb2/main/?config=test_data%2Fconfig_demo.json&session=share-1qNYykeMaS&password=cg9jv for long-read RNA-seq data (e.g. like iso-seq, but nanopore). it is fairly performance intensive and would be good to optimize; a lot of the time is spent in generateCoverageBins from snpcoverage

cmdcolin, Dec 21 '23 00:12

(link above not yet valid, but will be once site is deployed)

cmdcolin, Dec 21 '23 00:12