perbase icon indicating copy to clipboard operation
perbase copied to clipboard

from_pileup_mate_aware very inefficient?

Open brentp opened this issue 2 years ago • 2 comments

Hi Seth, if I understand correctly, from_pileup_mate_aware is run for each column in a pileup. This means it is grouping, sorting, grouping, etc for each position, often with the same sets of reads for each consecutive column.

Do I misread? If not then this will be an extreme bottleneck as it's O(n^2) or worse. We can mitigate by checking if the end of the left-most read is less than the start of the right-most read. Then at least the cost could be minimized to only the percent of reads that overlap.

brentp avatar May 30 '23 16:05 brentp

mosdepth uses such a check (if read is strictly less than mate) and if not, it saves the read into a hashtable. I think that can work here, but I'm not sure if I miss something about the implementation.

brentp avatar May 30 '23 16:05 brentp

Nope, you aren't missing anything! That is how it does it and it for sure slows things down. I do like the idea in your PR if the functions from htslib fall into place.

sstadick avatar Jun 08 '23 16:06 sstadick