
Chunk process for summary and sample-prob

KunFang93 opened this issue 10 months ago · 3 comments

Hi @ArtRand,

I was wondering if it might be possible to add chunk-based processing (similar to the pileup method) for the `--no-sampling` option in `summary` and `sample-prob` in the future. Currently, `--no-sampling` is very resource-intensive: in my case, processing 150,000 reads requires around 60 GB of RAM. Because my modifications are sparse, `--no-sampling` seems to be the only viable option I have. I can work around this by splitting my BAM file into smaller segments and then aggregating the results, but it would be ideal if `--no-sampling` could use a chunk-processing strategy like `pileup` does.

Thanks for your help!

Best, Kun

KunFang93 avatar Feb 27 '25 19:02 KunFang93

Hello @KunFang93,

That's a good idea; both of those commands are due for a little refresh. One caution about splitting the BAM: depending on how you do it, reads that span the boundary between two splits can be counted twice. Another option is to use `modkit extract calls` and pipe the table through a filter that calculates the statistics per-read. All of the rows for a read come out together, so you can operate on each read at once, calculate the %-modified, etc.
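To illustrate the per-read filter idea, here is a minimal sketch. It assumes a TSV shaped like the `modkit extract calls` table with (at least) `read_id` and `call_code` columns, where `-` marks a canonical (unmodified) call; the exact column names may differ between modkit versions, so check the header of your output. Because all rows for a read arrive together, `itertools.groupby` lets you hold only one read's rows in memory at a time.

```python
# Sketch: streaming per-read %-modified from a modkit-extract-style TSV.
# Column names `read_id` and `call_code` are assumptions; verify them
# against the header your modkit version emits.
import csv
import io
from itertools import groupby

def per_read_percent_modified(tsv_stream):
    """Yield (read_id, percent_modified) for each read in the stream.

    Rows for a read must be contiguous, which matches how modkit
    extract writes its table.
    """
    reader = csv.DictReader(tsv_stream, delimiter="\t")
    for read_id, rows in groupby(reader, key=lambda r: r["read_id"]):
        rows = list(rows)  # only one read's rows held in memory
        n_mod = sum(1 for r in rows if r["call_code"] != "-")
        yield read_id, 100.0 * n_mod / len(rows)

# Tiny in-memory stand-in for `modkit extract calls ... | python filter.py`
demo = io.StringIO(
    "read_id\tcall_code\n"
    "r1\tm\nr1\t-\nr1\t-\nr1\tm\n"
    "r2\t-\nr2\t-\n"
)
for read_id, pct in per_read_percent_modified(demo):
    print(read_id, pct)  # r1 50.0, then r2 0.0
```

In a real pipeline you would replace the `io.StringIO` stand-in with `sys.stdin` and pipe the modkit output straight in.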

Calculating the pass thresholds is a little more complicated. Right now the percentiles are calculated naively, but exactly. I can already think of a few ways to be more clever about calculating the percentiles without using as much memory. Thanks for the use case and the pressure, I'll see what I can do.
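One low-memory approach worth noting: BAM `ML`-tag modification probabilities are 8-bit integers (0 through 255), so exact percentiles can be computed from a fixed 256-bin histogram in constant memory, streaming over every call without storing them. This is a sketch of that idea, not modkit's actual implementation:

```python
# Sketch: exact percentiles over 8-bit probability values using a
# fixed 256-bin histogram. Memory stays O(1) regardless of how many
# calls are streamed in, unlike collecting all values and sorting.

def percentile_from_hist(hist, q):
    """Return the smallest bin whose cumulative count reaches quantile q."""
    total = sum(hist)
    target = q * total
    cum = 0
    for bin_idx, count in enumerate(hist):
        cum += count
        if cum >= target:
            return bin_idx
    return len(hist) - 1

hist = [0] * 256
for p in [10, 10, 200, 250, 250, 250]:  # stand-in for streamed ML values
    hist[p] += 1

print(percentile_from_hist(hist, 0.5))  # 200 (lower median of the six values)
```

Because the value domain is already quantized, this gives the same answer as the naive sort-everything approach while touching each call once.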

ArtRand avatar Feb 28 '25 23:02 ArtRand

Thanks for your suggestion! I will try it. Looking forward to seeing the new tricks in old functions :)

KunFang93 avatar Mar 01 '25 21:03 KunFang93

Reopening this to track the work.

ArtRand avatar Mar 05 '25 14:03 ArtRand