UMICollapse

Remove redundant BAM file open in paired mode

Open siddharthab opened this issue 1 year ago • 3 comments

Fixes #31.

Opening a BAM file is expensive because the index needs to be fully read. In paired reads mode, the file was being reopened at every contig change to iterate over all reads from the previous contig. This is usually not an issue for genome alignments, but transcriptome alignments may have ~100k contigs, so the repeated opens add up to a significant cost.
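To make the cost concrete, here is a toy model in Java of how the number of file opens scales with the number of contigs. `ToyReader` and both methods are hypothetical names for illustration only; they are not UMICollapse or htsjdk API. The "shared" variant (one first-pass reader plus one reused second-pass reader) is an assumed arrangement, not necessarily what this PR does.

```java
import java.util.List;

final class ReopenCost {
    /** Hypothetical stand-in for an indexed BAM reader; it only counts opens. */
    static final class ToyReader {
        static int opens = 0;
        ToyReader() { opens++; }               // a real open would read the full index here
        void secondPass(String contig) { }     // placeholder for re-iterating one contig
    }

    /** Behaviour described above: a fresh reader at every contig change. */
    static int opensWhenReopening(List<String> contigPerRead) {
        ToyReader.opens = 0;
        new ToyReader();                       // first-pass reader
        String prev = null;
        for (String contig : contigPerRead) {
            if (prev != null && !contig.equals(prev)) {
                new ToyReader().secondPass(prev);   // reopen for the previous contig
            }
            prev = contig;
        }
        if (prev != null) {
            new ToyReader().secondPass(prev);  // assumed: the last contig is handled the same way
        }
        return ToyReader.opens;
    }

    /** Assumed alternative: one second-pass reader reused for every contig. */
    static int opensWhenSharing(List<String> contigPerRead) {
        ToyReader.opens = 0;
        new ToyReader();                       // first-pass reader
        ToyReader second = new ToyReader();    // opened once, then reused
        String prev = null;
        for (String contig : contigPerRead) {
            if (prev != null && !contig.equals(prev)) {
                second.secondPass(prev);
            }
            prev = contig;
        }
        if (prev != null) {
            second.secondPass(prev);
        }
        return ToyReader.opens;
    }
}
```

In this model the reopening variant performs one open per contig plus one for the first pass, so the count grows linearly with the number of contigs, while the shared variant stays at two regardless of input.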

Ideally, the two-pass mode should not have to read the file again at all, and could instead maintain a rolling window of reads in memory.
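A minimal sketch of that idea in Java, assuming coordinate-sorted input so that all reads of a contig are adjacent. `Read`, `ContigWindow`, and `forEachContig` are hypothetical names, not part of UMICollapse; the window here is simply "all reads of the current contig", which is the coarsest possible choice.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;

final class ContigWindow {
    /** Hypothetical minimal read: only the fields this sketch needs. */
    static final class Read {
        final String contig;
        final String name;
        Read(String contig, String name) {
            this.contig = contig;
            this.name = name;
        }
    }

    /**
     * Streams the reads exactly once, buffering only the current contig.
     * At each contig change the buffered reads are handed to the callback
     * (standing in for the second pass) and a fresh buffer is started.
     */
    static void forEachContig(Iterator<Read> reads, Consumer<List<Read>> secondPass) {
        List<Read> window = new ArrayList<>();
        while (reads.hasNext()) {
            Read r = reads.next();
            if (!window.isEmpty() && !window.get(0).contig.equals(r.contig)) {
                secondPass.accept(window);     // process the finished contig from memory
                window = new ArrayList<>();    // drop it so memory stays bounded per contig
            }
            window.add(r);
        }
        if (!window.isEmpty()) {
            secondPass.accept(window);         // flush the final contig
        }
    }
}
```

Buffering a whole contig is cheap for transcriptome alignments with many small contigs but could be large for a full chromosome; a narrower window that evicts reads once they can no longer be needed would require knowing what the second pass depends on, which this sketch does not attempt to model.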

siddharthab avatar Sep 04 '24 06:09 siddharthab

With this change, the test case in the linked issue now takes 14 minutes instead of 6.3 hours.

siddharthab avatar Sep 04 '24 07:09 siddharthab

@Daniel-Liu-c0deb0t Can you please accept this PR?

siddharthab avatar Sep 16 '24 22:09 siddharthab

I just wanted to express explicit support for this proposal!

While I am not familiar with the implementation details, I think it is a very important fix. Transcriptomic alignments and draft genome assemblies typically have numerous contigs, and if this fix speeds up the deduplication of those input files so dramatically, I would love to see it merged!

MatthiasZepper avatar Sep 19 '24 12:09 MatthiasZepper