UMICollapse
Remove redundant BAM file open in paired mode
Fixes #31.
Opening a BAM file is an expensive operation because the index needs to be fully read. In paired reads mode, the file was being opened again at every contig change to iterate over all reads from the previous contig. This is usually not an issue for genome alignments, but transcriptome alignments may have ~100k contigs, so the cost of the repeated opens adds up.
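To make the scaling concrete, here is a toy model that only counts simulated file opens. It is a sketch under assumptions, not the UMICollapse code: the class and method names are hypothetical, and no BAM library is involved.

```java
import java.util.List;

public class ReopenCostSketch {
    // Simulated opens when a second reader is opened at every contig change
    // to re-read the contig that just ended (hypothetical model of the old behavior).
    static int opensWithReopen(List<String> contigOfEachRead) {
        int opens = 1; // the reader used for the first pass
        String current = null;
        for (String contig : contigOfEachRead) {
            if (current != null && !contig.equals(current)) {
                opens++; // reopen to iterate over the previous contig
            }
            current = contig;
        }
        if (current != null) {
            opens++; // one more reopen for the final contig
        }
        return opens;
    }

    // Simulated opens when a single second-pass reader is opened once and reused
    // (one possible way to avoid the redundant opens; an assumption, not the PR's diff).
    static int opensWithReuse(List<String> contigOfEachRead) {
        return contigOfEachRead.isEmpty() ? 1 : 2;
    }

    public static void main(String[] args) {
        List<String> reads = List.of("tx1", "tx1", "tx2", "tx3", "tx3");
        System.out.println(opensWithReopen(reads)); // prints 4 (1 + number of contigs)
        System.out.println(opensWithReuse(reads));  // prints 2
    }
}
```

In this model the number of opens grows linearly with the number of contigs when reopening, and stays constant when the handle is reused.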
Ideally, the two-pass mode should not have to read the file again, and instead just maintain a rolling window of reads in memory.
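A minimal sketch of that idea, assuming coordinate-sorted input and a window that spans one contig at a time (the real window might be narrower). The `Read` class, `forEachContig`, and the callback are hypothetical names for illustration, not part of UMICollapse.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiConsumer;

public class ContigWindowSketch {
    // Hypothetical stand-in for an alignment record.
    static final class Read {
        final String contig;
        final String name;
        Read(String contig, String name) {
            this.contig = contig;
            this.name = name;
        }
    }

    // Single pass over sorted reads: buffer the current contig in memory and
    // hand the buffer to the callback when the contig changes, so the input
    // never has to be read a second time.
    static void forEachContig(Iterable<Read> sortedReads,
                              BiConsumer<String, List<Read>> onContigDone) {
        List<Read> window = new ArrayList<>();
        String current = null;
        for (Read r : sortedReads) {
            if (current != null && !r.contig.equals(current)) {
                onContigDone.accept(current, window); // flush the finished contig
                window = new ArrayList<>();
            }
            current = r.contig;
            window.add(r);
        }
        if (current != null) {
            onContigDone.accept(current, window); // flush the last contig
        }
    }

    public static void main(String[] args) {
        List<Read> reads = List.of(
            new Read("tx1", "a"), new Read("tx1", "b"), new Read("tx2", "c"));
        forEachContig(reads, (contig, window) ->
            System.out.println(contig + ": " + window.size() + " reads"));
    }
}
```

The trade-off is memory: the buffer holds every read of the current contig, which is small for transcript-sized contigs but could be large for whole chromosomes.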
With this change, the test case in the linked issue now takes 14 minutes instead of 6.3 hours.
@Daniel-Liu-c0deb0t Can you please accept this PR?
I just wanted to express explicit support for this proposal!
While I am not familiar with the implementation details, I think it is a very important fix. Transcriptomic alignments and draft genome assemblies typically have numerous contigs, and if this fix speeds up the deduplication of those input files so dramatically, I would love to see it merged!