`sam_hdr_remove_lines` is inefficient if the original header contains many lines
Steps to reproduce
- Create an `input.bam` file with ~1.5M `@RG` header lines (a generator sketch follows these steps). It can contain a single read; it does not really matter.
- Create a `filter.txt` file with a single line containing one of the `@RG` ID values.
- Run `samtools view -R filter.txt input.bam`

Expected: the command completes in <1 min.
Actual: the command takes ~6 hours before it starts writing to the output file.
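For reference, here is one way to generate such an input file with the htslib header API; the file name and RG ID scheme are arbitrary choices, and error handling is trimmed to the essentials:

```c
/* gen_input.c: build a reproducer input.bam whose header carries
 * ~1.5M @RG lines (header only, no reads added).
 * Build (assuming htslib is installed): gcc gen_input.c -lhts -o gen_input */
#include <stdio.h>
#include <stdlib.h>
#include <htslib/sam.h>

int main(void) {
    samFile *out = sam_open("input.bam", "wb");
    sam_hdr_t *hdr = sam_hdr_init();
    if (!out || !hdr) return EXIT_FAILURE;

    sam_hdr_add_line(hdr, "HD", "VN", "1.6", NULL);
    char id[32];
    for (long i = 0; i < 1500000; i++) {
        snprintf(id, sizeof id, "rg%07ld", i);   /* rg0000000, rg0000001, ... */
        if (sam_hdr_add_line(hdr, "RG", "ID", id, NULL) < 0)
            return EXIT_FAILURE;
    }
    if (sam_hdr_write(out, hdr) < 0) return EXIT_FAILURE;

    sam_hdr_destroy(hdr);
    sam_close(out);
    return EXIT_SUCCESS;
}
```

`filter.txt` can then be a single line such as `rg0000000`.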
Notes
I dug through the code; the problem seems to be that `sam_hdr_remove_lines` calls `sam_hrecs_remove_line` for every header line that needs to be removed. `sam_hrecs_remove_line` itself is quick, but it calls `sam_hrecs_remove_hash_entry`, which removes the entry from the `hrecs->rg` array. Since removing an element from an array is an O(n) operation, removing (almost) all of the lines becomes O(n²) overall.
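Not the actual htslib code, just a sketch of the pattern being described: each per-line removal shifts the array tail down by one, so deleting k of n entries costs O(k·n), which is quadratic when k is close to n:

```c
#include <string.h>
#include <stddef.h>

typedef struct { const char *id; } entry;  /* stand-in for a header record */

/* O(n) per call: everything after `idx` is shifted down by one slot,
 * so calling this once per removed line gives the quadratic behaviour. */
static void remove_at(entry *a, size_t *n, size_t idx) {
    memmove(&a[idx], &a[idx + 1], (*n - idx - 1) * sizeof *a);
    (*n)--;
}
```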
If there were a way to "rebuild" the `hrecs->rg` array in a single pass once all the lines have been removed (instead of updating it for every removed line), the whole process could become much faster.
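A sketch of what that single-pass rebuild could look like (a hypothetical helper, not current htslib API): flag the doomed entries first, then compact the array once, for O(n) total cost:

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct { const char *id; } entry;  /* stand-in, as above */

/* One compaction pass keeps the survivors in order. Total cost is O(n)
 * however many entries are being dropped; returns the new length. */
static size_t compact(entry *a, size_t n, const bool *doomed) {
    size_t kept = 0;
    for (size_t i = 0; i < n; i++)
        if (!doomed[i])
            a[kept++] = a[i];
    return kept;
}
```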
Yes, this will be slow if you try to remove a huge number of records. It should be possible to make it defer the array rebuild to the end, although we'll have to take care not to access any broken pointers on the way.
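To illustrate that caveat: if removal only tombstones the slot and the rebuild is deferred to the end, every traversal in the meantime has to skip the dead slots rather than dereference them blindly (again a hypothetical sketch, not htslib code):

```c
#include <stddef.h>

typedef struct { const char *id; } entry;  /* stand-in, as above */

/* Deferred removal: tombstone the slot now, rebuild the array later. */
static void mark_removed(entry **a, size_t idx) {
    a[idx] = NULL;                     /* O(1) per removed line */
}

/* Any walk over the array between removal and rebuild must guard
 * against tombstones -- the "broken pointers" to take care of. */
static void visit_all(entry **a, size_t n, void (*fn)(entry *)) {
    for (size_t i = 0; i < n; i++)
        if (a[i])
            fn(a[i]);
}
```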