
sam_hdr_remove_lines is inefficient if original header contains a lot of lines

Open astaric opened this issue 2 years ago • 1 comment

Steps to reproduce

  1. Create an input.bam file with ~1.5M @RG header lines. It can contain a single read; the reads themselves do not matter.
  2. Create a filter.txt file with a single line containing one of the @RG values
  3. Run samtools view -R filter.txt input.bam

Expected: the command should complete in under a minute.

Actual: the command takes ~6 hours before it starts writing to the output file.

Notes

I dug through the code, and the problem seems to be that sam_hdr_remove_lines calls sam_hrecs_remove_line for every header line that needs to be removed. sam_hrecs_remove_line itself is quick, but it calls sam_hrecs_remove_hash_entry, which removes the entry from the hrecs->rg array. Since removing an element from an array is an O(n) operation, removing (almost) all of the lines becomes O(n^2) overall.

If there were a way to "rebuild" the hrecs->rg array in a single pass once all the lines are removed (instead of updating it for every removed line), the whole process could become much faster.

astaric avatar Jun 23 '22 11:06 astaric

Yes, this will be slow if you try to remove a huge number of records. It should be possible to make it defer the array rebuild to the end, although we'll have to take care not to access any broken pointers on the way.

daviesrob avatar Jun 24 '22 16:06 daviesrob