
Delay closing the index file when indexing on-the-fly

Open daviesrob opened this issue 4 months ago • 0 comments

This ensures the timestamp on the index file is later than the one on the file being indexed, preventing spurious "The index file is older than the data file" messages when the index is used. The delay is necessary because the main file's EOF block may not have been written yet when hts_idx_save_as() is called.

Reworks the idx_save functions to add one that keeps the index handle open, storing it in the hts_idx_t struct. hts_close() checks for this, and closes the index file if it finds one, after having closed the file it was passed. Unfortunately, this means hts_close() will report any errors that happen when the index file is closed. To reduce the chance of that happening, the index writer calls bgzf_flush() so that the final bgzf_close() on the index has less work to do.
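The close ordering can be sketched with plain stdio handles. Everything here is hypothetical: sketch_idx and sketch_file stand in for hts_idx_t and htsFile and do not reflect their real layouts, and fflush()/fclose() stand in for bgzf_flush()/bgzf_close().

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-ins for hts_idx_t and htsFile. */
typedef struct {
    FILE *pending_fp;   /* index handle kept open after saving, or NULL */
} sketch_idx;

typedef struct {
    FILE *fp;           /* main data file */
    sketch_idx *idx;    /* on-the-fly index, or NULL */
} sketch_file;

/* Write the index but keep the handle open. Flushing here leaves the
 * final close with as little work as possible. Returns 0 on success. */
static int sketch_idx_save_keep_open(sketch_idx *idx, const char *path,
                                     const void *data, size_t len)
{
    assert(idx != NULL && path != NULL && data != NULL);
    FILE *fp = fopen(path, "wb");
    if (!fp) return -1;
    if (fwrite(data, 1, len, fp) != len || fflush(fp) != 0) {
        fclose(fp);
        return -1;
    }
    idx->pending_fp = fp;
    return 0;
}

/* Close the data file first, then any pending index handle, so the index
 * is the last file touched. A failure from either close is reported. */
static int sketch_close(sketch_file *f)
{
    assert(f != NULL);
    int ret = 0;
    if (f->fp && fclose(f->fp) != 0) ret = -1;
    f->fp = NULL;
    if (f->idx && f->idx->pending_fp) {
        if (fclose(f->idx->pending_fp) != 0) ret = -1;
        f->idx->pending_fp = NULL;
    }
    return ret;
}
```

The cost of this design is visible in sketch_close(): its return value now covers two files, so a caller cannot tell from the result alone which close failed.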

An unfortunate wrinkle is that to set the timestamp on the index file, we need to ensure some data is written just before the file is closed. This is fine for CSI indexes, as they're BGZF compressed and we write an EOF block. For uncompressed BAI indexes, we instead use an ugly hack of holding the last few bytes back until we want to close the file. This is horrible, but I can't think of a better way to get the result we want.
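The hold-back trick for uncompressed indexes might look roughly like this. It is a sketch under stated assumptions rather than the code in this PR: held_writer, held_write(), held_finish() and the 8-byte HELD_MAX are all invented for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define HELD_MAX 8  /* arbitrary tail size chosen for this sketch */

/* Hypothetical writer that keeps the tail of the file in memory. */
typedef struct {
    FILE *fp;
    unsigned char held[HELD_MAX];
    size_t n_held;
} held_writer;

/* Write everything except the last few bytes, which stay in memory. */
static int held_write(held_writer *w, const char *path,
                      const void *data, size_t len)
{
    assert(w != NULL && path != NULL && data != NULL);
    size_t keep = len < HELD_MAX ? len : HELD_MAX;
    size_t now = len - keep;

    w->n_held = 0;
    w->fp = fopen(path, "wb");
    if (!w->fp) return -1;
    if (fwrite(data, 1, now, w->fp) != now || fflush(w->fp) != 0) {
        fclose(w->fp);
        w->fp = NULL;
        return -1;
    }
    memcpy(w->held, (const unsigned char *)data + now, keep);
    w->n_held = keep;
    return 0;
}

/* Write the held-back tail and close, so that the last write to the file
 * happens at close time rather than at save time. */
static int held_finish(held_writer *w)
{
    assert(w != NULL && w->fp != NULL);
    int ret = 0;
    if (fwrite(w->held, 1, w->n_held, w->fp) != w->n_held) ret = -1;
    if (fclose(w->fp) != 0) ret = -1;
    w->fp = NULL;
    w->n_held = 0;
    return ret;
}
```

Until held_finish() runs, the file on disk is truncated, so anything that reads the index before the final close would see incomplete data.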

Finally, it turned out that calling bgzf_flush() when the file has been opened in uncompressed mode ("u") crashed due to a NULL pointer dereference. It now more usefully flushes the underlying file.
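The shape of that fix can be illustrated with a toy writer. The struct below is hypothetical and is not the BGZF struct; in particular, the "compressed" branch just writes the buffered bytes unchanged, standing in for real block compression.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical writer; field names are invented for this sketch. */
typedef struct {
    FILE *fp;
    int is_compressed;      /* 0 when opened in uncompressed mode */
    unsigned char *block;   /* pending block buffer; NULL when uncompressed */
    size_t block_len;
} sketch_writer;

/* Flush pending data. In uncompressed mode there is no block buffer, so
 * the function must not touch it and only flushes the underlying file. */
static int sketch_flush(sketch_writer *w)
{
    assert(w != NULL && w->fp != NULL);

    if (!w->is_compressed)
        return fflush(w->fp) == 0 ? 0 : -1;

    if (w->block_len > 0) {
        /* Placeholder for compressing and writing the pending block. */
        if (fwrite(w->block, 1, w->block_len, w->fp) != w->block_len)
            return -1;
        w->block_len = 0;
    }
    return fflush(w->fp) == 0 ? 0 : -1;
}
```

Checking the mode before touching the block buffer is what keeps the uncompressed path from dereferencing a NULL pointer.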

Fixes #1732

daviesrob, Feb 23 '24 08:02