htslib icon indicating copy to clipboard operation
htslib copied to clipboard

Add method sam_open_write

Open valeriuo opened this issue 4 years ago • 6 comments

  1. The new method sam_open_write can open a htsFile file in writing mode only, attach a header struct (sam_hdr_t) to it and write the header data to the disk, before any alignment data.

  2. New getter method hts_get_fn is used to fetch the internal file name of htsFile.

  3. The index file name inside htsFile is now managed by HTSlib, and no longer points to an application variable.

valeriuo avatar Apr 06 '20 13:04 valeriuo

If we change fnidx to be managed by HTSlib (which I think we can get away with, as I haven't found anything that tries to fiddle with it), we should also change the documentation for sam_idx_init() as it's no longer necessary for the pointer to remain valid.

daviesrob avatar Apr 14 '20 15:04 daviesrob

Looks good. Maybe squash the 1st and 3rd commits together?

daviesrob avatar Apr 21 '20 08:04 daviesrob

Looks good. Maybe squash the 1st and 3rd commits together?

I squashed all three of them, as the second didn't make much sense by itself.

valeriuo avatar Apr 21 '20 12:04 valeriuo

@jmarshall so... were there any comments....?

daviesrob avatar Apr 22 '20 16:04 daviesrob

To continue the conversation from the previous PR #1052,

It could be argued that the samtools API was better. I've never entirely seen the point of having to pass in the header to sam_read1() and sam_write1()

When there are lots of different sets of headers around, e.g., when implementing samtools merge, I think the overall code is clearer when it is explicit which headers go with which bam1_t records and which files. (e.g., user code will use hdr->target_name[mybam->core.tid] to get their bam1_t's RNAME string and needs to be clear which hdr to use with that mybam — which is clearer if the right hdr also appears in nearby sam_read1/sam_write1 calls.)

Having said that, I can see the point of attaching the headers to the htsFile and having the option of not passing them explicitly to sam_read1() and sam_write1(). So FWIW I would support adding this sam_write1(fp, NULL, b) mode of operation so that user code can choose which style to use.


There are two ways of opening a file for use with sam_read1()/sam_write1(): hts_open()/hts_open_format() and hts_hopen(). In the latter case, this might be via hopen("filename") … hts_hopen() or hdopen(arbitrary_fd) … hts_hopen().

The proposed sam_open_write() limits you to directly providing the filename à la hts_open(). This means it's badly factored w.r.t. the rest of the API, as you can't use it with an hFILE* that you already have. The new function also in the end doesn't buy you much: it just bundles the sam_hdr_write and attaching of the headers into the single opening function.

I would suggest instead of adding this sam_open_write(), instead make sam_hdr_write() itself attach the headers to the file pointer. Then new-style user code could be:

samFile *fout = hts_open_format(filename, "w", &myfmt); // or hts_hopen etc
if (fout == NULL) { "error can't open: permission problem etc" }
if (sam_hdr_write(fout, headers) < 0) { "error headers borked or I/O error" }
…
…sam_write1(fout, NULL, b)…

which keeps the handling for the distinct failure modes separate and is still only two function calls (compared to the proposed sam_open_write's one function call).

Due to the header reference counting (which needs to be verified as still correct with either form of more widespread header attaching), having sam_hdr_write() implicitly attach the headers to the file pointer should have no adverse effect on old-style code that specifies the headers explicitly in sam_write1().

(If OTOH you do go with sam_open_write(), I have additional comments about error handling. hts_open_format reports operating system failures via errno, so the other error paths need to set errno too.)

jmarshall avatar Apr 24 '20 09:04 jmarshall

On the whole I think I prefer the simplicity of only having to call one function. It is true that it lacks features compared to hts_open etc., but then the hts_open interfaces also aren't completely consistent - for example, there's no hts_dopen which can lead to quite a dance when you want to write a bam file into one. You also can't pass an htsFormat struct to hts_hopen and you can't pass varargs into hts_open - thinking of which, maybe we should allow varargs in sam_open_write() which could be passed on when creating the hFILE (would require a vhopen())?

We could always add sam_hopen_write() and sam_dopen_write() later to complete the set, and simplify the sam file into file descriptor problem.

daviesrob avatar Apr 27 '20 15:04 daviesrob