htslib
htslib copied to clipboard
Add method sam_open_write
-
The new method
sam_open_write
can open ahtsFile
file in writing mode only, attach a header struct (sam_hdr_t
) to it and write the header data to the disk, before any alignment data. -
New getter method
hts_get_fn
is used to fetch the internal file name ofhtsFile
. -
The index file name inside
htsFile
is now managed by HTSlib, and no longer points to an application variable.
If we change fnidx
to be managed by HTSlib (which I think we can get away with, as I haven't found anything that tries to fiddle with it), we should also change the documentation for sam_idx_init()
as it's no longer necessary for the pointer to remain valid.
Looks good. Maybe squash the 1st and 3rd commits together?
Looks good. Maybe squash the 1st and 3rd commits together?
I squashed all three of them, as the second didn't make much sense by itself.
@jmarshall so... were there any comments....?
To continue the conversation from the previous PR #1052,
It could be argued that the samtools API was better. I've never entirely seen the point of having to pass in the header to sam_read1() and sam_write1()
When there are lots of different sets of headers around, e.g., when implementing samtools merge
, I think the overall code is clearer when it is explicit which headers go with which bam1_t
records and which files. (e.g., user code will use hdr->target_name[mybam->core.tid]
to get their bam1_t
's RNAME string and needs to be clear which hdr
to use with that mybam
— which is clearer if the right hdr
also appears in nearby sam_read1
/sam_write1
calls.)
Having said that, I can see the point of attaching the headers to the htsFile
and having the option of not passing them explicitly to sam_read1()
and sam_write1()
. So FWIW I would support adding this sam_write1(fp, NULL, b)
mode of operation so that user code can choose which style to use.
There are two ways of opening a file for use with sam_read1()
/sam_write1()
: hts_open()
/hts_open_format()
and hts_hopen()
. In the latter case, this might be via hopen("filename") … hts_hopen()
or hdopen(arbitrary_fd) … hts_hopen()
.
The proposed sam_open_write()
limits you to directly providing the filename à la hts_open()
. This means it's badly factored w.r.t. the rest of the API, as you can't use it with an hFILE*
that you already have. The new function also in the end doesn't buy you much: it just bundles the sam_hdr_write
and attaching of the headers into the single opening function.
I would suggest instead of adding this sam_open_write()
, instead make sam_hdr_write()
itself attach the headers to the file pointer. Then new-style user code could be:
samFile *fout = hts_open_format(filename, "w", &myfmt); // or hts_hopen etc
if (fout == NULL) { "error can't open: permission problem etc" }
if (sam_hdr_write(fout, headers) < 0) { "error headers borked or I/O error" }
…
…sam_write1(fout, NULL, b)…
which keeps the handling for the distinct failure modes separate and is still only two function calls (compared to the proposed sam_open_write
's one function call).
Due to the header reference counting (which needs to be verified as still correct with either form of more widespread header attaching), having sam_hdr_write()
implicitly attach the headers to the file pointer should have no adverse effect on old-style code that specifies the headers explicitly in sam_write1()
.
(If OTOH you do go with sam_open_write()
, I have additional comments about error handling. hts_open_format
reports operating system failures via errno
, so the other error paths need to set errno
too.)
On the whole I think I prefer the simplicity of only having to call one function. It is true that it lacks features compared to hts_open
etc., but then the hts_open
interfaces also aren't completely consistent - for example, there's no hts_dopen
which can lead to quite a dance when you want to write a bam file into one. You also can't pass an htsFormat
struct to hts_hopen
and you can't pass varargs into hts_open
- thinking of which, maybe we should allow varargs in sam_open_write()
which could be passed on when creating the hFILE
(would require a vhopen()
)?
We could always add sam_hopen_write()
and sam_dopen_write()
later to complete the set, and simplify the sam file into file descriptor problem.