htslib icon indicating copy to clipboard operation
htslib copied to clipboard

bam1_t build function

Open jkbonfield opened this issue 5 years ago • 13 comments

See Will Stokes' request at https://sourceforge.net/p/samtools/mailman/message/36454348/

We desire a way to construct a BAM record from the component pieces that isn't just a manual mess of bit/byte shuffling. Most of it is easy, but some things such as creation of the sequence is complex.

CRAM has an internal function for this (bam_construct_seq), so looking at that and surrounding functions may be a suitable starting point.

jkbonfield avatar Oct 30 '18 17:10 jkbonfield

Is this still relevant? If yes, I might give it a try.

anderskaplan avatar Sep 18 '20 13:09 anderskaplan

Yes, it would still be useful, thanks. As James mentioned, bam_construct_seq could be used as a starting point, although you'd probably want to pass in a bam1_t pointer directly and the extra_len bit really should be automatic.

A separate interface for bulk adding aux tags would also be useful, although I've no idea what it should look like.

daviesrob avatar Sep 18 '20 15:09 daviesrob

As Rob points out it probably needs something sorting regarding auxiliary data. The way this is used in CRAM is the aux data is stitched together separately and has some special handling for the read-group (which is stored as an integer field in CRAM), but as it does know the length of the incoming aux it passes over the length in order to avoid a second realloc. See the call here: https://github.com/samtools/htslib/blob/80161279b755d83df6e0dbc52942f625df414452/cram/cram_decode.c#L2899

I think it probably should take a uint8_t *preformatted_aux and size_t aux_len fields which may be passed in as NULL/0 if you don't wish to use them. Constructing aux is a different set of APIs maybe? Either that or we additionally have a varargs style interface for specifying key/value pairs. There are maybe some ideas around that in the SAM header API, eg: https://github.com/samtools/htslib/blob/develop/header.c#L1316

jkbonfield avatar Sep 18 '20 15:09 jkbonfield

All right, thanks for the pointers! I'll dig into it and probably come back with more questions.

anderskaplan avatar Sep 20 '20 19:09 anderskaplan

Hi again, I've done some research and come up with two alternatives and a few questions. I'd appreciate your input to be able to move forward!

The function we want has two non-trivial jobs to do. The first is to manage the bam1_t data buffer, making sure that it gets reallocated as necessary and writing the dynamic length fields into it. The second is to convert the sequence string from a char array to a packed integer format.

The bam_construct_seq() function in cram_samtools.c seems to do what is needed: it sets all the fields in the bam1_t struct except the aux fields. And the aux fields can be added using bam_aux_append(). That seems to me like a reasonable way to do it. Performance-wise it should be OK since the aux fields are at the end of the bam data buffer, i.e., nothing needs to be moved around. One could also create an API to parse an aux string and append all the aux fields to the bam struct, much like sam_parse1() does. I think that would be best as a separate function.

However, as has been pointed out, bam_construct_seq() is not in the public API, and its argument list seems to be tailor made for its only caller, cram_to_bam(). So just moving it into the public API is probably not the best option. Instead, I suggest these two alternatives:

Alternative 1. A function which writes all the fields of the bam1_t struct. Unfortunately this means that the argument list becomes quite extensive:

int bam_construct(
      hts_pos_t  pos,      // 0-based leftmost coordinate
        int32_t  tid,      // chromosome ID, defined by sam_hdr_t
        uint8_t  mapq,     // mapping quality
       uint16_t  flag,     // bitwise flag
         size_t  l_qname,  // length of the query name
    const char  *qname,    // query name
         size_t  n_cigar,  // number of CIGAR operations
const uint32_t  *cigar,    // CIGAR data
         size_t  l_seq,    // length of the query sequence (read)
    const char  *seq,      // sequence
    const char  *qual,     // sequence quality
        int32_t  mtid,     // chromosome ID of next read in template, defined by sam_hdr_t
      hts_pos_t  mpos,     // 0-based leftmost coordinate of next read in template
      hts_pos_t  isize     // observed template length ("insert size")
);

Note:

  • The names of the fields differ between the SAM format spec, the bam1_t declaration, and the bam_construct_seq() parameter list. Here I have tried to conform to the bam1_t declaration, except for mapq/qual which are from the SAM format spec. Any preferences about the naming?
  • In the bam1_t, the length fields are declared like this: uint16_t l_qname, uint32_t n_cigar, int32_t l_qseq. The sign of l_qseq does not seem to be used anywhere. My suggestion is to use size_t's to signal intended use, i.e., that these are array lengths, instead of storage type. Any preferences?
  • What about the id field in the bam1_t struct, should it be an input parameter as well? It's not documented and I cannot see any references to it within htslib, except that it is copied by bam_copy1, but it might of course be used elsewhere.
  • The special formatting of the parameter list is only to make it easier to read here; the code will use standard formatting and doxygen-style comments.

Alternative 2. A simplified function to set only the variable-length fields, since the regular fields are easily set directly on the struct:

int bam_assign(
         size_t  l_qname,  // length of the query name
    const char  *qname,    // query name
         size_t  n_cigar,  // number of CIGAR operations
const uint32_t  *cigar,    // CIGAR data
         size_t  l_seq,    // length of the query sequence (read)
    const char  *seq,      // sequence
    const char  *qual      // sequence quality
);

The questions about naming and type preferences above apply here as well.

Alternative 3. A combination of alternatives 1 and 2, because each might have its use. The first would then call on the second.

anderskaplan avatar Oct 10 '20 20:10 anderskaplan

That's an interesting philosophical point. Personally I wouldn't have started from having a user visible bam struct like this (even if it's technically visible, mandate that we use get and set macros or inline functions to provide us freedom to change things). Hence my instinct is the more complete first way is cleaner. (You have some valid questions on the first method, but I'll delay thinking about those too hard until we know which strategy we prefer.)

However we are where we are and use of bam1_t->core struct is endemic throughout ours and others code. That's not going to change short of a full "htslib 2.0" so the second function is more in keeping with htslib design. For completeness I'd be inclined to add size_t l_aux and uint8_t *aux parameteres which add an entire aux-encoded block of data. They can be passed in as 0, NULL perfectly easily, but if we already happen to have data in that form then we could supply it.

Alternatives may be to specify the arguments as kstring_t, but that's personal preference. I think I prefer it as-is as it's not hard to call it with seq.l, seq.s if you have a kstring_t seq.

Other maintainers wish to add their tuppence worth on this one?

jkbonfield avatar Oct 12 '20 08:10 jkbonfield

I'd say alternative 1, on the grounds that you're most likely to want to set everything - although you'd probably want to include a pointer to the bam1_t struct to fill out :smile:. Const pointers and separate lengths are fine, especially as it enforces the rule that seq and qual are the same length.

I'm not so sure about the utility of passing in pre-formatted aux data though. It's not a trivial thing to get right - you have to use the correct type encodings, deal with endianness and unaligned access. We already have APIs to add aux records without having to worry about such internal details, although it might be useful to have another function that can add them more efficiently in bulk, and maybe something that allowed them to be copied from another BAM record. It might be useful to be able to reserve space for them in the bam_construct() function though, so you can avoid having to do some memory reallocation later.

daviesrob avatar Oct 12 '20 09:10 daviesrob

maybe something that allowed them to be copied from another BAM record.

That was my thought for why it should be included. If we're constructing a new BAM record from the relics of an old one (maybe some major reshuffling, such as turning a soft-clip into a hard-clip) then it's trivial to get hold of a preformatted block of aux tags. Obviously I'd encourage use of the existing functions to construct the tags for other cases.

It might be useful to be able to reserve space for them in the bam_construct() function though, so you can avoid having to do some memory reallocation later.

Preallocation is sort of done by the CRAM version for the same reasons of avoiding expensive reallocs (although it's also doing a bit more there as it's also explicitly declaring that space as used by tags so you must then fill it).

jkbonfield avatar Oct 12 '20 09:10 jkbonfield

Thanks for the clarifications! Summing up what's been said so far, it seems that this is the way we're heading:

int bam_construct(
         bam1_t**bam,      // target BAM struct
         size_t  l_qname,  // length of the query name
    const char  *qname,    // query name
       uint16_t  flag,     // bitwise flag
        int32_t  tid,      // chromosome ID, defined by sam_hdr_t (a.k.a. RNAME)
      hts_pos_t  pos,      // 0-based leftmost coordinate
        uint8_t  mapq,     // mapping quality
         size_t  n_cigar,  // number of CIGAR operations
const uint32_t  *cigar,    // CIGAR data
        int32_t  mtid,     // chromosome ID of next read in template, defined by sam_hdr_t (a.k.a. RNEXT)
      hts_pos_t  mpos,     // 0-based leftmost coordinate of next read in template (a.k.a. PNEXT)
      hts_pos_t  isize,    // observed template length ("insert size") (a.k.a. TLEN)
         size_t  l_seq,    // length of the query sequence (read)
    const char  *seq,      // sequence
    const char  *qual,     // sequence quality
         size_t  l_aux,    // length of the auxiliary field data (optional)
  const uint8_t *aux       // auxiliary field data (optional)
);

The intention was to use the same pre-allocation/automatic re-allocation mechanism as the CRAM version does. Hence the use of a double pointer to the BAM record. But as you mentioned, that part of the interface was lost in translation. Oops. 😄

If one wants to copy the aux data from another BAM record, it is straightforward to use bam_get_l_aux(b) and bam_get_aux(b) for the l_aux and aux parameters. Do you see any potential problems with that -- particularly concerning the type encodings, endianness and unaligned access issues mentioned above?

Another alternative could be to create a separate function to copy the aux data from one BAM record to another. If that is the preferred route, then we might replace the l_aux/aux parameters on bam_construct() with an optional "reserved aux size" parameter to ensure that the BAM record is large enough to hold an aux block with the given size.

Here I have used the same field ordering as in the SAM specification as far as possible. A logical order is always helpful, and if we want to consider the structure of the BAM record as an internal implementation detail, then I think it's better to go with the public spec.

anderskaplan avatar Oct 12 '20 18:10 anderskaplan

Yes, I think it's roughly the right direction. I'd use a single pointer to the bam1_t struct. The double one is likely a relic of the ancestor of bam_construct_seq in io_lib. Io_lib stores the fixed and variable length parts of the BAM record in a single structure, so it may need to change the location of the whole thing when resizing. HTSlib on the other hand stores the variable-length parts separately so it should not need to change the location of the bam1_t struct.

I'd also prefer to have the aux block manipulation done separately, so we can have various ways of building them without necessarily showing a preference for any single method. Being able to reserve some space would be useful, though - it should be fairly easy to guess how much you might need in quite a few cases, and would save having to reallocate later if you get it approximately right.

daviesrob avatar Oct 12 '20 19:10 daviesrob

I've just been needing this for a "samtools fastq" revision. For now I copied the cram one into the source with a few tweaks as a place-holder, but it was enlightening as it showed me one potential change.

Should we permit qual to be string as well as binary? Maybe by just specifying a starting offset or delta value (ie 33 for string to binary). The usage is obvious. When constructing a BAM record from a BC/QT tag (seq/qual) we have these strings already in memory and it's a bit of a pain to have to create a temporary string just to convert from ASCII to binary before passing it into this function.

Although the function also handles qual of NULL in which case it fills out value 0xFF (which comes out as "*" in SAM). Ideally this could be configurable so we can use a fixed qual instead.

Note both of these are solved by using a local temp buffer to build up qual first, so it's no biggy and one without the other doesn't help my use case.

Thoughts?

PS. Yes Rob, I also needed the ability to pass over an entire block of pre-formatted aux tags as a single argument! Although as it wasn't in the func I copied I did indeed do it in two calls instead, but it did need the extra_len flag present still.

jkbonfield avatar Oct 16 '20 16:10 jkbonfield

I've got this working on my machine now, so I'll have a pull request ready soon.

Given that the poor function already has 16 (!) parameters, I'd recommend against adding another one for the qual offset/delta value. An option could be to pass in NULL for the qual array, which will fill the qual buffer in bam->l_data with 0xFF. Then as a second step you could copy the qual data minus the offset directly into the qual buffer.

anderskaplan avatar Oct 16 '20 21:10 anderskaplan

Given that the poor function already has 16 (!) parameters, I'd recommend against adding another one for the qual offset/delta value. An option could be to pass in NULL for the qual array, which will fill the qual buffer in bam->l_data with 0xFF. Then as a second step you could copy the qual data minus the offset directly into the qual buffer.

Good point. That's still less work than using a temporary buffer to construct the quality buffer to pass in. Thanks for your work here.

jkbonfield avatar Oct 19 '20 08:10 jkbonfield