
BAM v2 idea tracker

Open jkbonfield opened this issue 8 years ago • 20 comments

Given #40 has a new BAMv2 as an option, we should start thinking about what else may be suitable to add. Some ideas to get the ball rolling.

  1. A minor version number. Both major and minor would be strictly checked, so we don't get into the same pickle as before.

  2. More than 65535 cigar operations, obviously!

  3. Remove the bin field. It doesn't work for anything with large chromosomes anyway, and only makes sense for BAI indices. Best computed on the fly. Suggestion: this frees up the extra bits needed for cigar, with a little shuffling of flag fields.

  4. Swap cigar and name fields. Right now we have all the fixed size fields together, followed by the variable size read name and the multiple of 4 sized cigar field. BAM was intended to be workable within C/C++ by simply loading it straight into memory and dereferencing, but some compiler optimisations generate SIMD instructions for the cigar fields and then crash as it's unaligned data. (In memory we resolve this by adding 1-3 nuls to the read name instead of just 1.) Swapping the two fields cures this problem.

What else?

jkbonfield avatar Sep 12 '17 08:09 jkbonfield

We could consider adding .csi as the official index format for BAMv2 given that .bai doesn't support longer contigs.

tfenne avatar Sep 12 '17 09:09 tfenne

Supporting 64 bit reference sequence lengths would be useful even if it takes a while for the rest of the ecosystem to catch up.

d-cameron avatar Sep 12 '17 12:09 d-cameron

@tfenne I think indexing is a separate issue, given that .csi works for BAM, and we should be recommending it anyway due to the limitations in .bai.

@d-cameron Are there any reference sequences that get close to or beyond 4 Gbases? Switching to unsigned is easy (in fact HTSlib already reads l_ref as unsigned); increasing the size is a bit more work, albeit not that difficult.

I would also suggest changing some other fields from signed to unsigned where this makes sense.

Would anyone want to increase the header length to 64-bit? I think I've seen at least one case where someone made a header that did not fit in the current limit, but it was a while ago and I can't easily find it at the moment. It was an assembly with a large number of contigs that had fairly long names if I remember correctly.

daviesrob avatar Sep 12 '17 14:09 daviesrob

@daviesrob unsigned types are a real PITA for htsjdk. Java has fairly lame support for unsigned types, making them very hard to work with. From that side it's always preferable to just bump up to 64-bit signed and let compression do the job of squeezing out any empty bits.

I agree .csi indexing could be separate, but if we do a BAMv2 I think we should consider making that the standard index type instead of .bai.

tfenne avatar Sep 12 '17 14:09 tfenne

Length of reference may be permitted to be unsigned, but POSition within a ref is still signed as it is 0-based with -1 as unmapped. Similarly, TLEN doesn't work properly as it has to be signed. So there is nothing to be gained by changing one field without changing all of them; hence >= 2Gb is the point to be considering. Does this happen? Not sure, but it seems likely (see discussion in https://www.biostars.org/p/12560/).

As far as compressing 64-bit values, I don't know the impact, but it's not completely free as gzip is pretty rubbish plus you have slightly fewer records per bgzf block. I expect it's only a minor impact though. Try it and see.

jkbonfield avatar Sep 12 '17 15:09 jkbonfield

One of the main reasons to use unsigned types would be to disallow negative values which make no sense, so even if going to 64 bits I would still want the types to be unsigned. We can add a note to state that the maximum usable value is limited due to the use of signed values elsewhere. I'd also expect other implementation-defined limits, for example HTSlib can't quite do the full l_read_name at the moment for internal reasons.

Allowing 64 bit r_len raises a few other issues. pos, next_pos and tlen also get bigger. We may want to shuffle a few fields so that they are properly aligned to an 8-byte boundary so we can avoid holes in the in-memory structure. This would allow for writing the record in a single chunk, which is a useful property of the original format (although at least one of BAM and BAM2 would need to be broken in this respect if they want to use the same data structures internally).

daviesrob avatar Sep 12 '17 16:09 daviesrob

Are there any reference sequences that get close to or beyond 4 Gbases?

People tend to run into the 512Mb BAM index limit and chop up their reference so they can actually use programs.

Even moving from signed to unsigned doesn't give much wiggle room. With 100+Gbase genomes reported, there may well be real sequences greater that 4Gb so it would be good to be able to support them.

d-cameron avatar Sep 13 '17 00:09 d-cameron

I've had a go at implementing a BAM2 with 64-bit reference lengths. The results so far can be found in my bam2 branch if anyone wants to try it. Note that it's very much not for production - the implementation will change.

The impact on file size isn't too bad. On my test file (original 587575801 bytes), I got:

Uncompressed BAM:  2386240221 bytes
Uncompressed BAM2: 2466117766 bytes (3% bigger)
Compressed BAM:    587575801 bytes
Compressed BAM2:   591009870 bytes (0.6% bigger)

While implementing it, I decided that a very useful addition would be a bam_extras field in the alignment records. This would allow for future enhancements in a more backwards-compatible way. It would be a set of flags that indicate extensions to the base BAM2 format. Writers should set unused bits to zero. If readers find a bit that they cannot interpret set, they should return an error. I think this would make it much less likely that we would need to increment the BAM version number again.

daviesrob avatar Sep 21 '17 14:09 daviesrob

Can we just remove the TLEN field? From what I can see, nobody trusts the value written by any other tool so, practically speaking, it doesn't serve any purpose.

d-cameron avatar Sep 21 '17 15:09 d-cameron

@d-cameron Getting rid of TLEN would be a separate issue. It exists in SAM, so it has to be in BAM as well.

We can get rid of bin because it's easy to calculate and doesn't actually appear in SAM.

daviesrob avatar Sep 21 '17 16:09 daviesrob

Bin also has the issue that it is inherently tied to the BAI index, so it leaks some of the index format into the BAM format. This is an issue if we use CSI indices, which realistically we have to (or something equivalent) if we switch to longer chromosome lengths. Bin is basically dead and buried already as far as I'm concerned, as it only works on human data and smaller.

jkbonfield avatar Sep 21 '17 16:09 jkbonfield

The fact that you might want to recalculate TLEN doesn't mean that there shouldn't be a good place to store it....


yfarjoun avatar Sep 21 '17 18:09 yfarjoun

I've had a go at implementing a BAM2 with 64-bit reference lengths. The results so far can be found in my bam2 branch if anyone wants to try it.

Have you pushed your local changes to github? It seems that bam1_core_t still uses 32-bit pos. Moving to 64-bit will require quite a lot of changes, including the header, CIGAR, .csi and APIs.

lh3 avatar Sep 21 '17 19:09 lh3

@lh3 The purpose of the experiment was to see how much bigger the file got when writing 64 bit values (not much). In the implementation I kept the size of the internal structures the same for ABI compatibility, but wrote out 64 bit values for pos etc. to the file. At some point in the future we can increase the size of the internal structures, but there's no particular urgency so we can do that at a time of our choosing (and bundle in lots of other ABI-breaking changes at the same time). The important thing is that the format would no longer be the limitation on what we can do.

In HTSlib, n_cigar was made 32-bit internally on our last ABI-breaking release (1.4). The CSIv1 specification includes example code that uses int64_t values for the indexed range. I think CSI should be fine, as long as min_shift and depth are chosen to keep the number of bins to within the size of an int32_t.

daviesrob avatar Sep 22 '17 11:09 daviesrob

@d-cameron I'd hate to lose TLEN. I know exactly where it comes from in my pipelines, am happy with it, and can't recompute it without co-locating read pairs which is a huge PITA in a non-queryname sorted BAM.

tfenne avatar Sep 22 '17 13:09 tfenne

Agreed on TLEN. While it may be ill defined, within your own pipeline you can both define and use it in a robust manner. If it wasn't in its own column it'd need to be in an auxiliary tag instead. (Arguably better, but it's here already.)

jkbonfield avatar Sep 22 '17 15:09 jkbonfield

I am thinking about a different way to implement BAMv2. It works on two conditions:

  1. The BAM version string is put into an uncompressed BGZF block, such that it can be modified afterwards.

  2. A newer BAM reader can always parse older BAMs without looking at the version string in the header.

In this design, samtools writes BAMv1 by default. If samtools is writing to a file and there are long cigars, it writes the long cigars with, for example, the 0x8000 flag (or other approaches). At the end of the process, samtools seeks back and updates the version string to v2. If samtools is writing to a stream, it prints a warning in the presence of long cigars. Conversely, if a user sets the output format to BAMv2 but samtools doesn't see any long cigars, it either updates the version string to v1 (if to a file) or gives a warning at the end (if to a stream). We also develop a tool that reads through a BAM, analyzes each record and sets the version string to the minimal version that is compatible with all records. The 2nd condition is necessary here.

This design reduces the chance of writing BAMv2 unnecessarily. Even if the BAM version in the header is not appropriate, we can fix it later without too much computation. These will help to reduce fragmentation.

The above is just a brainstorm idea with details missing, but I think it could work, in theory.

EDIT: if we go further along condition 2, we may abandon global versioning. We give each record a local version instead. The version can be encoded with block_size, following @daviesrob's idea. For example, positive for v1, -12 for v2, -13 for v3, etc. The encoding of the following block is fully determined by block_size and described in the spec. For each record, a BAM writer is required to choose the oldest encoding that is compatible with the record.

Local versioning is essentially a generalization of option 3 in the other thread. It avoids the complication of setting the right global version on the command line (e.g. users may choose to always output BAMv2 even if not necessary) and allows us to extend BAM frequently without unnecessary fragmentation – ultimately, it is unnecessary fragmentation that worries me most. The drawback is also obvious, though, in that we don't know if a tool is compatible with a BAM by just looking at the header. Condition 1 will help, but it is not perfect, either, and is tricky to implement.

lh3 avatar Sep 26 '17 04:09 lh3

How about having a designated place to store the md5sum of the index in the BAM (e.g. in the final block), so that we don't need to trust timestamps?

yfarjoun avatar Nov 30 '24 03:11 yfarjoun

I suspect too many things rely on the EOF block being a very precise set of bytes. However, there's no reason we couldn't add a separate empty bgzf block with additional data in it somewhere, if we can figure out how to do it without breaking the BGZF spec. Unfortunately that's quite prescriptive on the extra data contents. I guess we could do something sneaky, such as having the penultimate BGZF block also be empty (but not matching EOF) and using the gzip MTIME field as the index CRC.

That would be backwards compatible with BAMv1, so if found it can be checked, if not it'd be silently glossed over by decoders.

jkbonfield avatar Dec 02 '24 09:12 jkbonfield

Thinking again on this, one solution would be to encode a CRC32 (it's good enough for detecting accidents, but not enough for security obviously) in the BAM. We could try generating a block of data that has the same CRC32 in the gzip block as the BAM index (see https://www.nayuki.io/page/forcing-a-files-crc-to-any-value), but that means our block has actual data in it and we want a zero sized block, post decode.

Alternatively we can encode it in the Huffman tree. Basically our Huffman tree has bit-code lengths for all the literal+distance symbols in the byte stream. We can add symbols that don't actually occur, so we can store many Huffman codes while having nothing to decode. I validated this is fine with zlib. Then it comes down to: what do we store? The easy solution is, if we have 32 bits of checksum to encode, then that's 32 x 1-bit quantities. Each 1-bit quantity is basically "is the next symbol present". Ie ABCFGI is ABC..FG.H where . is absent symbols, so (0)00110010. Basically we can store any 32-bit quantity with at most an alphabet spanning 64 literal symbols, or trivially up to 146 bits of meta-data in total via the DEFLATE Huffman tree. If we want to store the code lengths then it can be even more, but we'd need to be wary of implementations that actually form a tree, so the lengths must make sense. Anyway, more than enough for our purpose. I'd suggest 8 bits of fake block ID and then N bits of fake block contents, for future flexibility.

It'd be trivial to have a way to encode fake data like this. It would be completely transparent to any older BAM decoder, while permitting newer encoders and decoders to do additional validation. The question is: should we? It's only going to work for SAM.gz, BAM, VCF.gz or BCF, as CRAM would need its own mechanism due to having no BGZF. Arguably it's best done by a header, but that's something we may not know upfront if we're automatically indexing as we go.

A better solution may be to exploit a shared piece of random knowledge. Eg pick a random value and encode in both BAM and BAI. If they match then they (probably) belong together. This is verifiable without checksumming entire files, plus it's something we know at generation so we can use it before finishing the file. Ie it could be in the header if we wanted. We still need a way of storing arbitrary additional data. For SAM / VCF it's obviously possible in the headers with ease. For the index it's once again a bit tricky - CRAI lacks meta-data, and BAI/CSI/TBI all have nuances although there are meta-data bins for such purposes perhaps.

jkbonfield avatar Jan 06 '25 15:01 jkbonfield