CRAM v4 idea tracker
CRAM v4 isn't on the cards yet, but it's good to keep track of any ideas in one place so that, if it ever happens, we don't forget anything.
- Index format improvements: https://github.com/samtools/hts-specs/issues/137
- Removal of CORE block?
- Cull some of the codecs (mainly those to do with core, eg beta, subexp, etc)
- Additional codecs, if not already added by then in a 3.x sub-release. Eg FSE and Huff0 at the fast end, ZSTD in the middle, and maybe fqzcomp for qualities and a custom name-LZ at the slower end (see io_lib cram_modules branch).
- Read names present/absent via another route, eg zero length read name => generate. Currently it is done via a flag but this also forces detached mode and has other implications.
- Possibility of transforms prior to codec.
- Eg nibble packing (put two qualities into one byte before applying rans0, rans1, huff0, etc).
- String delta (prefix / suffix removal).
- Ability to store confidence values in original orientation. This improves order-1 compression ratio for qualities and is very quick to achieve.
- Support for multiple embedded references per slice?
- Permit rANS table to be in compression header and maybe implement an order-2 codec? It'll be much slower due to memory usage, but potentially with say 100 1k read slices per container we get size efficiency coupled to strong random access capability.
- Potential for conditional decoding, with the possibility of duplicated entries. Eg h37 & h38.
- Each read has a bit field to indicate under which reference the data should be emitted (both for most cases).
- Possibility of bit field per data series too? Eg cigar may change, but quality string doesn't.
- How to get working with read-pairing (read number delta)?
The first 4 items there could be done via a point release of 3 surely?
I'm not so sure on index format changes or removal of features, but new codecs for sure.
Index format is independent of the file format; ideally it should have its own versions.
New codecs are fine in v3.x as long as they are well justified.
Also there is the matter of priorities, for example embedded multirefs may have higher impact.
Additional idea based on observation from RNAseq data.
Employ a special code to mean "same entry as last record". In particular we often see data (eg tophat) where, say, the fwd read is aligned in two locations and the rev read only matches one site. In order to permit valid PNEXT/RNEXT/TLEN fields the same record then appears twice in a row - once per fwd alignment.
A simple test of a block of quality values shows around 4% saving simply by embedding "duplicate last line" code, while keeping speed very high as it's trivial to implement. The same logic may apply to other fields, such as read-names, cigar strings, etc.
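A rough sketch of how such a code might work for quality strings (the DUP_LAST sentinel value and the function name are made up here, and it assumes 0xff never occurs as a real quality value):

```c
#include <stddef.h>
#include <string.h>

#define DUP_LAST 0xff  /* hypothetical sentinel: "same as the previous record" */

/* Emit one quality string, collapsing an exact repeat of the previous
 * record into a single sentinel byte.  Returns the number of bytes
 * written to out[]. */
size_t emit_qual(const unsigned char *qual, size_t len,
                 const unsigned char *prev, size_t prev_len,
                 unsigned char *out) {
    if (prev && len == prev_len && memcmp(qual, prev, len) == 0) {
        out[0] = DUP_LAST;
        return 1;              /* one byte instead of len bytes */
    }
    memcpy(out, qual, len);
    return len;
}
```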
More things to consider
- Support for verbatim CIGAR ops, permitting zero sized ops and distinguishing between =, X and M.
- NM / MD present bit field. This permits recreating these only for records that originally had them and not for every record (or none).
- Some way of separating RNEXT, PNEXT and TLEN from each other. Right now they're all computed on-the-fly or all stored verbatim. We often find RNEXT and PNEXT correct and TLEN out by one, but then pay the cost of storing RNEXT and PNEXT verbatim. We also don't have a way of storing * 0 0 for these fields without verbatim storing them and marking the read as detached (which has implications elsewhere).
Exploratory work based on a small chunk of Illumina NovaSeq data:
CRAM v3, 10k reads/slice: 23.5Mb, 5.9s encode, 1.6s decode
CRAM v4, 10k reads/slice: 19.7Mb, 7.9s encode, 2.0s decode
This is entirely down to two new codecs - a dedicated read name encoder (around 40% smaller than gzip and 30% smaller than lzma) and a rANS variant with bit-packing and RLE built in. There were few tags in this data (it's from Isaac) so the names + qualities dominate. On older bwa data with 40 qualities I was looking at only a 4% or so saving (vs 16% on this data). This is slightly slower than CRAM v3, but note it is faster (at both encode and decode) and smaller than using either bzip2 or lzma, so it offers us a better position on the Pareto frontier for the size vs time tradeoff.
The rANS variant implementation poses an interesting question. How do we handle this in a general purpose manner?
Currently it's just a hack - the "order" byte (0 vs 1) now has two additional bits to indicate whether RLE was performed or whether bit-packing was present. Eg on a single block of NovaSeq quals:
268.4 MB/s enc, 505.4 MB/s dec 15100000 bytes -> 713819 bytes (r4x16pr -o193)
182.6 MB/s enc, 318.9 MB/s dec 15100000 bytes -> 768045 bytes (curr CRAMv3; rans4c -o1)
24.5 MB/s enc, 33.8 MB/s dec 15100000 bytes -> 676298 bytes (bsc -tTp -e2 -m0)
(Bsc included for comparison with best reasonable-time general purpose tool, but it's still an order of magnitude more CPU. It perhaps indicates adaptive order-2 or higher coding is worth doing though.)
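For what it's worth, the flag bits in that hacked "order" byte can be read off the -o193 parameter above (193 = 0xC1); the exact bit positions below are inferred from that, so treat them as an assumption rather than a spec:

```c
/* Hypothetical layout of the prototype rANS "order" byte:
 * 193 == 0xC1 == RANS_PACK | RANS_RLE | RANS_ORDER1. */
#define RANS_ORDER1 0x01  /* order-1 vs order-0 context modelling */
#define RANS_RLE    0x40  /* run-length transform applied first   */
#define RANS_PACK   0x80  /* bit-packing transform applied first  */
```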
RLE is obvious, although some implementations are better than others. My version is based on the Mespotine method, with a lookup table to indicate which symbols gain through RLE and which do not. I encode run-lengths and run-symbols as two distinct data sets, rANS compressed and concatenated. In this regard it's perhaps similar to BYTE_ARRAY_LEN, which has two encoders for lengths and bytes. So should RLE itself be a separate codec which can be combined with anything? That feels nice and generic, but may be slower (untested).
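A simplified sketch of that split-stream RLE transform (the rle_sym[] table stands in for the Mespotine-style analysis, and the byte-sized run lengths are just for brevity):

```c
#include <stddef.h>
#include <stdint.h>

/* Split input into a literal stream and a run-length stream.  Only
 * symbols flagged in rle_sym[] have their runs collapsed; everything
 * else is emitted verbatim.  Each stream would then be rANS compressed
 * separately and concatenated. */
void rle_split(const uint8_t *in, size_t n, const uint8_t rle_sym[256],
               uint8_t *lit, size_t *nlit, uint8_t *len, size_t *nlen) {
    size_t i = 0, l = 0, r = 0;
    while (i < n) {
        uint8_t c = in[i++];
        lit[l++] = c;
        if (rle_sym[c]) {
            size_t run = 0;          /* extra copies beyond the first */
            while (i < n && in[i] == c && run < 255) { i++; run++; }
            len[r++] = (uint8_t)run;
        }
    }
    *nlit = l;
    *nlen = r;
}
```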
Bit-packing is basically applying a map, eg ACGT -> values 0,1,2,3 and packing 4 symbols into a byte with 2 bits each. It handles 0, 1, 2, 4 or 8 bits per symbol, with 0 being a special case to indicate "all the same value".
Again bit-packing could be its own codec with a single encoder for the output - like BYTE_ARRAY_STOP has one part meta data and one part encoder.
Then we could nest them: eg BIT_PACK(4, {"A",0,"C",1,"G",2,"T",3}, RLE(EXTERNAL(10), EXTERNAL(11))), meaning map ACGT to 0123, pack nibbles together, run-length encode the result, and store run lengths in block 10 and run literals in block 11.
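A sketch of the packing step for the 2-bits-per-symbol case described above (bit order within the byte, and how the map and original length get stored, are left open here):

```c
#include <stddef.h>
#include <stdint.h>

/* Pack four 2-bit codes per output byte, eg ACGT mapped to 0..3.
 * The 0/1/4/8 bits-per-symbol cases follow the same pattern.
 * Returns the packed size; the unpacked length n must be stored
 * elsewhere so the decoder knows where to stop. */
size_t pack2(const uint8_t *in, size_t n, const uint8_t map[256], uint8_t *out) {
    size_t o = 0;
    for (size_t i = 0; i < n; i += 4) {
        uint8_t b = 0;
        for (size_t j = 0; j < 4 && i + j < n; j++)
            b |= (uint8_t)((map[in[i + j]] & 3) << (2 * j));
        out[o++] = b;
    }
    return o;
}
```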
Also of interest is a tweak to rans to indicate the N-way multiplexing of coders. Currently this is hard-coded at 4, but see the speed of implementations here which vary from 350Mb/s to 1400Mb/s decodes: https://github.com/jkbonfield/rans_static
A 32-way AVX2 accelerated rans can really shift, but it's inappropriate for most data sets as it has 8x as much overhead when flushing (128 bytes instead of 16, although there are smarter ways of flushing that may reduce it to maybe 70-80 bytes). We could just have multiple codecs with differing levels of unrolling, but more general is an N-way parameter with fixed allowable values (eg 1, 4 or 32 say) so we can pick the appropriate multiplexing for the size of the block being compressed.
We don't gain speed by massively unrolling small data blocks, but we do pay a price in overhead. Fundamentally the algorithm is the same in all cases and can be done in a single naive loop (see rNx16b.c), while still permitting smarter SIMD methods if we have appropriate hardware.
Adding to the ever growing idea list...
MAP function: converting listed integer values to codes 0, 1, 2, 3, ...
Eg for BF field we may currently define it as:
BF => EXTERNAL {15}
But an analysis shows just 23 values used in an example file:
$ samtools view ~/scratch/data/9827_2#49.cram|head -10000|awk '{print $2}'|sort|uniq -c|sort -nr
2418 163
2415 83
2361 99
2358 147
135 121
133 73
31 185
19 137
16 65
14 81
14 129
14 113
12 145
9 97
8 161
8 1187
8 1107
7 177
6 1171
6 1123
5 1161
2 1145
1 1097
Mapping those values to integers in descending frequency order permits better encoding. Currently HUFFMAN, BETA, GAMMA etc all write to CORE block, but could in theory bit-stream out to an external block instead. However we can just use EXTERNAL encoding once mapped and chuck it through a different entropy encoder. Eg in cram_dump output I'd expect to see something like:
BF => MAP{23, {163,83,99,147,121,73,185,137,65,81,129,113,145,97,161,1187,1107,177,1171,1123,1161,1145,1097}, EXTERNAL {15}}
The effect of this is to pass all the ITF8 encoding inefficiencies from the external block to the compression header. Every time we emit 163 via EXTERNAL codec we are writing out two ITF8 bytes (0x80 0xa3) instead of one byte (0x00 after map), similarly for 147 (0x80 0x93 -> 0x3). This is an immediate size reduction, which can then also be further reduced by entropy encoders. A quick demo:
$ perl -e '%x=(163=>0,83=>1,99=>2,147=>3,121=>4,73=>5,185=>6,137=>7,65=>8,81=>9,129=>10,113=>11,145=>12,97=>13,161=>14,1187=>15,1107=>16,177=>17,1171=>18,1123=>19,1161=>20,1145=>21,1097=>22); while (<>) {chomp($_);print chr($x{$_})}' _BF_ |entropy16
len = 10000 bytes, entropy16 = 2867.470884, 2.293977 bits per byte
len = 10000 bytes, entropy8 = 2890.973451, 2.312779 bits per byte
len = 10000 bytes, entropy8o1 = 2861.055669, 2.288845 bits per byte
$ perl -e 'while (<>) {if ($_ >= 128) {print chr(0x80+($_>>8)),chr($_&0xff)} else {print chr($_)}}' _BF_ | entropy16
len = 14903 bytes, entropy16 = 3716.031005, 1.994783 bits per byte
len = 14903 bytes, entropy8 = 4592.193830, 2.465111 bits per byte
len = 14903 bytes, entropy8o1 = 2879.035860, 1.545480 bits per byte
In practice it's easy to achieve good entropy encoding for order-0 after mapping (we get 2946) and it's hard to overcome the order-1 frequency table cost when trying order-1 encoding of the itf8 values (we get 3224). So this tiny one-block demo is an 8.6% saving on BF values, although they're a tiny portion of the overall file size. It'd work on CF too, and maybe some auxiliary tags, but probably little else. Therefore I don't have strong hopes for this one, but I'm documenting it anyway just in case.
My thought is this also makes it easier to implement basic algorithms that work on byte streams at the external block level (eg RLE) that don't need to be more complex in order to handle streams of integers of unknown sizes. We can't tell if this is likely to be significant until it's tried out in practice.
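For concreteness, a minimal sketch of the decode side of the proposed MAP codec (the struct and function names are illustrative only):

```c
#include <stdint.h>

/* The compression header would carry the value table (in descending
 * frequency order); the external block then only holds small dense
 * codes which the entropy encoder handles well. */
typedef struct {
    int32_t  nvals;
    int32_t *vals;   /* eg {163, 83, 99, 147, ...} for the BF example above */
} cram_map_codec;

static inline int32_t map_decode(const cram_map_codec *m, uint8_t code) {
    return code < m->nvals ? m->vals[code] : -1;   /* -1: corrupt stream */
}
```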
One more thought - ITF8 sucks for signed values! Eg reference ID is -2 (multi-ref via RI data series), -1 (unmapped) or >= 0 (reference N). -1 has to be stored as 0xff 0xff 0xff 0xff 0x0f. The final 0x0f ruins RLE encoding too (maybe that's why the original Java code set the unused bits to 1 instead of 0?).
A better solution, for the data series that have signed values, is to have an SITF8 encoding where the sign bit is moved from the top bit to the bottom bit, or zig-zag encoding (eg (n << 1) ^ (n >> 31) encode and (n >> 1) ^ -(n & 1) decode) to avoid having both +0 and -0 possibilities. This makes encoding of small negative numbers as efficient as encoding small positive numbers, instead of making all negatives take up 5 bytes. Note it's still beneficial to use the unsigned variant for something which can only be unsigned (eg number of records, length of span, etc).
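A sketch of the zig-zag mapping using the formulas above (assuming the usual arithmetic right shift for signed values):

```c
#include <stdint.h>

/* Zig-zag: 0,-1,1,-2,2,... -> 0,1,2,3,4,..., so small negative values
 * become small unsigned values before the variable-length encoding. */
static inline uint32_t zigzag32_encode(int32_t n) {
    return ((uint32_t)n << 1) ^ (uint32_t)(n >> 31);
}

static inline int32_t zigzag32_decode(uint32_t z) {
    return (int32_t)(z >> 1) ^ -(int32_t)(z & 1);
}

/* Eg reference ID -1 (unmapped) encodes as 1 and -2 (multi-ref) as 3,
 * each a single byte in a 7-bit varint rather than five bytes of ITF8. */
```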
The v3 spec reserves 20 bytes in the header for a cramID, and suggests populating it with the filename as one possibility. The htsjdk implementation places the full pathname there (well, the first 20 bytes...). This causes the md5 of the file to vary depending on its location when it was created. The htsjdk implementation can be changed, but we might want to also change the spec to be more prescriptive about what goes in here.
I think the purpose was sha1 digest of whatever, like the path.
Vadim
Sorry for the bump. Other random ideas that came up recently.
- Reference compression in sam_comp builds a model per locus and compares each read against this model instead of against the reference. This is dynamic and progresses as each successive read is added. Each locus has probability of A, C, G, T and *.
- Deez uses a similar idea but as a static model rather than as an adaptive model and with one call only (the "consensus"). Reads use the consensus to delta against and then the consensus is delta vs actual reference. Embedded ref in CRAM can already be used to do the former part, but not the latter part.
Merging both of these existing ideas could mean having multiple embedded references per slice representing observed haplotypes, and diffing reads against their closest haplotype. It's not a full statistical model, but it does have the benefit of handling correlations. We'd need to use one of the standard and well tested clustering / compression approaches for describing how the multiple haplotypes compare against one another, with one canonical one (or their consensus) then compared vs the reference.
This also has parallels with handling graph genomes, as that is simply a different approach for haplotype description. Reads are diffed against a graph node and graph nodes themselves are anchored on the reference.
SlimGene uses reference based compression and, like CRAM, stores successive deltas for the position of each edit in the sequence. Unlike CRAM the positions are stored in order from 3' to 5' under the assumption that most short read data is high quality at the start and errors cluster towards the end. Ie the magnitude of the first_edit_pos is usually larger than the magnitude of (len - last_edit_pos).
I am unsure of how much difference this actually makes, but in a CRAM4 prototype on NovaSeq data the FP field consumes 6.3% of the total storage cost. Needs experimentation!
NM / MD / RG tags are generated on-the-fly with CRAM decode. RG is via a numerical data series, and NM / MD generated via the features describing the differences vs the reference. This changes the order of tags (not defined anyway), but also for NM / MD the very notion of whether the tag is present is lost. That is, we either generate all of them or none of them, but cannot record which we started with and we have to punt this decision to the user at decode time.
The TL (tag list) data series is a lookup into the TD (tag dictionary) to indicate which list of tags were present and in which BAM encoding format, so SAZ, X0c, etc. We could, and for referenceless encoding do, add NMi (likely NMc as it'll be a small integer) and MDZ to that list and store verbatim. Scramble has a mode to do this already.
Proposal: store NM*, MD*, RG* in that list, with the "*" type being interpreted as "generate me". This is an indication that the tag is present, and where it is in the list if we really care about that, but also that the data is not stored, so it doesn't consume space.
This then permits us to know which records have NM and which do not, and to generate NM if and only if the original SAM/BAM file had NM too. We're putting things back to how they should be - generation of NM/MD becomes an explicit step (eg a "samtools calmd" command) and not an implicit part of the file format.
An aside: this also solves the issue of using a non-reference embedded ref, eg a consensus. We know which reads had MD and thus need it regenerated, and consequently which reads need MD stored verbatim (where the consensus differs from the reference) and which can be generated on the fly (where the consensus matches the reference), without paying the penalty of always doing this (when no MD was in the original input files).
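A small sketch of how a decoder might treat the proposed "*" type when walking a TD entry (the 2-byte tag name plus 1-byte type layout follows the existing TD dictionary; the rest is illustrative only):

```c
#include <stdbool.h>

/* For a normal BAM type byte the tag value is stored in a data series
 * as now; a '*' type would mean "the tag was present in the input, so
 * regenerate its value at decode time". */
static bool tag_value_is_stored(const unsigned char entry[3]) {
    return entry[2] != '*';
}

/* Eg a TD list containing NMi MDZ stores both tag values verbatim,
 * whereas NM* MD* records only that NM and MD were present and should
 * be rebuilt from the reference-based features. */
```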
Unlike CRAM the positions are stored in order from 3' to 5' under the assumption that most short read data is high quality at the start and errors cluster towards the end. ... I am unsure of how much difference this actually makes, but in a CRAM4 prototype on NovaSeq data the FP field consumes 6.3% of the total storage cost. Needs experimentation!
For reference, it's around a 4-5% saving on this field, which amounted to around a 0.7% saving overall for the NovaSeq test file. Better than a kick in the teeth, but quite a lot of change for only a minimal gain. Decoding is hard, so for testing I chose the route of storing all FP values in a loop rather than interleaved with everything else. This is a major departure from the rest of CRAM so it would want a better implementation before putting it into use.
One more thought - ITF8 sucks for signed values! Eg reference ID is -2 (multi-ref via RI data series), -1 (unmapped) or >= 0 (reference N). -1 has to be stored as 0xff 0xff 0xff 0xff 0x0f. The final 0x0f ruins RLE encoding too (maybe that's why the original Java code set the unused bits to 1 instead of 0?).
Naturally negative values are problematic, but ITF8 itself isn't great for compression. A more uniform 7-bit encoding (top bit => more to come) is actually better for compression ratio and also extendable to any integer size. We no longer need differences between int and long either.
That said, if the values are all within 0-255 but not between 0-127 then it's better to use 1 byte and avoid the extra ITF8 or 7-bit encoding. Maybe the encoding needs to be a separate field. (We can use GAMMA, BETA, SUBEXP, etc to external instead of CORE if we wish, and just define BETA as 8-bit to store whole bytes.)
An example (using an adaptive order-0 entropy encoder):
FP values ITF8 => 18434
FP values 7-bit => 18123
FP values 8-bit => 17817
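A minimal sketch of the uniform 7-bit continuation encoding being compared here (least-significant group first is an arbitrary choice, not a spec):

```c
#include <stddef.h>
#include <stdint.h>

/* Low 7 bits of data per byte, top bit set when more bytes follow.
 * Unlike ITF8/LTF8 this extends to any integer width. */
static size_t uint7_put(uint64_t v, uint8_t *out) {
    size_t n = 0;
    do {
        uint8_t b = v & 0x7f;
        v >>= 7;
        out[n++] = b | (v ? 0x80 : 0);
    } while (v);
    return n;
}

static size_t uint7_get(const uint8_t *in, uint64_t *v) {
    uint64_t val = 0;
    int shift = 0;
    size_t n = 0;
    uint8_t b;
    do {
        b = in[n++];
        val |= (uint64_t)(b & 0x7f) << shift;
        shift += 7;
    } while (b & 0x80);
    *v = val;
    return n;
}
```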
I have a question about the record counter. I know the record counter can be used to generate read names for reads that don't have a read name stored. But the record counter also introduces a strict dependency between different containers in the stream. In a distributed computing environment, if a user needs to merge multiple CRAM files, they will need to update the record counter in the container/slice headers of the merged CRAM output. In BAM format, the merge step can be done by simply concatenating the data blocks sequentially. In other words, the global record counter is not friendly for distributed computing.
@liuxf09 not sure if I'm correct but I think the record counter is only guaranteed to be unique within a slice(?)
For creating a globally unique id in the file, I have used the offset of the slice in the file plus the record counter. Again, maybe I misinterpret it though.
Record counter is meant to be global. In theory it should mean that doing a range query, and doing a full decode with subsequent filtering to the same range/region, would give the same names.
@liuxf09 You're correct that it doesn't play well with distributed computing, but there is nothing that specifically requires the counter to be sequential (I think). Eg if we're splitting into 100 tasks each processing one 10k slice, then we could just increment the global counter by 1 million (100*10k) and each distributed process could use counter + 10000 * index, where "index" is the task number, assuming a child task knows it is, say, the 37th of 100.
The spec wording may need fixing to permit this though.
Edit: alternatively, hack the least significant digits. Eg when concatenating 100 files in parallel, take all slices, multiply the counter by 100 and add on the file number. That keeps the counter unique. I admit it's tricky though, and format specific. We need to think about these sorts of things directly in the primary tools.
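To make the two schemes above concrete, a tiny sketch (purely illustrative; none of this is in the spec):

```c
#include <stdint.h>

/* Scheme 1: pre-reserve a contiguous counter range per distributed task. */
static int64_t task_base_counter(int64_t global_counter,
                                 int64_t reads_per_task,
                                 int64_t task_index) {
    return global_counter + reads_per_task * task_index;
}

/* Scheme 2 (the edit): when concatenating nfiles outputs, rescale each
 * file's existing counters so they remain globally unique. */
static int64_t merged_counter(int64_t file_counter,
                              int64_t nfiles,
                              int64_t file_index) {
    return file_counter * nfiles + file_index;
}
```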
Large blocks and large containers
Either we must make the header able to span a series of blocks in a series of containers, or make blocks able to be larger than 2Gb. Right now a block is defined to have its size in ITF8, which is 2Gb max. Similarly for the landmarks in the container struct. We're starting to find users with headers in excess of 2Gb! Note these also can't be represented in BAM, as it has an int32_t for the header length (also signed), but there they can get away without fluffing up the header with M5 tags, which staves off the inevitable for a little bit longer.
If we want to cope with ridiculously fragmented assemblies, we need a better way of storing headers (eg array based and not textual formatting) and/or better types than itf8.
Universally we should cull itf8 (and ltf8) and go with "int7" - a zig/zag type approach to variable sized integers that doesn't have any defined size limit other than what the implementer chooses. This also fixes the "ITF8 sucks for signed values" comment above too.
Very late to the party here. I just read the CRAMv4 document because I was interested if ITF-8 would be removed in the future. I am very relieved to see it is!
My two cents would be to add an AUX order block recording the order of the tags, with a special case for when all the tags are always in the same order (the most common case). This would make it easier to get a verbatim reproduction of the SAM file. I see CRAMv4 has taken similar steps to ensure reproducibility (MD and NM recording etc.), so it would not be out of place in this new spec.
In the same light I would add some flag if =X are used rather than M in the CIGAR strings.
Having the BAM -> CRAM -> BAM conversion only have differences in the PG headers and not in the reads has some merits for using CRAM as an archival format.
Agreed on wanting better round-trips, although note if you use Picard say then your tags can be juggled in ordering and sometimes even type, but no one complains about that as far as I know.
Aux order is already in. CRAM's TL line (tag list) indicates the list of aux tags to emit, so it already stores the order. If a file has tags AAZ, BBi, CCZ or AAZ, CCZ, BBi then you can generate two TL records and have TL 0 or 1 to select the ordering. Where it failed is with MD/NM/RG which were auto-generated and always added to the end. We can add MDZ and NMi into it, which we do when we need to explicitly encode those tags because they differ to the auto-generated data, but I added MD* and NM* also which are basically placeholders indicating where in the tag list they should go without specifying the explicit data.
=/X in CIGAR is another issue. In theory we could mix them with M, although I doubt anyone does, so it's perhaps best done as a different feature type rather than a binary flag.
Unfortunately there's no time frame on CRAMv4 so I stopped working on it some time ago. The intention was to produce a new CRAM with new codecs too, but it was so hard to get any traction that I migrated those to CRAM v3.1, as the codecs could potentially just be a minimal codec library inclusion. However that also turned out to take many years to get traction. I'm somewhat burnt out on beating my head against brick walls at the moment, unless I see external requests for CRAM4 coming from the other implementations or from a significant portion of our users. (Htslib does have a CRAM4 draft, but I don't know if it's ahead of or behind the draft spec and it's been some time since I looked at it.)
Aux order is already in. CRAM's TL line (tag list) indicates the list of aux tags to emit, so it already stores the order.
I didn't know that. Sorry, I should have read the spec better.
=/X in CIGAR is another issue. In theory we could mix them with M, although I doubt anyone does, so it's perhaps best done as a different feature type rather than a binary flag.
I just realized this does not even need to be in CRAMv4. It can also be a decoding option in samtools view. Using the reference and the sequence information it should be possible to convert all the M ops into = or X ops on the fly. That should help the round-trip checksummability as well and it will work with older CRAM versions just fine. Currently we are mostly on Illumina data, but if we switch to one of the long-read machines that would definitely be something I would make a PR for.
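A sketch of the on-the-fly M to =/X rewrite for a single M run, given the read and reference bases over the aligned span (emit_op is a hypothetical callback; a real implementation would also merge ops across the whole CIGAR):

```c
#include <stdint.h>

typedef void (*emit_op_fn)(char op, uint32_t len);

/* Compare read vs reference across an M run of length mlen and emit
 * maximal runs of '=' (match) and 'X' (mismatch) instead. */
static void split_M(const char *read, const char *ref, uint32_t mlen,
                    emit_op_fn emit_op) {
    uint32_t i = 0;
    while (i < mlen) {
        char op = (read[i] == ref[i]) ? '=' : 'X';
        uint32_t len = 1;
        while (i + len < mlen &&
               ((read[i + len] == ref[i + len]) ? '=' : 'X') == op)
            len++;
        emit_op(op, len);
        i += len;
    }
}
```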
Unfortunately there's no time frame on CRAMv4 so I stopped working on it some time ago. [...] I'm somewhat burnt out on beating my head against brick walls at the moment, [...]
I can imagine. I am also running into missing CRAM support at the moment. My feeling is, though, that since WGS is becoming more and more common in diagnostics, demand for CRAM is going to increase in the near future. Also, despite it not being in use right now, I find these improvements and the ways they are achieved very interesting. I learned about variable length quantity encoding and that it was devised for one of my favourite stream formats (MIDI). I also learned about zig-zag encoding. I really appreciate the fact that this is all documented and that CRAMv4 is not some research project that got shelved never to be heard from again. Then I wouldn't have learned these things! I doubt that I am the only one with this experience.
Decoding as = and X is doable in the code, or at least was as it hasn't been tested in years, but it's not a run-time option. See the USE_X define in cram_decode.c. It's not a panacea though and mixed data sets need a flag somewhere to indicate what to do. I did think of adding an extra htslib specific auxiliary tag to store additional meta-data about an alignment, so we can use a side-channel to get better round-tripping in CRAM (albeit only via htslib), but haven't had the time or impetus to work on this.
The one thing that may perhaps make CRAMv4 a reality is if people really start demanding support for long chromosomes. This isn't possible in CRAM3, nor in BAM. Technically not in SAM either, but htslib does support long chromosomes in SAM as the data is textual and any length limitation is arbitrarily specified due to the desire for BAM round-trips to work. I did think this was starting to become an issue for the Tree of Life and Earth BioGenome Project, however we then got the message back that a 64-bit CRAM wouldn't help as all their downstream processing tools are also only 32-bit, so they now just split the chromosomes up instead. :/
I tried to get a canonical way of making chromosome splitting a rigorous well defined thing too, but community feedback essentially dried up on that too. We get one comment per year it seems. (We can certainly work on this if there's more than one person in the world at any time who needs it.) https://github.com/samtools/hts-specs/issues/655
Decoding as = and X is doable in the code, or at least was as it hasn't been tested in years, but it's not a run-time option. See the USE_X define in cram_decode.c. It's not a panacea though and mixed data sets need a flag somewhere to indicate what to do. I did think of adding an extra htslib specific auxiliary tag to store additional meta-data about an alignment, so we can use a side-channel to get better round-tripping in CRAM (albeit only via htslib), but haven't had the time or impetus to work on this.
Mixed datasets are a very specific and quite rare occurrence. Just having an output-fmt-option pedantic-cigar (or something more aptly named) that does the =X thing should be enough for round-trippability, if the person who handles the data is aware and capable of reading the manual. It shouldn't be too invasive to implement. As soon as we need to archive ONT/PacBio data for long term storage I think I will implement that feature.
The one thing that may perhaps make CRAMv4 a reality is if people really start demanding support for long chromosomes. [...]
There are chromosomes that are cut in 6 parts now. It is getting a bit ridiculous. But I suppose there would also need to be a BAMv2 at this point, which also upgrades the 16-bit n_cigar_op to 64-bit. Or should CRAM and SAM be the only binary/textual formats used? BAM seems to be more optimized for streaming records than CRAM and is a lot easier to parse without relying on htslib.