BWT filter
I'd like to add the Burrows-Wheeler Transform (BWT) as a filter in Blosc, but for arbitrary byte sequences it requires one additional value (an end-of-sequence token, or an offset into the sequence), and there doesn't seem to be a way to store this in Blosc. Am I missing something? Can I store per-chunk metadata that's available at decompression time?
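To make the extra value concrete, here is a minimal sketch of a BWT over arbitrary bytes that returns the offset (the "primary index") alongside the transformed buffer. It is a naive, illustrative implementation and not Blosc code; the function names `bwt_forward` and `bwt_inverse` are made up for this example, and the rotation sort is far too slow for real block sizes.

```python
def bwt_forward(data: bytes) -> tuple:
    """Return (last_column, primary_index) for the sorted rotations of data."""
    n = len(data)
    if n == 0:
        return b"", 0
    doubled = data + data
    # Naive rotation sort: fine for a demo, too slow for real blocks.
    order = sorted(range(n), key=lambda i: doubled[i:i + n])
    last = bytes(doubled[i + n - 1] for i in order)
    # Row of the sorted matrix that holds the original, unrotated input.
    primary = order.index(0)
    return last, primary


def bwt_inverse(last: bytes, primary: int) -> bytes:
    """Rebuild the input from the last column plus the primary index."""
    n = len(last)
    if n == 0:
        return b""
    # starts[v] = first row of the sorted matrix that begins with byte v.
    counts = [0] * 256
    for b in last:
        counts[b] += 1
    starts, total = [0] * 256, 0
    for v in range(256):
        starts[v] = total
        total += counts[v]
    # lf[i] = row reached by rotating row i one position to the right.
    lf, seen = [0] * n, [0] * 256
    for i, b in enumerate(last):
        lf[i] = starts[b] + seen[b]
        seen[b] += 1
    # Walk backwards from the primary row, emitting the input from its end.
    out = bytearray(n)
    row = primary
    for k in range(n - 1, -1, -1):
        out[k] = last[row]
        row = lf[row]
    return bytes(out)
```

Without `primary`, the inverse only recovers some rotation of the input, which is why the value has to travel with the compressed data. For a buffer of n bytes it ranges over 0..n-1, so it only fits in a single byte when n is at most 256.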
In general, Blosc2 only allows 1 byte per user-defined codec to be stored in the header of the compressed chunk; see udcodec and compcode_meta. So, if the metadata for the chunk fits in 1 byte, I think this is the way to go.
If one byte per chunk is not enough, there are several ways to proceed, but perhaps adding a metalayer for the Blosc2 frame is your best bet.
Hmm, how would I encode a per-chunk value in vlmeta? I think I would need to either pre-allocate a large enough buffer before compression and then index into it using nchunk, or create a new vlmeta key per chunk. Is the vlmeta API threadsafe, i.e. can it be called from within a filter's forward/backward functions?
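The pre-allocated buffer option could look roughly like the sketch below. It is plain Python and makes no Blosc2 calls; the helper names are hypothetical, and it assumes the number of chunks is known up front and that a 32-bit value per chunk is enough. The idea is that each chunk only ever touches its own fixed-size slot, and the finished table would be stored as a single metadata value from one thread once compression is done, which sidesteps the question of calling the vlmeta API from worker threads.

```python
import struct

# Assumption: one little-endian uint32 per chunk is enough for the BWT offset.
SLOT = struct.Struct("<I")


def make_index_table(nchunks: int) -> bytearray:
    """Pre-allocate one zeroed slot per chunk."""
    return bytearray(SLOT.size * nchunks)


def store_index(table: bytearray, nchunk: int, primary: int) -> None:
    """Write the per-chunk value into the slot addressed by nchunk."""
    SLOT.pack_into(table, nchunk * SLOT.size, primary)


def load_index(table: bytes, nchunk: int) -> int:
    """Read the per-chunk value back at decompression time."""
    return SLOT.unpack_from(table, nchunk * SLOT.size)[0]
```

For example, `store_index(table, 2, 123456)` followed by `load_index(table, 2)` gives back `123456`, and `bytes(table)` is the blob that would be saved under one metadata key.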
Uh, I am afraid that we did not try to make the vlmeta API threadsafe, so assume that it is not.
Also, Zstandard is a very nice codec, and across its compression levels it can cover a lot of ground (from quick with moderate compression ratios to slow but with very good compression ratios); do you think that BWT could beat Zstandard in some scenario? Just curious.