
BWT filter

Open froody opened this issue 10 months ago • 3 comments

I'd like to add the Burrows Wheeler Transform as a filter in blosc, but for arbitrary byte sequences it requires adding one additional token (or an offset into the sequence) and there doesn't seem to be a way to do this in blosc. Am I missing something? Can I store per-chunk metadata that's available at decompression time?
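To illustrate why the extra value is needed, here is a minimal sketch of a BWT over arbitrary bytes without a sentinel symbol. It is a naive, illustrative implementation in plain Python, not Blosc2 code; the names `bwt_forward` and `bwt_inverse` are hypothetical. The forward pass returns the transformed bytes plus a "primary index" (the row of the original string among the sorted rotations), and the inverse cannot recover the input without it.

```python
from typing import Tuple


def bwt_forward(data: bytes) -> Tuple[bytes, int]:
    """Return (last column, primary index) for the sorted rotations of data."""
    n = len(data)
    if n == 0:
        return b"", 0
    doubled = data + data  # lets us slice any rotation as doubled[i:i + n]
    # Naive rotation sort; fine for a sketch, too slow for real chunks.
    order = sorted(range(n), key=lambda i: doubled[i:i + n])
    last = bytes(doubled[i + n - 1] for i in order)
    primary = order.index(0)  # row holding the unrotated input
    return last, primary


def bwt_inverse(last: bytes, primary: int) -> bytes:
    """Invert the transform using the LF mapping, starting at the primary row."""
    n = len(last)
    if n == 0:
        return b""
    # starts[v] = first row whose first-column byte equals v.
    counts = [0] * 256
    for b in last:
        counts[b] += 1
    starts, total = [0] * 256, 0
    for v in range(256):
        starts[v] = total
        total += counts[v]
    # lf[i] = row of the rotation that begins with the byte last[i].
    lf, seen = [0] * n, [0] * 256
    for i, b in enumerate(last):
        lf[i] = starts[b] + seen[b]
        seen[b] += 1
    # Walk backwards from the primary row, emitting bytes right to left.
    out = bytearray(n)
    row = primary
    for k in range(n - 1, -1, -1):
        out[k] = last[row]
        row = lf[row]
    return bytes(out)
```

The output has the same length as the input, so the primary index has to travel somewhere outside the transformed buffer, which is the per-chunk metadata I'm asking about.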

froody, Feb 10 '25 08:02

In general, Blosc2 allows only 1 byte per user-defined codec to be stored in the header of the compressed chunk; see udcodec and compcode_meta. So, if the metadata for the chunk fits in 1 byte, I think this is the way to go.

If one byte per chunk is not enough, there are several ways to proceed, but perhaps adding a metalayer for the Blosc2 frame is your best bet.

FrancescAlted, Feb 10 '25 13:02

Hmm, how would I encode a per-chunk value in vlmeta? I think I would need to either pre-allocate a large enough buffer before compression and then index into it using nchunk, or create a new vlmeta key per chunk. Is the vlmeta API thread-safe, i.e. can it be called from within a filter's forward/backward functions?
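A rough sketch of the first option (one pre-allocated buffer indexed by nchunk), in plain Python and deliberately not calling any Blosc2 API. The fixed 4-byte little-endian slot width and the helper names `make_index_table`, `store_index` and `load_index` are assumptions made for illustration only.

```python
import struct

# Assumption: one unsigned 32-bit little-endian slot per chunk is wide enough
# to hold a BWT primary index.
ENTRY = struct.Struct("<I")


def make_index_table(nchunks: int) -> bytearray:
    """Pre-allocate a zeroed table with one fixed-size slot per chunk."""
    return bytearray(ENTRY.size * nchunks)


def store_index(table: bytearray, nchunk: int, primary: int) -> None:
    """Write the value for chunk `nchunk` into its own slot."""
    ENTRY.pack_into(table, nchunk * ENTRY.size, primary)


def load_index(table: bytes, nchunk: int) -> int:
    """Read back the value for chunk `nchunk`."""
    return ENTRY.unpack_from(table, nchunk * ENTRY.size)[0]


table = make_index_table(4)
store_index(table, 2, 123456)
print(load_index(table, 2))
```

Each chunk owns a disjoint slot, so the whole table could presumably be written as a single vlmeta value once compression has finished. Whether filling the slots from inside the filter callbacks is safe is exactly the open question.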

froody, Feb 10 '25 22:02

Uh, I am afraid that we did not try to make the vlmeta API thread-safe, so assume that it is not.

Also, Zstandard is a very nice codec, and with its compression levels (clevels) it can cover a lot of ground, from quick with moderate compression ratios to slow with very good ones. Do you think that BWT could beat Zstandard in some scenario? Just curious.

FrancescAlted, Feb 12 '25 09:02