Very bad compression on short inputs (1-127 bytes long)
I wonder whether some heuristics could be employed to drastically decrease the minimum output size blosc produces for short inputs.
If I want to use blosc for things like 64-bit numbers or short strings such as "a" or "my string", I always end up with output much larger than the input.
I need to store the data in a DB as separate items, and there may be lots of such small items at some point. According to my measurements this drastically affects DB performance (by one or more orders of magnitude).
>>> import blosc as b
>>> import pickle
>>> b.compress( b'a' ).__len__()
17
>>> pickle.dumps( b'a' ).__len__()
16
>>> b.compress( bytes( str( 1 ), 'ascii' ) ).__len__()
17
>>> pickle.dumps( 1 ).__len__()
5
>>> len( b.compress( b'a' * 127 ) )
143
>>> len( b.compress( b'a' * 128 ) )
35
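For context, the constant part of the overhead looks like Blosc's fixed 16-byte header: 1 byte in gives 17 bytes out and 127 bytes in gives 143 bytes out, i.e. 16 + len(data) in both cases, while at 128 bytes the codec actually starts compressing and the output drops to 35 bytes. A tiny probe (plain python-blosc, default parameters) that shows the pattern:

import blosc

# Probe how the output size grows with the input size for tiny buffers.
# From the transcript above: 1 -> 17 and 127 -> 143 (a constant 16-byte
# overhead), then 128 -> 35 once compression actually kicks in.
for n in (1, 127, 128, 1024):
    data = b'a' * n
    print(n, '->', len(blosc.compress(data)))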
Any non-boxing scheme I came up with as a workaround is flawed, so it seems I'll need to get my hands dirty with memoryview() and do some custom boxing (see the sketch below), unless you have better ideas on how to approach this issue.
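For reference, the kind of custom boxing I have in mind is roughly the following sketch (pack/unpack are made-up names; the scheme spends one tag byte per item and falls back to storing the payload verbatim whenever blosc does not actually shrink it):

import blosc

RAW = b'\x00'     # tag: payload stored verbatim
PACKED = b'\x01'  # tag: payload is blosc-compressed

def pack(data: bytes) -> bytes:
    # Compress only when it pays off; otherwise store the bytes as-is,
    # so the worst case is len(data) + 1 instead of len(data) + 16.
    compressed = blosc.compress(data)
    if len(compressed) < len(data):
        return PACKED + compressed
    return RAW + data

def unpack(blob: bytes) -> bytes:
    if blob[:1] == PACKED:
        return blosc.decompress(blob[1:])
    return blob[1:]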
Any ideas? Does blosc2 do much better on this front? (I'm assuming the Python bindings, which I haven't tried yet.)
Yeah, this kind of wild variation in compression ratio for small buffers is expected. Blosc is really aimed at compressing large datasets, so optimizing for such small buffers is a very low priority. Blosc2 is even more slanted towards large data, so the same applies there.
Ah, thanks. That clarifies it a lot.
Any concrete suggestions on how to approach compression of short inputs?
Sorry, but no ideas. You will have to do your own research.
OK, thanks anyway. The most important thing for me is knowing that Blosc2 does not plan to tackle this type of issue.