
Diff'ing before shuffling: even better "compression" for non-normal data?

Open arnehilmann opened this issue 11 years ago • 2 comments

Just an idea: shuffling is efficient when it yields long runs of identical values. For non-normally distributed data (e.g. text, images, ...), taking the difference between consecutive values before shuffling might produce smaller values, increasing the chance of long runs of zeros afterwards.
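A minimal sketch of the idea, assuming unsigned 32-bit little-endian integers and a plain byte shuffle. The function names `delta_encode`, `delta_decode`, and `byte_shuffle` are hypothetical helpers for illustration only and are not part of the Blosc API:

```python
import struct

def delta_encode(values, bits=32):
    """Replace each value by its difference from the previous one, modulo 2**bits."""
    mask = (1 << bits) - 1
    prev = 0
    out = []
    for v in values:
        out.append((v - prev) & mask)  # wraps around instead of going negative
        prev = v
    return out

def delta_decode(deltas, bits=32):
    """Invert delta_encode with a running sum, modulo 2**bits."""
    mask = (1 << bits) - 1
    acc = 0
    out = []
    for d in deltas:
        acc = (acc + d) & mask
        out.append(acc)
    return out

def byte_shuffle(values, itemsize=4):
    """Group byte 0 of every element, then byte 1, and so on (uint32 input assumed)."""
    raw = struct.pack("<%dI" % len(values), *values)
    return b"".join(raw[i::itemsize] for i in range(itemsize))

# Slowly growing values: after diff'ing, only the low byte of each element varies.
values = [1000, 1003, 1007, 1012, 1018]
deltas = delta_encode(values)
print(byte_shuffle(values).hex())
print(byte_shuffle(deltas).hex())
```

Because the differences are taken modulo 2**32, the transform is exactly invertible for integers, so the compressor would only need to run `delta_decode` after unshuffling.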

arnehilmann avatar Jul 26 '14 19:07 arnehilmann

Yes, that is a nice idea. It will probably only work with integers, as this can change the precision in floating point, but it is worth exploring. We still have room for at least four different pre-conditioners in Blosc, and what you are suggesting may be a good candidate. Would you like to open a PR?

FrancescAlted avatar Jul 28 '14 12:07 FrancescAlted

In fact, in many cases diff'ing a series of IEEE 754 floats as integers is nearly as efficient as calculating the precise floating-point differences. This works if the typical delta is smaller than the typical magnitude (so the exponent rarely changes), e.g. for measurements of temperature in K, masses, and similar non-negative quantities.

In my experiments, this + bitshuffle did provide better compression compared to plain bitshuffle, and was quite fast.
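A rough sketch of diff'ing floats as integers, assuming IEEE 754 doubles reinterpreted as unsigned 64-bit integers. The helper names are hypothetical, and this is not the code used in the experiments mentioned above. Since the subtraction happens on the raw bit patterns modulo 2**64, the round trip is exact and no floating-point precision is lost:

```python
import struct

def float_bits(x):
    """Reinterpret an IEEE 754 double as an unsigned 64-bit integer."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def bits_float(b):
    """Reinterpret an unsigned 64-bit integer as an IEEE 754 double."""
    return struct.unpack("<d", struct.pack("<Q", b))[0]

def diff_as_int(samples):
    """Delta-encode the bit patterns of the samples, modulo 2**64."""
    mask = (1 << 64) - 1
    prev = 0
    out = []
    for x in samples:
        b = float_bits(x)
        out.append((b - prev) & mask)
        prev = b
    return out

def undiff_as_int(deltas):
    """Invert diff_as_int: running sum of bit patterns, then back to doubles."""
    mask = (1 << 64) - 1
    acc = 0
    out = []
    for d in deltas:
        acc = (acc + d) & mask
        out.append(bits_float(acc))
    return out

# Temperatures in K: the exponent stays the same, so only low mantissa bits differ.
samples = [293.15, 293.16, 293.18, 293.17]
deltas = diff_as_int(samples)
print([hex(d) for d in deltas])
print(undiff_as_int(deltas) == samples)
```

A decreasing sample wraps around to a value with the high bits set, so after a bitshuffle the upper bit planes are mostly all-zero or all-one rather than strictly zero.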

aparamon avatar May 06 '18 17:05 aparamon