humanhash
Revised compress method
I made two changes to the "compress" method:
- it will return fewer than the target number of bytes if it is given a digest that is already smaller than the target size (instead of raising an error)
- it spreads the modulo (remainder) bytes around rather than dumping them all into the final byte, as sketched below (I think this might preserve some entropy, no?)
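For concreteness, here is a minimal sketch of the idea (the names and exact slicing are illustrative rather than the code in this PR; it assumes the same XOR folding that the existing compress uses):

```python
from functools import reduce
from operator import xor

def compress_new(bytes_, target):
    length = len(bytes_)

    # Change 1: a digest that is already at or below the target size
    # is returned as-is instead of raising an error.
    if target >= length:
        return list(bytes_)

    # Change 2: spread the remainder. The first (length % target)
    # segments each take one extra input byte, so no single output
    # byte absorbs all of the left-over input.
    seg_size, remainder = divmod(length, target)
    segments, pos = [], 0
    for i in range(target):
        size = seg_size + (1 if i < remainder else 0)
        segments.append(bytes_[pos:pos + size])
        pos += size

    # Fold each segment into one output value with XOR, as before.
    return [reduce(xor, seg, 0) for seg in segments]
```

With 7 input bytes and a target of 4, the segment sizes become 2, 2, 2, 1 instead of 1, 1, 1, 4.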
I'm maintaining a Python 3 fork of humanhash on GitHub and PyPI.
Can you add some comments to the code to explain what this is doing? And why it's better than the existing compress method? Sorry to dig this up from four years ago...
Happy to resurrect this! I've added comments to the compress method.
Why is this better? The old method split the bytes into the target number of segments and, after the even division, placed all remainder bytes into the final segment. This meant that the effect of the remainder bytes on overall entropy was confined to the final output byte. In the new method, the remainder bytes are drawn from throughout the input and distributed evenly among the target segments, allowing them to express more entropy. The compression per input byte is also more even, since the number of input bytes per output byte differs by at most 1.
For example:
```python
compress_old([123, 456, 789, 147], 4)
# -> [123, 456, 789, 147]
compress_old([123, 456, 789, 147, 258, 369, 321], 4)
# -> [123, 456, 789, 417] (only the last byte has changed)

compress_new([123, 456, 789, 147], 4)
# -> [123, 456, 789, 147]
compress_new([123, 456, 789, 147, 258, 369, 321], 4)
# -> [435, 902, 115, 321] (all 4 bytes have changed)
```
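For contrast, the existing compress behaves roughly like this (a paraphrase of the original humanhash source, not a verbatim copy), which is why only the final output moves when the extra input bytes arrive:

```python
from functools import reduce
from operator import xor

def compress_old(bytes_, target):
    length = len(bytes_)
    if target > length:
        raise ValueError("Fewer input bytes than requested output")

    # Even segments; every left-over byte is appended to the final
    # segment, so only the last output value absorbs the remainder.
    seg_size = length // target
    segments = [list(bytes_[i * seg_size:(i + 1) * seg_size])
                for i in range(target)]
    segments[-1].extend(bytes_[target * seg_size:])
    return [reduce(xor, seg, 0) for seg in segments]
```

Running both sketches on the lists above reproduces the outputs shown.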
As an aside, I have an equivalent compress method prepared for the JavaScript port, and I will create a pull request there if this is merged.
Thanks!
I'm maintaining the humanhash3 PyPI package, and if you can create a PR to my repo I'd be happy to merge it in. Thanks for the explanation! 😄