
Meow 0.6 candidate functions

Open NoHatCoder opened this issue 2 years ago • 4 comments

I wanted to share what I have been working on. It still needs some work, but I have 4 hash functions that I'm reasonably pleased with so far. Check them out: https://github.com/NoHatCoder/Meow-Hash-0.6-Candidate

Not in the code, but I also finally figured out how we might utilize AVX-512 without overflowing the registers too much on older CPUs. We would run 4 parallel tracks that don't intermingle before finalization. In the 128-bit code, for each block of several KiB we process one lane at a time, so that we don't have to keep swapping which lane resides in registers. Finalization gets more complicated, so we probably want to fall back to the plain 128-bit version for short input.
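To make the lane-at-a-time idea concrete, here is a minimal C sketch. It is not the candidate code: `absorb` is a stand-in mixer rather than an AES round, and the data layout (each 64-byte stripe holding one 16-byte word per lane), the 4 KiB block size, and all function names are assumptions made for illustration. It only shows that, because the lanes never intermingle before finalization, walking one lane at a time over a block leaves the same lane states as the order a 512-bit implementation would use. Finalization and the short-input fallback are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout: 4 lanes, one 16-byte word per lane in every 64-byte stripe. */
enum { LANES = 4, WORD = 16, STRIPE = LANES * WORD, BLOCK = 4096 };

/* Stand-in mixer (NOT Meow's AES round): absorb one 16-byte word into a lane. */
uint64_t absorb(uint64_t state, const uint8_t *word)
{
    uint64_t lo, hi;
    memcpy(&lo, word, 8);
    memcpy(&hi, word + 8, 8);
    state = (state ^ lo) * 0x9E3779B97F4A7C15ull;
    return ((state << 29) | (state >> 35)) ^ hi;   /* rotate left by 29, fold in hi */
}

/* Order a 512-bit implementation would see: all 4 lanes advance per stripe. */
void hash_wide_order(const uint8_t *data, size_t len, uint64_t lane[LANES])
{
    assert(len % STRIPE == 0);
    for (size_t off = 0; off < len; off += STRIPE)
        for (int k = 0; k < LANES; k++)
            lane[k] = absorb(lane[k], data + off + k * WORD);
}

/* 128-bit order: within each block, finish lane 0, then lane 1, and so on,
   so only one lane's state has to be live at any time. */
void hash_lane_at_a_time(const uint8_t *data, size_t len, uint64_t lane[LANES])
{
    assert(len % STRIPE == 0);
    for (size_t base = 0; base < len; base += BLOCK) {
        size_t end = (len - base < BLOCK) ? len : base + BLOCK;
        for (int k = 0; k < LANES; k++) {
            uint64_t s = lane[k];
            for (size_t off = base; off < end; off += STRIPE)
                s = absorb(s, data + off + k * WORD);
            lane[k] = s;
        }
    }
}

/* Returns 1 when both traversal orders leave identical lane states. */
int orders_agree(size_t len)
{
    static uint8_t buf[4 * BLOCK];
    uint64_t a[LANES] = {1, 2, 3, 4}, b[LANES] = {1, 2, 3, 4};
    assert(len <= sizeof buf);
    for (size_t i = 0; i < len; i++)
        buf[i] = (uint8_t)(i * 131u + 7u);   /* arbitrary test pattern */
    hash_wide_order(buf, len, a);
    hash_lane_at_a_time(buf, len, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Calling `orders_agree(len)` with any multiple of 64 up to the buffer size compares the two traversal orders. The sketch only accepts whole stripes, so a real implementation would still need to handle the tail.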

Poke @cmuratori @petersn

NoHatCoder, Aug 12 '21 20:08

Awesome! I will take a look.

Separately, I am curious: the four-parallel-track construction is how I did the original Meow Hash (the one that didn't have enough diffusion). If it can work for AVX-512, why could it not be retained from the original Meow Hash for 128-bit? In general, parallel-stream construction is the best kind of construction for throughput, since AES instructions have 4-cycle latency...

- Casey

cmuratori, Aug 12 '21 21:08

I didn't think of this until now. Maybe you considered this construction obvious, but I just assumed that if we did parallel tracks we would run out of registers and thus add a bunch of overhead to the 128-bit implementation.

NoHatCoder, Aug 13 '21 05:08

Well, it's not so much that I considered it obvious as that it was the original design of Meow Hash :) My assumption was that since you didn't use any parallel construction in your blocks for the 128-bit version, your reasoning was that the hash was not as good if it was mixed at the end. But I guess that is not true? If not, that is excellent, because the more parallel tracks you can do, the faster you can go, typically, and that's why I designed the original one that way.

- Casey

cmuratori, Aug 13 '21 06:08

So, since I was looking at ChaCha20 and AES-256-CTR recently, I also have some important updates: it turns out both Zen 2/3 and Tiger Lake added a second AES unit. That means parallel construction is now much more important for speed, because the newer x64 chips can issue two AES instructions every cycle even without VAES!
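As a rough back-of-the-envelope check on why this matters, the C sketch below models steady-state AES issue rate from the number of independent dependency chains. It is a simplified model and an assumption-laden one: it takes the 4-cycle latency mentioned earlier in the thread, assumes each AES unit is fully pipelined and accepts one instruction per cycle, and ignores loads, other instructions, and memory bandwidth. The function names are hypothetical.

```c
/* Independent dependency chains needed to keep every AES unit busy
   (latency times issue width, assuming fully pipelined units). */
unsigned chains_to_saturate(unsigned latency_cycles, unsigned aes_units)
{
    return latency_cycles * aes_units;
}

/* Steady-state AES rounds issued per cycle with `chains` independent chains:
   limited either by latency (chains / latency) or by the number of units. */
double rounds_per_cycle(unsigned chains, unsigned latency_cycles, unsigned aes_units)
{
    double limited_by_latency = (double)chains / latency_cycles;
    return limited_by_latency < aes_units ? limited_by_latency : (double)aes_units;
}
```

Under these assumptions, `chains_to_saturate(4, 2)` is 8, while `rounds_per_cycle(4, 4, 2)` is 1.0, which is half of the two-unit peak in this model.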

I need to take a look at what you've got so far, @NoHatCoder, and I'll think about how it could be arranged.

- Casey

cmuratori, Aug 15 '21 00:08