What's going on with murmur3?
This code looks an awful lot like it's doing a murmur3-32, and returning four bytes:
https://github.com/multiformats/go-multihash/blob/6f1ea18f1da5f7735ea31b5e2011da61c409e37f/sum.go#L146-L154
And this code looks an awful lot like it's registering the above code as murmur3-128, which I suspect ought to return more than four bytes:
https://github.com/multiformats/go-multihash/blob/6f1ea18f1da5f7735ea31b5e2011da61c409e37f/sum.go#L202
Should I be worried?
And if this is a bug, how comfortable are we about fixing it vs potentially breaking things that rely on the current behavior?
I guess https://github.com/multiformats/go-multihash/commit/1b82edf96afa4 is relevant, but I'm still sort of confused. Something with "128" in the name returning 4 bytes is... unexpected.
Yeah, this seems wrong. I'm not sure what's going on here.
I believe we should be using Sum64 (which is also not 128 but it's what go-unixfs uses). Honestly, we should never use this hash function for anything.
I did a bit of poking around and here's what I've got.
murmur3
- There are 3 variants - 32 bit, 128 bit optimized for 32bit machines, 128 bit optimized for 64bit machines
- We have 2 codes - murmur3-32 and murmur3-128. @Stebalien's comment https://github.com/multiformats/multihash/issues/85#issuecomment-449212222 indicates that the UnixFS version is the 64bit one (and that seems to match my poking at the test vectors below, but perhaps with some endianness issues).
- Finding reliable test vectors is a bit hard because test vectors aren't really visible, but there are some here https://github.com/aappleby/smhasher/issues/6.
- I added some of the test vectors to IPFS for future reference, as I don't want the test vectors to only be stored on pastebin
- x86-32: https://ipfs.io/ipfs/bafkreicowio3fgyeqz6ns73vzppbkbnfhfedpkemch5iuz2fc6sjnzmix4
- x64-128: https://ipfs.io/ipfs/bafkreihqwk2t3vhfmaga55d2tkhzmwol7vt5xh22xclw6icdmwhegalina
- Note for posterity: the hashes inside the CIDs are the SHA-256 of the raw file bytes
- For some reason when murmer3 was initially implemented in this library in #42 for some reason it was implemented as murmur3-32 (not what was already in use by UnixFS). Also, it was encoded as LittleEndian which happens to match the test vectors I found but does not match what the underlying Go library we're using does (which uses BigEndian).
- When we redefined/clarified that the murmur3 code was actually the murmur3-(x64)-128 code we didn't change the murmer3 implementation used here
UnixFS Sharding
- @Stebalien switched from "64bit" (i.e. first half of 128 bit) to full 128bit in https://github.com/ipfs/go-unixfs/pull/53
- Less than a month later it seems to have been reverted accidentally by https://github.com/ipfs/go-unixfs/pull/42 🤦♂️
Plan I think we should go with:
- Figure out what we're actually using in UnixFS
- If we can fix or throw some definitions into UnixFS that make the hash function actually match mumur3-x64-128
- If not then we have a tough decision to make about if we need to redefine the code
- In the meanwhile in order to merge #136, I think we should just remove the murmur3 hash entirely
- I really really hope no one is using this murmur3 implementation, but we can try and do some spelunking to warn them. This implementation isn't really valid since UnixFS was AFAICT the first user and so the definition is supposed to match that one. It definitely hasn't been valid since we clarified murmer3 -> murmer3-(x64)-128.
Note: I'll put up some code on the endianness + test vectors when I get a chance. Need another pair of eyes to make sure I'm doing it correctly, and also need some context on UnixFS usage to make sure I'm understanding how the murmur3 hashes are being used.