Mash
Endianness
Murmur3 is sensitive to the endianness of the system and can therefore produce different results on different architectures. On Debian this leads to breakage on big-endian systems (log) (bug tracker). Can Mash on a big-endian machine produce the same output as on a little-endian one? Maybe the tests should allow a little wiggle room in the numbers?
Best, Fabian
https://github.com/marbl/Mash/blob/aabd5925e7cfc097a8d89e2d8691ac4af5b95d37/src/mash/MurmurHash3.cpp#L52-L65
Passing tests with wiggle room would be fine for locally created sketches, but there would be spurious results against the pre-built sketches that we distribute (like RefSeq). Including endianness in the sketch metadata to enforce compatibility is a possibility, but ideally we would want to generate the same hashes on big- and little-endian machines. I don't know how to make that happen without doing more research; any insight is appreciated!
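For the metadata idea, a minimal sketch of what such a check could look like is below. The `ByteOrder` enum and both function names are hypothetical and are not part of Mash or of its sketch format; the sketch only assumes that one byte-order flag would be stored with each sketch file.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical flag that a sketch file could carry in its metadata.
enum class ByteOrder : uint8_t { Little = 0, Big = 1 };

// Detect the byte order of the machine this code runs on by looking at
// the first byte of a known 32-bit value.
inline ByteOrder hostByteOrder() {
    const uint32_t probe = 1;
    uint8_t first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1 ? ByteOrder::Little : ByteOrder::Big;
}

// Hypothetical compatibility check: a sketch would only be comparable
// if it was hashed on a machine with the same byte order as this one.
inline bool sketchIsCompatible(ByteOrder sketchOrder) {
    return sketchOrder == hostByteOrder();
}
```

This would only turn silent mismatches into explicit errors; it would not make the distributed sketches usable on big-endian machines.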
This is kinda academic because very few people other than Debian actually use a big-endian machine. So in order to distribute mash on Debian and derivatives we now link against a portable implementation of Murmurhash. So you could do the same and incorporate the changes.
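To illustrate the general idea behind a portable implementation (this is a sketch, not the code Debian links against): if the hash assembles each block from individual bytes in little-endian order instead of loading a whole word from memory, the value no longer depends on the host. The helper names below are made up for this example, and it assumes the linked lines are the block-read helpers.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical replacement for a 32-bit block read: build the word from
// bytes in little-endian order, so the result is identical on any host.
inline uint32_t getblock32_le(const uint8_t* data, size_t i) {
    const uint8_t* b = data + i * 4;
    return  static_cast<uint32_t>(b[0])
         | (static_cast<uint32_t>(b[1]) << 8)
         | (static_cast<uint32_t>(b[2]) << 16)
         | (static_cast<uint32_t>(b[3]) << 24);
}

// Same idea for the 64-bit blocks used by the 128-bit x64 variant.
inline uint64_t getblock64_le(const uint8_t* data, size_t i) {
    const uint8_t* b = data + i * 8;
    uint64_t v = 0;
    // Start at the most significant byte (b[7]) and shift it upward.
    for (int k = 7; k >= 0; --k) {
        v = (v << 8) | b[k];
    }
    return v;
}
```

On little-endian machines this should give the same values as a direct load, so existing sketches would stay valid; compilers commonly reduce this pattern to a plain load there, though I have not measured it.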
W.r.t. the sketches, yeah that's tricky. Quoting the Capnp website here:
But doesn’t that mean the encoding is platform-specific? NO! The encoding is defined byte-for-byte independent of any platform. […] Integers use little-endian byte order because most CPUs are little-endian, and even big-endian CPUs usually have instructions for reading little-endian data.
So you should be fine there? Unfortunately, I do not have a big-endian machine for testing.