Endianness

Open kloetzl opened this issue 5 years ago • 2 comments

Murmur3 is sensitive to the endianness of the system and can therefore produce different results on different platforms. On Debian this leads to breakage on big-endian systems (log) (bug tracker). Can mash on a big-endian machine produce the same output as on a little-endian one? Maybe the tests should allow a little wiggle room in the numbers?

Best, Fabian

https://github.com/marbl/Mash/blob/aabd5925e7cfc097a8d89e2d8691ac4af5b95d37/src/mash/MurmurHash3.cpp#L52-L65

kloetzl avatar Jan 07 '19 13:01 kloetzl

Passing tests with wiggle room would be fine for locally created sketches, but there would be spurious results against pre-built sketches that we distribute (like RefSeq). Including endianness in the sketch metadata to enforce compatibility is a possibility, but ideally we would want to generate the same hashes on big- and little-endian machines. I don't know how to make that happen without doing more research; any insight is appreciated!
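For the metadata option, a minimal sketch of what such a check could look like. The names `ByteOrder`, `hostByteOrder`, and `sketchCompatible` are hypothetical and not part of Mash or its sketch format; this only shows how the byte order could be detected at runtime and compared against a stored value.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical tag that a sketch file could carry in its metadata.
enum class ByteOrder : uint8_t { Little = 0, Big = 1 };

// Detect the byte order of the machine running this code by looking at
// the first byte in memory of the integer 1.
inline ByteOrder hostByteOrder() {
    const uint32_t probe = 1;
    uint8_t first = 0;
    std::memcpy(&first, &probe, 1);
    return first == 1 ? ByteOrder::Little : ByteOrder::Big;
}

// Hypothetical compatibility check: only compare against sketches that
// were hashed on a machine with the same byte order.
inline bool sketchCompatible(ByteOrder sketchOrder) {
    return sketchOrder == hostByteOrder();
}
```

This would only refuse mismatched sketches; it would not make the hashes themselves agree across platforms.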

ondovb avatar Mar 18 '19 20:03 ondovb

This is kinda academic because very few people other than Debian actually use a big-endian machine. So in order to distribute mash on Debian and derivatives we now link against a portable implementation of MurmurHash. So you could do the same and incorporate the changes.
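To illustrate the general idea, here is a minimal sketch, assuming the platform dependence comes from loading input blocks as native integers. The helpers `getblock32_le` and `getblock64_le` are made-up names for this example and are not the code Debian links against; they assemble each block byte by byte so the value is the same on any host.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: read the i-th 32-bit block of the input as a
// little-endian integer, independent of the host byte order.
inline uint32_t getblock32_le(const uint8_t* p, std::size_t i) {
    const uint8_t* b = p + 4 * i;
    return  static_cast<uint32_t>(b[0])
         | (static_cast<uint32_t>(b[1]) << 8)
         | (static_cast<uint32_t>(b[2]) << 16)
         | (static_cast<uint32_t>(b[3]) << 24);
}

// Same idea for 64-bit blocks: start from the most significant byte
// (the last one in memory) and shift the rest in below it.
inline uint64_t getblock64_le(const uint8_t* p, std::size_t i) {
    const uint8_t* b = p + 8 * i;
    uint64_t v = 0;
    for (int k = 7; k >= 0; --k) {
        v = (v << 8) | b[k];
    }
    return v;
}
```

On a little-endian machine these return the same values as a plain native load, so existing hashes would not change there; on a big-endian machine they cost a few extra shifts per block.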

W.r.t. the sketches, yeah that's tricky. Quoting the Capnp website here:

But doesn’t that mean the encoding is platform-specific? NO! The encoding is defined byte-for-byte independent of any platform. […] Integers use little-endian byte order because most CPUs are little-endian, and even big-endian CPUs usually have instructions for reading little-endian data.

So you should be fine there? Unfortunately, I do not have a big-endian machine for testing.

kloetzl avatar Mar 18 '19 20:03 kloetzl