echoprint-codegen
endianness of murmurhash
from ashwin:
The MurmurHash2 implementation used here gives different results on big-endian and little-endian machines. Looks like if this is fixed, it might break codegen backward compatibility? Could someone explain whether this was by design or a bug? How will matching work if the codes are different? Edit Fingerprint.cxx and replace MurmurHash2 with:
//-----------------------------------------------------------------------------
// MurmurHashNeutral2, by Austin Appleby
// Same as MurmurHash2, but endian- and alignment-neutral.
// Half the speed though, alas.

unsigned int MurmurHash2 ( const void * key, int len, unsigned int seed )
{
    const unsigned int m = 0x5bd1e995;
    const int r = 24;

    unsigned int h = seed ^ len;

    const unsigned char * data = (const unsigned char *)key;

    while(len >= 4)
    {
        unsigned int k;

        k  = data[0];
        k |= data[1] << 8;
        k |= data[2] << 16;
        k |= data[3] << 24;

        k *= m;
        k ^= k >> r;
        k *= m;

        h *= m;
        h ^= k;

        data += 4;
        len -= 4;
    }

    switch(len)
    {
    case 3: h ^= data[2] << 16;
    case 2: h ^= data[1] << 8;
    case 1: h ^= data[0];
            h *= m;
    };

    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;

    return h;
}
Looks like it produces the same result on little-endian machines:

#include <stdio.h>

int main()
{
    char a[100];  /* left uninitialized, so the absolute values are arbitrary; the point is that both hashes agree on the same bytes */
    printf("%d, %d\n", MurmurHash2(a, sizeof(a), 2323), MurmurHashNeutral2(a, sizeof(a), 2323));
    return 0;
}
Output:
-330669574, -330669574
The machine used was 64-bit Intel.
MurmurHashNeutral2 is twice as slow as the little-endian only version. We should only use it if we detect a big-endian architecture. https://sites.google.com/site/murmurhash/
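For illustration, one way to act on that suggestion is to route the hash through a small wrapper and let the compiler pick the implementation. This is only a sketch; the FingerprintHash name and the reliance on GCC/Clang's predefined __BYTE_ORDER__ macros are assumptions, not anything currently in the repo.

/* Sketch: use the fast little-endian MurmurHash2 on little-endian targets and
 * fall back to the endian-neutral version only when the compiler reports a
 * big-endian target, so the emitted hash values match either way. */
static inline unsigned int FingerprintHash(const void * key, int len, unsigned int seed)
{
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
    return MurmurHashNeutral2(key, len, seed);  /* byte-order independent, ~2x slower */
#else
    return MurmurHash2(key, len, seed);         /* original fast path */
#endif
}

On a little-endian host both branches return identical values (as the test above shows), so a wrapper like this would only change behaviour on big-endian machines.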
But how much does the slowdown matter in practice when codegenning?
True - it's less than 1/10th of a second over a 5-minute song.
Need to dig into the whole endian thing a bit more - 16-bit LE PCM won't read correctly on a BE machine anyway, so there's a chance that more parts of the codegen need tweaking.
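To make that concrete, reading 16-bit little-endian PCM portably means assembling each sample from its two bytes rather than casting the buffer, the same trick MurmurHashNeutral2 uses. This is only a sketch, and pcm16_le_to_host is a hypothetical helper, not a function in the codegen:

#include <stdint.h>

/* Sketch: decode one 16-bit little-endian PCM sample regardless of host
 * byte order by reading the low byte first. */
static int16_t pcm16_le_to_host(const unsigned char * p)
{
    return (int16_t)(p[0] | (p[1] << 8));
}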
For my reference, stuff on detecting endianness: http://stackoverflow.com/questions/2100331/c-macro-definition-to-determine-big-endian-or-little-endian-machine/2100391#2100391 and http://www.gnu.org/s/hello/manual/gnulib/endian_002eh.html
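For completeness, the runtime check in the linked Stack Overflow answer amounts to roughly the following (a sketch only, not code from the repo):

#include <stdio.h>

/* Sketch: store a known 32-bit value and inspect which byte lands first in
 * memory; big-endian machines put the most significant byte first. */
static int is_big_endian(void)
{
    union { unsigned int i; unsigned char c[4]; } u = { 0x01020304 };
    return u.c[0] == 1;
}

int main(void)
{
    printf("big endian: %d\n", is_big_endian());
    return 0;
}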
I think we're running into this problem - we codegenned on x86 to populate our database, then got our Android app to codegen ten-second samples from the mic - and nothing matches. I'm thinking this is because ARM is little endian and x86 is big? How could this work in the iOS sample? Or is that intended for the desktop and not iPhones?