echoprint-codegen
endianness of murmurhash
from ashwin:
The MurmurHash2 implementation used here gives different results on big-endian and little-endian machines. Looks like if this is fixed, it might break codegen backward compatibility? Could someone explain whether this was by design or a bug? How will matching work if the codes are different? Edit Fingerprint.cxx and replace MurmurHash2 with:
//-----------------------------------------------------------------------------
// MurmurHashNeutral2, by Austin Appleby
// Same as MurmurHash2, but endian- and alignment-neutral.
// Half the speed though, alas.

unsigned int MurmurHash2 ( const void * key, int len, unsigned int seed )
{
    const unsigned int m = 0x5bd1e995;
    const int r = 24;

    unsigned int h = seed ^ len;

    const unsigned char * data = (const unsigned char *)key;

    while(len >= 4)
    {
        unsigned int k;

        k  = data[0];
        k |= data[1] << 8;
        k |= data[2] << 16;
        k |= data[3] << 24;

        k *= m;
        k ^= k >> r;
        k *= m;

        h *= m;
        h ^= k;

        data += 4;
        len -= 4;
    }

    switch(len)
    {
    case 3: h ^= data[2] << 16;
    case 2: h ^= data[1] << 8;
    case 1: h ^= data[0];
            h *= m;
    };

    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;

    return h;
}
Looks like it produces the same result on little-endian machines:

#include <stdio.h>

int main()
{
    char a[100];  /* left uninitialized, so the absolute values are arbitrary; the point is that both hashes agree on the same bytes */
    printf("%d, %d\n", MurmurHash2(a, sizeof(a), 2323), MurmurHashNeutral2(a, sizeof(a), 2323));
    return 0;
}
Output:
-330669574, -330669574
The machine used was 64-bit Intel.
MurmurHashNeutral2 is twice as slow as the little-endian only version. We should only use it if we detect a big-endian architecture. https://sites.google.com/site/murmurhash/
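For illustration, one way to act on that suggestion is to route the hash through a small wrapper and let the compiler pick the implementation. This is only a sketch; the FingerprintHash name and the reliance on GCC/Clang's predefined __BYTE_ORDER__ macros are assumptions, not anything currently in the repo.

/* Sketch: use the fast little-endian MurmurHash2 on little-endian targets and
 * fall back to the endian-neutral version only when the compiler reports a
 * big-endian target, so the emitted hash values match either way. */
static inline unsigned int FingerprintHash(const void * key, int len, unsigned int seed)
{
#if defined(__BYTE_ORDER__) && (__BYTE_ORDER__ == __ORDER_BIG_ENDIAN__)
    return MurmurHashNeutral2(key, len, seed);  /* byte-order independent, ~2x slower */
#else
    return MurmurHash2(key, len, seed);         /* original fast path */
#endif
}

On a little-endian host both branches return identical values (as the test above shows), so a wrapper like this would only change behaviour on big-endian machines.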
But how much does the slowdown matter in practice when codegenning?
True - it's less than 1/10th of a second over a 5-minute song.
Need to dig into the whole endian thing a bit more - 16-bit LE PCM won't read correctly on a BE machine anyway, so there's a chance that more parts of the codegen need tweaking.
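To make that concrete, reading 16-bit little-endian PCM portably means assembling each sample from its two bytes rather than casting the buffer, the same trick MurmurHashNeutral2 uses. This is only a sketch, and pcm16_le_to_host is a hypothetical helper, not a function in the codegen:

#include <stdint.h>

/* Sketch: decode one 16-bit little-endian PCM sample regardless of host
 * byte order by reading the low byte first. */
static int16_t pcm16_le_to_host(const unsigned char * p)
{
    return (int16_t)(p[0] | (p[1] << 8));
}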
For my reference, stuff on detecting endianness: http://stackoverflow.com/questions/2100331/c-macro-definition-to-determine-big-endian-or-little-endian-machine/2100391#2100391 and http://www.gnu.org/s/hello/manual/gnulib/endian_002eh.html
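For completeness, the runtime check in the linked Stack Overflow answer amounts to roughly the following (a sketch only, not code from the repo):

#include <stdio.h>

/* Sketch: store a known 32-bit value and inspect which byte lands first in
 * memory; big-endian machines put the most significant byte first. */
static int is_big_endian(void)
{
    union { unsigned int i; unsigned char c[4]; } u = { 0x01020304 };
    return u.c[0] == 1;
}

int main(void)
{
    printf("big endian: %d\n", is_big_endian());
    return 0;
}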
I think we're running into this problem - we codegenned on x86 to populate our database, then got our Android app to codegen ten-second samples from the mic - and nothing matches. I'm thinking this is because ARM is little endian and x86 is big? How could this work in the iOS sample? Or is that intended for the desktop and not iPhones?