echoprint-codegen icon indicating copy to clipboard operation
echoprint-codegen copied to clipboard

endianness of murmurhash

Open bwhitman opened this issue 13 years ago • 6 comments

from ashwin:

MurmurHash2 used gives different results on big endian and little endian machines. Looks like if this is fixed, it might break codegen backward compatibility? Could someone explain if this was by design or a bug? How will the matching work if the codes are different? Edit Fingerprint.css, replace MurmurHash2 with:

//-----------------------------------------------------------------------------
// MurmurHashNeutral2, by Austin Appleby

// Same as MurmurHash2, but endian- and alignment-neutral.
// Half the speed though, alas.

unsigned int MurmurHash2 ( const void * key, int len, unsigned int seed )
{
    const unsigned int m = 0x5bd1e995;
    const int r = 24;

    unsigned int h = seed ^ len;

    const unsigned char * data = (const unsigned char *)key;

    while(len >= 4)
    {
        unsigned int k;

        k  = data[0];
        k |= data[1] << 8;
        k |= data[2] << 16;
        k |= data[3] << 24;

        k *= m; 
        k ^= k >> r; 
        k *= m;

        h *= m;
        h ^= k;

        data += 4;
        len -= 4;
    }

    switch(len)
    {
    case 3: h ^= data[2] << 16;
    case 2: h ^= data[1] << 8;
    case 1: h ^= data[0];
            h *= m;
    };

    h ^= h >> 13;
    h *= m;
    h ^= h >> 15;

    return h;
} 

Looks like it is the same on little endian machines:

int main()
{
  char a[100];
  printf("%d, %d\n", MurmurHash2(a, sizeof(a), 2323), MurmurHashNeutral2(a, sizeof(a), 2323));
}
Output:
-330669574, -330669574

Machine used was 64bit Intel.

bwhitman avatar Jul 11 '11 17:07 bwhitman

MurmurHashNeutral2 is twice as slow as the little-endian only version. We should only use it if we detect a big-endian architecture. https://sites.google.com/site/murmurhash/

alastair avatar Jul 11 '11 17:07 alastair

But in practice codegenning?

(mobile)

On Jul 11, 2011, at 1:47 PM, [email protected] wrote:

MurmurHashNeutral2 is twice as slow as the little-endian only version. We should only use it if we detect a big-endian architecture. https://sites.google.com/site/murmurhash/

Reply to this email directly or view it on GitHub: https://github.com/echonest/echoprint-codegen/issues/23#issuecomment-1548327

bwhitman avatar Jul 11 '11 17:07 bwhitman

true - less than 1/10th of a second over a 5 minute song.

alastair avatar Jul 11 '11 18:07 alastair

Need to dig into the whole endian thing a bit more - 16bit LE PCM won't read correctly on a BE machine anyway, so there's a chance that more parts of the codegen need tweaking.

alastair avatar Jul 15 '11 17:07 alastair

for my reference, stuff on detecting endianness: http://stackoverflow.com/questions/2100331/c-macro-definition-to-determine-big-endian-or-little-endian-machine/2100391#2100391 http://www.gnu.org/s/hello/manual/gnulib/endian_002eh.html

alastair avatar Jul 15 '11 17:07 alastair

I think we're running into this problem - we codgenned on x86 to populate our database, then got our android app to codegen ten second samples from the mic - and nothing matches. I'm thinking this is because ARM is little endian and x86 is big ? How could this work in the IOS sample? or is that intended for the desktop and not iPhones ?

ludflu avatar Dec 29 '11 02:12 ludflu