smhasher icon indicating copy to clipboard operation
smhasher copied to clipboard

[Question]How can i use murmur3 to reshuffle a same dataset for many times

Open wooclary opened this issue 8 years ago • 3 comments

Can i use different seed for every time reshuffling to do that? Is the every time reshuffling expected to be well-distributed(or maybe depends on the seed)?

If i can: Comparing to the solution that appending different salt string to the key to get a reshuffling, the way that changing the seed has any advantage(disadvantage)?

wooclary avatar Nov 02 '17 05:11 wooclary

Can i use different seed for every time reshuffling to do that?

Yes, but ... is not really clear for which purposes you need it...

the way that changing the seed has any advantage(disadvantage)?

  • you don't need to mix it (with some extra data) via extra algorithm like PBKDF2 (with hmac) or similar;
  • it is many time faster as another way (using mixing), but note that murmur3 is a hash-function (and not cryptographically strong), thus its primary purpose is to hash something.

With other words the hashing quality (namely bit-distribution) takes precedence over crypto-strength.

Just as example: the seed may be used to hash something equal using different "types" (quasi to differentiate the same string of different types), like in example below:

#define HT_STR 0x05
#define HT_INT 0x0A

const char *testStr = "123";
uint32_t hStr;
uint32_t hInt;
MurmurHash3_x86_32(testStr, 3, HT_STR, &hStr);
MurmurHash3_x86_32(testStr, 3, HT_INT, &hInt);

printf("key: %s, hash: STR = %08x, INT = %08x\n", testStr, hStr, hInt);
...
key: 123, hash: STR = 360bd560, INT = d6cecd5a

As disadvantage of usage the seed as salt would be the only - very fast bruteforce, using different salts. But it is "disadvantage" of all computationally faster algorithms (e. g. you don't need rainbow tables). Additionally because the key (password) will be not necessary changed (not mixed), you can work multi-threaded on the same key-buffer to find the salt if hash known (that avoids unnecessary memory access, the wash-out of the cpu-cache, etc,). Also another different scenarios are to imagine for possible crypto-analyses (not even brute-like).

It is crucial, for which purposes you'll use it...

sebres avatar Nov 02 '17 10:11 sebres

Thanks for your reply.

My purpose is to divide a large set of uuid into N buckets for T times. And I am hoping that all the uuid of a specific bucket will be evenly distributed in buckets after next allocation.

Example:

// for every uuid
for ( i = 0; i < T; i++) {
    int bucket = hash_function(uuid, salt) % N
    printf("%d round: %s is in #%d bucket", i, uuid, bucket)
}

So, instead of crypto performance, I am more concerned about which method(change mix string or seed) can get a better reshuffling if I use murmur3.

wooclary avatar Nov 09 '17 21:11 wooclary

I hope I've correct understood what you meant with "reshuffling"... Because in this sense you cannot really talk about "reshuffling" using any hash-function. Hashes are not a bit-mixers. Hash-function acts as a "sieve" here, so you will get not the "original bits" in another order, but mathematically modified bit-sequence, possibly with loss of "original bits" (if one may so express it) by overflow. The larger the input buffer the more "losses" you'll get. But also if you'll apply the hash to the 128 bits, so some bits can "lose" also using MM3-128.

I'll try to explain it. You take 2 random 128-bits numbers - e. g. 63ed631fdfe5855fabc64a6d6c6748e7 and eda7071403f55e0be10dfd5e24c04307. Now we will create 5 MM3-128 for each number with seeds (1..5). Additionally I'll calculate a count of bit "1" within it (in result table I called it as "b1c").

original b1c seed mm3-128 b1c
63ed631fdfe5855fabc64a6d6c6748e7 74 1 6fb0f82a0c050df9a6053986b34a33a4 58
eda7071403f55e0be10dfd5e24c04307 61 1 affa97ed98e2b714f193fafef32eae49 78
63ed631fdfe5855fabc64a6d6c6748e7 74 2 2fd05078c0beb3bde9cf4666d13e1c1b 67
eda7071403f55e0be10dfd5e24c04307 61 2 eebe048044f2d8a90c8d603b48fb568f 60
63ed631fdfe5855fabc64a6d6c6748e7 74 3 1d186ab0dde5f4e87effd6b8fc0bb0d0 71
eda7071403f55e0be10dfd5e24c04307 61 3 3e5419a2f6d54e1520be34e63fadb952 66
63ed631fdfe5855fabc64a6d6c6748e7 74 4 b9b33044562c4d557cadf1074717eb48 63
eda7071403f55e0be10dfd5e24c04307 61 4 9a860de7d9bcee2261847ccc680c298c 59
63ed631fdfe5855fabc64a6d6c6748e7 74 5 d41eba3fccec53104b23cde8b5d05082 61
eda7071403f55e0be10dfd5e24c04307 61 5 ae617e19f31c97493da5a1c44c6ad84d 64

So as you see the first MM3 with seed 1 has "lost" 16 (58-74) bits for the first number and has "won" 17 (78-61) bits for the second number. And you see always different values by the seeds (one time grew / another fallen). This occurs, because the 128-bits buffer used for result "overflows" differently using different seeds after applying mathematically calculation.

Possibly you need just a bits-mixer for the perfect reshuffling?

sebres avatar Nov 10 '17 08:11 sebres