[Question]How can i use murmur3 to reshuffle a same dataset for many times
Can i use different seed for every time reshuffling to do that?
Is the every time reshuffling expected to be well-distributed(or maybe depends on the seed)?
If i can:
Comparing to the solution that appending different salt string to the key to get a reshuffling,
the way that changing the seed has any advantage(disadvantage)?
Can i use different seed for every time reshuffling to do that?
Yes, but ... is not really clear for which purposes you need it...
the way that changing the seed has any advantage(disadvantage)?
- you don't need to mix it (with some extra data) via extra algorithm like PBKDF2 (with hmac) or similar;
- it is many time faster as another way (using mixing), but note that murmur3 is a hash-function (and not cryptographically strong), thus its primary purpose is to hash something.
With other words the hashing quality (namely bit-distribution) takes precedence over crypto-strength.
Just as example: the seed may be used to hash something equal using different "types" (quasi to differentiate the same string of different types), like in example below:
#define HT_STR 0x05
#define HT_INT 0x0A
const char *testStr = "123";
uint32_t hStr;
uint32_t hInt;
MurmurHash3_x86_32(testStr, 3, HT_STR, &hStr);
MurmurHash3_x86_32(testStr, 3, HT_INT, &hInt);
printf("key: %s, hash: STR = %08x, INT = %08x\n", testStr, hStr, hInt);
...
key: 123, hash: STR = 360bd560, INT = d6cecd5a
As disadvantage of usage the seed as salt would be the only - very fast bruteforce, using different salts. But it is "disadvantage" of all computationally faster algorithms (e. g. you don't need rainbow tables). Additionally because the key (password) will be not necessary changed (not mixed), you can work multi-threaded on the same key-buffer to find the salt if hash known (that avoids unnecessary memory access, the wash-out of the cpu-cache, etc,). Also another different scenarios are to imagine for possible crypto-analyses (not even brute-like).
It is crucial, for which purposes you'll use it...
Thanks for your reply.
My purpose is to divide a large set of uuid into N buckets for T times.
And I am hoping that all the uuid of a specific bucket will be evenly distributed in buckets after next allocation.
Example:
// for every uuid
for ( i = 0; i < T; i++) {
int bucket = hash_function(uuid, salt) % N
printf("%d round: %s is in #%d bucket", i, uuid, bucket)
}
So, instead of crypto performance, I am more concerned about which method(change mix string or seed) can get a better reshuffling if I use murmur3.
I hope I've correct understood what you meant with "reshuffling"... Because in this sense you cannot really talk about "reshuffling" using any hash-function. Hashes are not a bit-mixers. Hash-function acts as a "sieve" here, so you will get not the "original bits" in another order, but mathematically modified bit-sequence, possibly with loss of "original bits" (if one may so express it) by overflow. The larger the input buffer the more "losses" you'll get. But also if you'll apply the hash to the 128 bits, so some bits can "lose" also using MM3-128.
I'll try to explain it. You take 2 random 128-bits numbers - e. g. 63ed631fdfe5855fabc64a6d6c6748e7 and eda7071403f55e0be10dfd5e24c04307. Now we will create 5 MM3-128 for each number with seeds (1..5). Additionally I'll calculate a count of bit "1" within it (in result table I called it as "b1c").
| original | b1c | seed | mm3-128 | b1c |
|---|---|---|---|---|
63ed631fdfe5855fabc64a6d6c6748e7 |
74 | 1 | 6fb0f82a0c050df9a6053986b34a33a4 |
58 |
eda7071403f55e0be10dfd5e24c04307 |
61 | 1 | affa97ed98e2b714f193fafef32eae49 |
78 |
63ed631fdfe5855fabc64a6d6c6748e7 |
74 | 2 | 2fd05078c0beb3bde9cf4666d13e1c1b |
67 |
eda7071403f55e0be10dfd5e24c04307 |
61 | 2 | eebe048044f2d8a90c8d603b48fb568f |
60 |
63ed631fdfe5855fabc64a6d6c6748e7 |
74 | 3 | 1d186ab0dde5f4e87effd6b8fc0bb0d0 |
71 |
eda7071403f55e0be10dfd5e24c04307 |
61 | 3 | 3e5419a2f6d54e1520be34e63fadb952 |
66 |
63ed631fdfe5855fabc64a6d6c6748e7 |
74 | 4 | b9b33044562c4d557cadf1074717eb48 |
63 |
eda7071403f55e0be10dfd5e24c04307 |
61 | 4 | 9a860de7d9bcee2261847ccc680c298c |
59 |
63ed631fdfe5855fabc64a6d6c6748e7 |
74 | 5 | d41eba3fccec53104b23cde8b5d05082 |
61 |
eda7071403f55e0be10dfd5e24c04307 |
61 | 5 | ae617e19f31c97493da5a1c44c6ad84d |
64 |
So as you see the first MM3 with seed 1 has "lost" 16 (58-74) bits for the first number and has "won" 17 (78-61) bits for the second number. And you see always different values by the seeds (one time grew / another fallen). This occurs, because the 128-bits buffer used for result "overflows" differently using different seeds after applying mathematically calculation.
Possibly you need just a bits-mixer for the perfect reshuffling?