squirrel
squirrel copied to clipboard
String hash function is not optimal
I've noticed that a string hash function is defined in squirrel/sqstring.h
like this:
inline SQHash _hashstr (const SQChar *s, size_t l)
{
SQHash h = (SQHash)l; /* seed */
size_t step = (l>>5)|1; /* if string is too long, don't hash all its chars */
for (; l>=step; l-=step)
h = h ^ ((h<<5)+(h>>2)+(unsigned short)*(s++));
return h;
}
which is, I believe, borrowed from Lua's hash function.
TString *luaS_newlstr (lua_State *L, const char *str, size_t l) {
GCObject *o;
unsigned int h = cast(unsigned int, l); /* seed */
size_t step = (l>>5)+1; /* if string is too long, don't hash all its chars */
size_t l1;
for (l1=l; l1>=step; l1-=step) /* compute hash */
h = h ^ ((h<<5)+(h>>2)+cast(unsigned char, str[l1-1]));
// ...
}
But these functions are not exactly the same.
First, Squirrel's step
is calculated as bitwise OR on the length of the string divided by 32 and 1 (size_t step = (l>>5)|1
). It means the step is 1 until the length reaches 64 and suddenly jumps to 3. I think step
is defined in order to limit computational cost of hash function on very long strings, but the current implementation doesn't seem reasonable to me (It examines all the characters when the length is 63, but for lengths above 64, the number of characters examined won't exceed 32).
And more importantly, it only takes first 32 characters into consideration (63 characters in the extreme case) because of s++
. If some strings share first 32 characters and have the same length, they will fall into the same bin of a hash table, which could be very inefficient to lookup. On the other hand, Lua's implementation scans the string evenly (but limited number of times) to avoid this issue.
If my point is unclear, consider this little demo program:
#include <stdio.h>
#include <string.h>
#include <time.h>
typedef unsigned long SQHash;
typedef char SQChar;
inline SQHash _hashstr (const SQChar *s, size_t l)
{
SQHash h = (SQHash)l; /* seed */
size_t step = (l>>5)|1; /* if string is too long, don't hash all its chars */
for (; l>=step; l-=step)
h = h ^ ((h<<5)+(h>>2)+(unsigned short)*(s++));
return h;
}
const SQChar *tests[] = {
"",
"0123456789",
"01234567890123456789",
"0123456789012345678901234567890123456789",
"012345678901234567890123456789012345678901234567890123456789",
"0123456789012345678901234567890123456789012345678901234567890123456789",
"0123456789012345678901234567890123456789012345678901234567890000000000",
"0123456789012345678901234567890123456789012345678901234567891111111111",
};
int main(){
int i;
for(i = 0; i < sizeof tests/sizeof*tests; i++){
printf("%u %u %s\n", _hashstr(tests[i], strlen(tests[i])), strlen(tests[i])>>5|1, tests[i]);
}
}
It emits following output in my environment (MinGW 64bit):
0 1
2706608976 1 0123456789
2934042991 1 01234567890123456789
3784033420 1 0123456789012345678901234567890123456789
3553593974 1 012345678901234567890123456789012345678901234567890123456789
2197401726 3 0123456789012345678901234567890123456789012345678901234567890123456789
2197401726 3 0123456789012345678901234567890123456789012345678901234567890000000000
2197401726 3 0123456789012345678901234567890123456789012345678901234567891111111111
Notice the last 3 lines. They have the same hash value!
I don't even think Lua's hash function is particularly sophisticated. The idea of having wide steps on long strings to reduce computational cost is great, but I don't see how the algorithm can effectively hash the input. There are many lightweight, documented (and not necessarily cryptographic) hash functions like FNV-1a (used in Tcl) or DJBX33A (used in Python and PHP).
I measured execution time for 10000000 times Squirrel's hash function and FNV-1a on a string with 40 characters. The result was:
Squirrel hash Time: 1.09
FNV_1a Time: 1.834
which seems comparable.
When I turn on optimization -O3
, FNV_1a becomes winner.
Squirrel hash Time: 0.705
FNV_1a Time: 0.399
Apparently character skipping has some overhead.
Squirrel's hash function was taken verbatim from Lua 4.0 (back in 2003, see http://www.lua.org/source/4.0.1/lstring.c.html#hash_s). I guess Lua has changed it a bit along the way. I don't know much about hashing functions theory so if we can find a function that performs better, I'm open to change. I just feel that character skipping is a requirement, it's quite common(at least for me) to deal with large strings(64k+).
Thanks for reporting I updated SquiLu to use the lua 5.3.4 version.