Better hashing
Proposal
I think we should update our hash functions and at the same time make some changes to the Hashable protocol. Ideas for these updates come from recent developments in languages such as Swift and Rust. So the proposal has two parts; let's start with the hash functions.
Our present hash functions (in base/builtin/hash.c) go back to the early days of Acton, ca 2019. At the time it seemed like a good idea to follow Python's hash strategies, so we have different ad hoc solutions for integers, strings/bytes, and floats.
Current trends in hashing have two distinctive features:
- Hash functions are designed to take an arbitrary number of bytes as input and thus to be usable for all types, from i16 (two bytes) to very long strings. There is an obvious advantage in having one basic hash function and wrapping calls to it in type-specific hash functions. However, there are many, many proposals and it is difficult to choose one.
- To prevent "hash flood attacks", hash functions should take an additional argument, a random seed, which is changed either per dict/set or at every start of the runtime system. This means that the hash of a given value differs between runs; in particular, this has implications for serialization: one must serialize a dict as a list of key/value pairs and rebuild the hash table when deserializing.
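To make the seed idea concrete, here is a minimal Python sketch. It uses 64-bit FNV-1a as a stand-in primitive (not one of the candidate functions, and not DoS-resistant by itself); the point is only the interface: the seed is mixed into the initial state, so the same input hashes differently under different seeds.

```python
# Toy seeded hash: 64-bit FNV-1a with the seed folded into the
# initial state. Illustration only, NOT a DoS-resistant function.
FNV_OFFSET = 0xcbf29ce484222325
FNV_PRIME = 0x100000001b3
MASK64 = (1 << 64) - 1

def seeded_hash(data: bytes, seed: int) -> int:
    h = (FNV_OFFSET ^ seed) & MASK64     # seed changes the starting state
    for byte in data:
        h = ((h ^ byte) * FNV_PRIME) & MASK64
    return h
```

Since each per-byte step is a bijection on the 64-bit state, two different seeds are guaranteed to give different hashes for the same input.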
I have experimented with two possible choices for a new basic hash function:
- SipHash, in particular SipHash-1-3. This is/has been the choice in Rust and Swift for several years, but has been criticized for being slow. Both languages have since developed faster, but possibly less secure, alternatives. This means users face a choice between performance and security; we will come back to that.
- xxHash, which claims (with some support from the comparison site SMHasher) to be (among) the fastest of the hash functions that pass all the quality tests on the site.
Both of these have good C implementations available. The reference implementation of SipHash is in the public domain, while xxHash is offered under the BSD 2-Clause License, whatever that means. I have done only a little testing: our test file test/builtins_auto/dict_test2.act, modified by wrapping it in a loop so the whole thing runs thousands of times, and (guess what?) telemetrify. The two tests give the same indication: xxHash has almost the same speed as our current hashing, and SipHash is significantly slower. For telemetrify, with the biggest test file and our current hash functions, the user time is 13:20 minutes; with xxHash it is 13:30 minutes and with SipHash 13:50 minutes. (Recalling that most of the time is spent in GC, the differences are bigger than they seem.) It may be surprising that our present hashes are the fastest. It should not be: Python (and we) hash integer values essentially to themselves, which is about as fast as can be... So, for the moment, we should change not for speed but for software engineering purposes -- and for the reason to be discussed next.
At present, we offer no support for the user who has defined a new class and wants to make it implement Hashable. Say we have
```
class point(value):
    def __init__(self, x : i64, y : i64):
        self.x = x
        self.y = y
    ...
```
To hash a point, it seems reasonable to hash x and y, but how to combine them? The user is left to come up with an ad hoc solution.
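To illustrate why leaving this to the user is risky, here is one tempting ad hoc combiner, sketched in Python (hypothetical helper name): XOR the component hashes. It looks reasonable but is symmetric, so swapped coordinates always collide, and equal components cancel to zero.

```python
# A tempting but flawed ad hoc combiner: XOR of the component hashes.
def xor_point_hash(x: int, y: int) -> int:
    return hash(x) ^ hash(y)

# Flaw 1: symmetric, so (x, y) and (y, x) always collide.
# Flaw 2: equal components cancel, so every (a, a) hashes to 0.
```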
Since the underlying hash function can hash an arbitrary number of bytes, we want a way to feed it with x and y in byte form. Thus we seem to need to make hashing a multi-stage process: we initialize some internal state, then we accept several byte sequences (8 bytes for x, then 8 for y), and finally we retrieve the hash value. Both hash functions above support this idea: they process their byte sequence in chunks of typically 8 bytes and maintain a small state which is updated for each chunk. So we could try something like
```
class hasher(object):
    def __init__(self, seed : ?bytes) -> None:
        NotImplemented  # self.state = f0(seed)
    def acceptBytes(self, b : bytes) -> None:
        NotImplemented  # self.state = f1(b, self.state)
    def finalize(self) -> int:
        NotImplemented  # return f2(self.state)
```
where the implementation (in C) uses functions f0, f1 and f2 specific to the actual hash function.
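As a concrete (if toy) instantiation of the three stages, here is a Python sketch where f0/f1/f2 are filled in with 64-bit FNV-1a, standing in for the real SipHash/xxHash state updates. The key property it demonstrates: feeding the input in chunks gives the same result as feeding it all at once.

```python
# Python sketch of the three-stage hasher; FNV-1a is a stand-in for
# the real f0/f1/f2 of SipHash or xxHash.
class Hasher:
    def __init__(self, seed: int = 0):
        # f0: initialize the internal state from the seed
        self.state = (0xcbf29ce484222325 ^ seed) & ((1 << 64) - 1)

    def accept_bytes(self, b: bytes) -> None:
        # f1: fold each incoming byte into the state
        for byte in b:
            self.state = ((self.state ^ byte) * 0x100000001b3) & ((1 << 64) - 1)

    def finalize(self) -> int:
        # f2: extract the final hash value from the state
        return self.state
```

Because the state is updated per byte, `accept_bytes(b"ab")` followed by `accept_bytes(b"cd")` finalizes to the same value as a single `accept_bytes(b"abcd")`, which is exactly what the point example below relies on.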
But it is not obvious how to use a hasher object to hash a point; we would create a hasher object h and then call h.acceptBytes twice, but how can we turn an i64 into 8 bytes? For a bytes object it is easy:
```
def __hash__(self):
    h = hasher()
    h.acceptBytes(self)
    return h.finalize()
```
The best idea (cf. the variants in Swift and Rust) seems to me to extend the Hashable protocol:
```
protocol Hashable(Eq):
    putBytes : (hasher) -> None
    __hash__ : () -> int
    def __hash__(self):
        h = hasher()
        self.putBytes(h)
        return h.finalize()
```
The builtin types (implemented in C) can easily implement putBytes (which in essence needs to call h.acceptBytes), since they can convert themselves to bytes by coercion. And in the class point we can write
```
extension point(Eq):
    def __eq__(p, q):
        return p.x == q.x and p.y == q.y

extension point(Hashable):
    def putBytes(self, h):
        self.x.putBytes(h)
        self.y.putBytes(h)
```
So implementing Hashable for a user-defined class is trivial: just implement putBytes by calling putBytes on the components of the state. `__hash__` uses the default implementation and need not be implemented. The builtin types will have streamlined implementations of `__hash__`.
Think now of (just to give an example)
```
class city(value):
    def __init__(self, name : str, location : point):
        ...
```
We now need to call putBytes on the name and on the location and are done. But we can also choose to hash only on location, which defines the city uniquely, so city("Copenhagen",...) and city("Köpenhamn",...) with equal points will hash to the same value.
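The whole pattern can be mirrored in Python as a sketch (hypothetical names `Hasher`/`put_bytes` standing in for the proposed `hasher`/`putBytes`; FNV-1a standing in for the real hash function): each class feeds its components' bytes into a shared hasher, and `__hash__` is a default implementation on top of `put_bytes`. `City` hashes only on its location, so differently named cities at the same point hash alike.

```python
import struct

class Hasher:
    """Toy 64-bit FNV-1a accumulator, standing in for the real hasher."""
    def __init__(self):
        self.state = 0xcbf29ce484222325
    def accept_bytes(self, b: bytes) -> None:
        for byte in b:
            self.state = ((self.state ^ byte) * 0x100000001b3) & ((1 << 64) - 1)
    def finalize(self) -> int:
        return self.state

class Point:
    def __init__(self, x: int, y: int):
        self.x, self.y = x, y
    def __eq__(self, other):
        return self.x == other.x and self.y == other.y
    def put_bytes(self, h: Hasher) -> None:
        # each i64 component as 8 little-endian bytes
        h.accept_bytes(struct.pack("<q", self.x))
        h.accept_bytes(struct.pack("<q", self.y))
    def __hash__(self):  # the protocol's default implementation
        h = Hasher()
        self.put_bytes(h)
        return h.finalize()

class City:
    def __init__(self, name: str, location: Point):
        self.name, self.location = name, location
    def __eq__(self, other):
        # the location defines the city uniquely
        return self.location == other.location
    def put_bytes(self, h: Hasher) -> None:
        # hash only on location: equal points give equal hashes
        self.location.put_bytes(h)
    def __hash__(self):
        h = Hasher()
        self.put_bytes(h)
        return h.finalize()
```

Note how `City.put_bytes` simply delegates to `Point.put_bytes`, which delegates to the integer components: hashing composes the same way the data does.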
The above is easy to do, when one has decided on a hash function.
As a next step, one could add seeds, in order to be protected against DoS attacks. Both hash functions are ready for this; probably `__hash__` should get an optional seed parameter, but since it is optional we can leave that for the moment. To take this step we need access to some truly random data, presumably in the RTS. We also need to rewrite (de)serialization for dicts and sets.
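The serialization consequence can be sketched in a few lines of Python: once hashes vary per run, a dict's table layout is not portable, so one serializes the key/value pairs and lets deserialization rebuild the table under the current seed.

```python
import json

# A dict's internal hash table depends on the per-run seed, so we
# serialize the key/value pairs instead of the table layout.
d = {"x": 1, "y": 2}
pairs = list(d.items())            # flatten to a list of pairs
blob = json.dumps(pairs)           # e.g. serialize as JSON
rebuilt = dict(json.loads(blob))   # rehashes under the current seed
```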
A remaining problem is how/if to provide users with a choice between different underlying hash functions (fast and dirty, or slow and secure). I haven't thought this through yet, but decided to create this issue anyhow, before it is again forgotten and left to an unknown future...
Motivation
Alternatives
I forgot one thing: hash values should be u64, not int, as soon as we have integer subtyping.
I think one of the points of conversation around hashing in Rust and other languages is that the default hash function should be fast, and people should pick a DoS-safe hash where necessary, like in code exposed to the public internet. Is this also your view?
Just found gxhash https://github.com/ogxd/gxhash
Seems super fast!?
Hash functions for hash tables have two requirements (cryptographic hash functions have more):
- they should be fast
- they should be collision-avoiding.
To focus on speed only is a mistake. The protection from DOS attacks relies on having a second input (apart from the key) which can be randomly varied; it is only loosely correlated with speed.
New hash functions are advertised all the time as the fastest ever and with good collision properties. I would not recommend trying a new hash function that is so far only implemented in Rust, requires SIMD and AES intrinsics (with no fallback if these are unavailable), and which has probably(?) so far been scrutinized only by its designer. Of course, I haven't scrutinized SipHash or xxHash either, but others have. We should not lead the pack of programming languages in adopting new hash functions. It is also instructive to read the gxhash paper, where the author explains his motivations: he works with "high-performance AdServing backends managing billions of auctions daily", which require hash operations with "execution times ranging from nanoseconds to microseconds". I think we can return to optimizing hash functions when we have witness inlining, unboxed arithmetic and a custom GC.
I think this is not easy. I am thinking about the problem and hope to come with a well-grounded proposal soon.
@sydow I too read about gxhash yesterday and ... well, I probably shouldn't have brought it up here. I just found the link while looking for other things but it's a distraction at this point.
Right, so some salt / seed input to avoid DoS is good. I seem to remember that Rust (or was it some other language?) used an overly complex, slower hashing algorithm just because they wanted better DoS resistance... and they later considered this a mistake.
Looking forward to your proposal :)
Zig, Nim, V and Go use Wyhash for hashing so I think Wyhash should be considered. There is a Zig implementation of it that seems fairly fast: https://github.com/ziglang/zig/pull/15969
I never finished the work that would let us call Zig functions. I should...
Sorry for being late to respond. Here are my comments:
- I think that, now that we see integer subtyping coming soon, fixing `__hash__` should wait for that.
- A major decision is whether to use one hash function for all types: convert objects to bytes and then use a general hash function that accepts an arbitrary number of bytes. I always thought that was the way to go, but I'm beginning to change my mind. Maybe we should continue to hash integers to themselves. It is unbeatably fast, and seems to work well in Python, where the original motivation was that the integer 15 and the float value 15.0 must hash to the same value since Python is untyped. (Then, obviously, one cannot use a general hash function.) But being fast is not everything; an identity hash function creates many collisions. Imagine a situation where all keys end in two zero bytes; then all keys hash to bucket zero for hash tables of sizes up to tens of thousands. I have written demo programs to show how things go wrong. Only they don't: Python has a quite clever rehashing strategy when collisions occur, so even quite catastrophic collision patterns are handled gracefully. For the moment, let's stick to hashing integers to themselves. For all other types, let's use a good general hash function.
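The CPython behaviour referred to above is easy to demonstrate: small non-negative integers hash to themselves, equal numbers of different types hash alike, and a "keys ending in two zero bytes" pattern really does put everything in bucket zero for modest table sizes, yet dict lookups stay fast thanks to CPython's perturb-based probing.

```python
# Small non-negative ints hash to themselves in CPython,
# and numerically equal values hash alike across types.
assert hash(15) == 15
assert hash(15.0) == hash(15)

# The feared pattern: keys whose low two bytes are zero all land in
# bucket 0 for any table size that divides 65536...
keys = [i * 65536 for i in range(1, 100)]
assert all(hash(k) % 1024 == 0 for k in keys)

# ...yet a dict with exactly these keys still works fine, because
# CPython's open addressing rehashes with a perturb sequence that
# eventually mixes in the high bits of the hash.
d = {k: str(k) for k in keys}
assert d[65536 * 7] == str(65536 * 7)
```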
- I think the choice between siphash, xxhash, wyhash or ... is not very important. It is very easy to change; they all have very similar interface.
- The part of the proposal that suggests changing Hashable, to make the protocol easier to implement for user-defined classes, is important. Please look into it and see what you think.