acton icon indicating copy to clipboard operation
acton copied to clipboard

On the choice of integer / value hashing algorithm

Open plajjan opened this issue 3 years ago • 4 comments

We should have a proper algorithm for hashing of ints and other things. We should probably aim for something DoS resistant too.

Conversation started in https://github.com/actonlang/acton/pull/992#issuecomment-1292166708:

The hash functions for int should be considered temporary: it uses only the last 64-bit digit. But this is part of the more general question to change all hash functions in builtin/hash.c. This was one of the first C files written in this project, and reiles too much on Python hash decisions. There are several much better proposals out there; what is annoying is that it is difficult to find good advice as to which of the modern alternatives to choose. This has partly to do with the question of whether protection against hash flooding attacks are important; when Acton conquers the clooud this may be the case...

@plajjan found this the other day: https://twitter.com/pcwalton/status/1583931446305038336

So .NET appears to switch from a simple & fast hashing algorithmm to something DoS resistant. Can we do the same?

plajjan avatar Oct 27 '22 07:10 plajjan

There is one obvious candidate: siphash. It is widely used and developed by people with a very good reputation. For proper DoS protection, it should be combined with a truly random key, generated from a random source at program startup. This also requires that we rethink serialization of dictionaries and sets; we will need to store a list of key/value pairs and a list of elements, respectively, and rebuild at deserialization. We cannot just memcopy the content of the hash table, since the hash function will have a new key the next time we run the program.

sydow avatar Oct 27 '22 12:10 sydow

I missed your link to the comment on siphash and Rust. Yes, my understanding is that siphash is a bit slow, in particular on long strings, which I doubt is our typical use case. But my understanding is not very well-founded...

sydow avatar Oct 27 '22 12:10 sydow

I think .NET is using xxHash and people are talking about it as a very fast hash algorithm: https://github.com/Cyan4973/xxHash

Is this idea of using one hash and switching to a more expensive one feasible? Like could we switch? How would we switch? I imagine we would want to decide this per table/dict, right? So fort most places we'd need a hash we can run with the fast one but if we notice lots of collisions in some dict we can switch to another?

I'm really not used to thinking about this problem so maybe I'm going about it all wrong.

plajjan avatar Oct 27 '22 22:10 plajjan

If we have to pick one, I agree that the safer siphash is a better choice.

plajjan avatar Oct 27 '22 22:10 plajjan