rfcs icon indicating copy to clipboard operation
rfcs copied to clipboard

Move Hashable to builtin package and make it easier to make objects Hashable

Open mfelsche opened this issue 5 years ago • 12 comments

Rendered

Suggested changes: https://github.com/ponylang/ponyc/compare/hashing-builtin?expand=1#diff-3fcbdccf4a6378f88bdb211e3838307bR9

mfelsche avatar Jan 06 '20 20:01 mfelsche

Is there anything in the standard library that is outside of the collections package that implements these interfaces?

SeanTAllen avatar Jan 06 '20 20:01 SeanTAllen

Only builtin classes like the numeric types, String and Pointer implement a hash or hash64() function. Nothing else inside of stdlib.

mfelsche avatar Jan 06 '20 20:01 mfelsche

During sync it was mentioned, that keeping hashing free from allocation is very important as hashing often happens in hot-paths.

It was discussed that moving Hashable etc to builtin has some use cases outside of the collections package. A suggestion was made to move hashing related logic (Hashable etc. except HashMap) into a separate hashing package and use it from collections. But as the RFC discussion in the comments already went along, given we use the hash_with approach, we would need the Hashing and HashState primitives/aliases available in builtin, if we want to apply this new approach to the builtin classes like numerics and String.

Will update the RFC accordingly.

mfelsche avatar Jan 07 '20 19:01 mfelsche

I did a POC of this RFC in on the hashing_builtin branch of the pony repo and created a small benchmark to compare the different approaches to hashing (main difference is in hashing numbers and using a pure pony siphash implementation everywhere): https://gist.github.com/mfelsche/da13a70d6e4048bd70800a46f23b7ef4

TL;DR: hashing numbers is around 2 times slower, but that actually doesnt affect e.g. HashMap.update. Hashing Strings seems to be a bit faster.

If those benchmarks seem odd to you, I can happily redo them. If those timings are not a blocker for anyone, I will go on and finalize this RFC and add the finishing changes.

mfelsche avatar Sep 07 '20 22:09 mfelsche

Do we have an understanding of what causes the perf difference for hashing numbers?

Is it more about the overhead of keeping the state, or is it more about per-operation overhead (e.g. safe arithmetic vs unsafe arithmetic)?

jemc avatar Sep 08 '20 16:09 jemc

@jemc the simple reason is that it now uses siphash as does string and array hashing (it also did before but was using a C implementation in libponyrt). Previously numbers where hashed by this small function:

    var x = u64()
    x = (not x) + (x << 21)
    x = x xor (x >> 24)
    x = (x + (x << 3)) + (x << 8)
    x = x xor (x >> 14)
    x = (x + (x << 2)) + (x << 4)
    x = x xor (x >> 28)
    x = x + (x << 31)
    x

I am unable to comment on the hashing quality of this. Whereas now it does full siphash 2-4 which consists of 6 rounds of this: https://github.com/ponylang/ponyc/compare/hashing-builtin#diff-3fcbdccf4a6378f88bdb211e3838307bR272-R286 and some more xoring and stuff.

As all hashing should operate on the same HashState, i didnt dare to mess with the given tuples and basically do the same as above on, say the first element of the tuple. I am not experienced enough to judge on what i am doing to the hashing quality of the used siphash in that case.

mfelsche avatar Sep 11 '20 18:09 mfelsche

Let me argue in favor of this RFC in case it comes to voting and i am absent:

So, while we hit a 2x performance penalty on hashing numbers, we gain a little on strings (i think because of llvm optimizations which might not be there in the compiled runtime), but most importantly we gain a means to do composite hashing of classes that has good quality, is not ad-hoc and error-prone, that is easy to use and teach, and, most importantly to me, it enables us to implement something like value classes or case classes (i can elaborate on that if necessary), with some simple syntax sugar.

mfelsche avatar Sep 11 '20 18:09 mfelsche

Given that HashMap and other collection types allow specifying a custom hash function, I'm tentatively okay with the idea of moving ahead with this RFC despite the potential performance hit for number-hashing cases - those applications for which this turns out to be a problem could use a more simplified / weaker number hashing function if they found that to be appropriate.

However, if anyone on hand feels qualified to comment, I'd love to figure out if we could somehow get the best of both worlds, and do something like this thing that you said you didn't dare to do, due to lack of knowing how hashing quality would be affected:

As all hashing should operate on the same HashState, i didnt dare to mess with the given tuples and basically do the same as above on, say the first element of the tuple. I am not experienced enough to judge on what i am doing to the hashing quality of the used siphash in that case.

I personally would not be qualified to make a decision like this either, without more time spent researching and understanding the topic.

jemc avatar Sep 11 '20 22:09 jemc

@Theodus @sylvanc can either of you weigh in on this?

@Theodus I'm not sure if you know much about hasing strategies, but I think you might.

SeanTAllen avatar Sep 15 '20 18:09 SeanTAllen

Based on discussion during sync, I think that the "how to teach it" needs to be updated to include documenting how to change the hash function for a given object so that if the number hashing is an issue (or hashing in general is an issue).

I'm not sure what that should be. It might be part of stdlib documentation, it might be part of the tutorial. But I think it is important as part of the change given possible performance or quality impacts.

I'd like to see this change made to the RFC and fleshed out before moving this to final comment period,

SeanTAllen avatar Sep 15 '20 18:09 SeanTAllen

Something else to help mitigate the performance issue that we discussed in the sync, which can also be used to help with teaching efforts:

We could add a new standard library package that includes something like SimpleHash and SimpleHashIs (or whatever other name we choose), which will use the old logic for number hashing (and by extension, object identity hashing), which anyone who observes a performance regression could easily switch to using in their code. And our teaching example could show how to use this package if you want to use the "simple hash" logic.

Note that the SimpleHash function should be able to be implemented as a primitive with a type parameter, using iftype on the type argument to switch between blocks of code based on which integer type was used.

jemc avatar Sep 15 '20 18:09 jemc

I had a look at how rust does it. There the hash method takes a trait Hasher as argument. Rust's default implementation uses SipHash 2-4 https://doc.rust-lang.org/std/hash/struct.SipHasher.html as we do. The key advantage of Siphash is that it can be initiated with a random key and thus safeguard against hash flooding attacks by hashing to different values depending on the initial random key used for the hasher. Rust addresses the problem of slow hashing on numbers and in general short sequences of bytes by making it possible to switch the Hasher to use when creating a HashMap which then passes it into the hash method of the thing to create a hash from. Rust namely advertises FNV to be used for hashing numbers. See here for some performance comparisons: https://cglab.ca/~abeinges/blah/hash-rs/

I would honestly just do roughly the same, and have the Hasher as a type parameter that defaults to SipHash. As discussed above, we should add a SipHasher and a FNVHasher and clearly document both in stdlib and in the tutorial when to use which. I am not sure, if we can have those hashers (that define their own state to drag along through calls of hash or hash_with) be primitives, but I would try to design it that way if possible.

I am sorry for not being able to participate in a tighter and more regular frequency.

mfelsche avatar Nov 03 '20 08:11 mfelsche