
Is elsa underspecified?

Open cbeck88 opened this issue 6 years ago • 2 comments

There are some things in the PMH routines that are a little confusing for me right now. Mainly:

  1. What is the data-type that we should use for seeds to elsa?
  2. What is the data-type that we should use for hashes obtained from elsa?

We are now using randomized seeds obtained from PRGs. In general, a user-provided PRG might produce any number of bits on a single draw: it might produce uint8_t, uint16_t, uint32_t, etc. But we seem to typically represent such draws as uint64_t and pass them to elsa this way.

This also means that changing the PRG might degrade a hash function's quality, only because the new PRG happens to produce smaller outputs, not because it has poorer statistical properties.

One thing we could do is use std::independent_bits_engine or similar to wrap the user-provided PRG. http://en.cppreference.com/w/cpp/numeric/random/independent_bits_engine

We could always use this engine and make it produce 64-bit seeds. Then, if the user gives us a PRG that produces 32-bit draws, the independent_bits_engine will invoke it twice to produce a 64-bit number; if the user-provided PRG produces 64-bit draws, we take those transparently.
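Roughly the kind of wrapping I mean, as a runtime sketch (std::minstd_rand just stands in for whatever PRG the user provides; since the pmh construction is constexpr we would presumably want a constexpr-friendly equivalent, but the idea is the same):

#include <cstdint>
#include <random>

// Sketch only: wrap an arbitrary engine so that every draw is a full
// 64-bit seed, regardless of the engine's native output width.
using user_prg = std::minstd_rand;  // stand-in for the user-provided PRG
using seed_source =
    std::independent_bits_engine<user_prg, 64, std::uint64_t>;

int main() {
  seed_source seeds{12345u};
  // Each call yields 64 random bits, invoking the underlying engine
  // as many times as needed to collect them.
  std::uint64_t seed = seeds();
  (void)seed;
}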

Another thing we could do is allow the user-provided elsa to request a specific seed type. It could provide a typedef seed_type, or a static constexpr unsigned num_seed_bits = ..., or both, and we could make the independent_bits_engine provide it seeds with that many random bits.


  3. I haven't re-implemented the multiply-shift hasher in master yet, in part because there is an aspect of how to do it that I think we should discuss.

One of the topics discussed in the multiply-shift literature is that the high-order bits have much better statistical properties than the low-order bits. The recommendation is that if you use this hash function and you need exactly m bits, you should use >> to shift the high bits down when you return.

This >> operation becomes free, because within the pmh algorithm we take the result of the user-provided hash and apply % M, where M = 2^m. If the user-provided hash gets inlined, the compiler can see that the % M is redundant after the shift, so we save an operation.
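For reference, roughly what that looks like once m is known at compile time (just a sketch of the idea; the seed is simply forced odd, which multiply-shift wants for its multiplier):

#include <cstdint>

// Sketch: multiply-shift for a table of size M = 2^m, with m known
// as a template parameter.
template <unsigned m>
struct multiply_shift_sketch {
  constexpr std::uint64_t operator()(std::uint64_t x,
                                     std::uint64_t seed) const {
    // Keep the m high-order bits of the product: they have the best
    // statistical properties, and the result is already < 2^m, so a
    // later % M is a no-op the compiler can drop.
    return (x * (seed | 1u)) >> (64u - m);
  }
};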

However, we can't actually do this in a "generic" user-provided elsa, because elsa has no way to know exactly how many bits m we care about; that depends on the size of the container.

In #24 I worked around this by making the multiply-shift hasher an internal detail: a "wrapper" that gets applied over the user-provided hash function. That wrapping only happens after we already know m, so m can be a template parameter.

It might be better to come up with a new spec for elsa so that users who want to write custom hash functions with access to that information can do so without modifying the library. There would also be less tricky stuff in our code that way. (I now think that the way I originally did it was a bit ugly.)

I'm not sure if I have a specific proposal right now though, it's something we can think about.

One thing we could do is allow the user to have two forms of elsa:

Basic form (backwards compatible):

template <>
struct elsa<foo> {
  constexpr uint64_t operator()(const foo &, uint64_t seed) const { ... };
};

Extended form:

template <>
struct elsa<foo> {
  // These are directives for pmh implementation:
  using seed_type = uint64_t;
  static constexpr auto seed_bits_requested = 64;

  // Could also be e.g.
  // using seed_type = uint_fast32_t;
  // static constexpr auto seed_bits_requested = 32;

  // template parameter Config contains extra info the pmh
  // implementation has to share with the hash function
  template <typename Config>
  constexpr uint64_t hash(const foo &, seed_type seed) const {
    // Now advanced user can use e.g. Config::num_bits_needed
    // to implement optimized hash functions with extra data at
    // compile-time, and we can stick more things in Config later
    // without breaking the API
  }
};

The idea then would be that frozen can use SFINAE to detect whether the basic form is available in Hasher, and if not, try the extended form. Or, more likely, we can transparently wrap the basic form in an implementation of the extended form, without bothering the novice user. (That would be similar to the AdaptedHash stuff in #24, particularly this commit: https://github.com/serge-sans-paille/frozen/pull/24/commits/669140beb8fd36394f6fc88e09a7832c94f34893. But at least such SFINAE magic would be limited in scope to reconciling the basic and extended forms, and not tied directly to hash-function implementation details like MultiplyShift.)
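To sketch the detection-and-wrapping part (the names here are made up for illustration, not actual frozen code):

#include <cstdint>
#include <type_traits>
#include <utility>

// C++14-friendly void_t.
template <typename... Ts> struct make_void { using type = void; };
template <typename... Ts> using void_t = typename make_void<Ts...>::type;

// Does Hasher provide the basic operator()(key, uint64_t) form?
template <typename Hasher, typename Key, typename = void>
struct has_basic_form : std::false_type {};

template <typename Hasher, typename Key>
struct has_basic_form<Hasher, Key,
    void_t<decltype(std::declval<const Hasher &>()(
        std::declval<const Key &>(), std::uint64_t{}))>> : std::true_type {};

// Adapter exposing a basic-form hasher through the extended interface,
// so the rest of the pmh code only ever sees the extended form.
template <typename Hasher>
struct adapted_hasher {
  Hasher inner;
  using seed_type = std::uint64_t;

  template <typename Config, typename Key>
  constexpr std::uint64_t hash(const Key &key, seed_type seed) const {
    return inner(key, seed);  // the basic form ignores Config
  }
};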

Let me know if you think this is a good direction!

Another thing we could do instead is pass Config by value; if it turns out to be an empty struct, gcc and clang will eliminate it during optimization. That's basically an API design / preference question. There may be some reason to do it that way instead of what I wrote above that I just haven't thought of yet.

cbeck88 avatar Dec 27 '17 18:12 cbeck88

As for 1), I see this as an internal detail and I'm not sure the user should be able to control it. What makes you think they should?

The hashes obtained from elsa are used to index an array, so your move from uint64_t to size_t looks good to me. I don't see why you would need to specialize elsa; just providing a custom hasher that performs multiply-shift would be enough, right?

I have this strange feeling I'm missing your point...

serge-sans-paille avatar Dec 27 '17 21:12 serge-sans-paille

Hi, sorry that I took a long time to respond; I've been distracted. The short version is that it isn't possible to implement multiply-shift hashing correctly unless you know how many bits you are supposed to output. You do the multiply, and then the high-order bits have the best statistical properties, so you shift them down to produce the result. If you don't know m at the time you do the hashing, you can't do the shift part. We could just do multiply hashing, but I expect it wouldn't work very well. We could try to reverse all the bits (highest to lowest) before we return, but that's kind of janky and I would expect it to hurt performance significantly.
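To illustrate the difference (just a sketch):

#include <cstdint>

// With m known, keeping the high-order bits is a single shift.
constexpr std::uint64_t top_bits(std::uint64_t h, unsigned m) {
  return h >> (64u - m);
}

// Without m, moving the "good" high bits to the bottom means reversing
// the whole 64-bit word on every hash, e.g. with a loop like this.
constexpr std::uint64_t reverse_bits(std::uint64_t h) {
  std::uint64_t r = 0;
  for (unsigned i = 0; i < 64; ++i) {
    r = (r << 1) | ((h >> i) & 1u);
  }
  return r;
}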

I can write again later about other ideas. I think the interface needs to develop somewhat if we want people to be able to get the best results.

cbeck88 avatar Jan 08 '18 05:01 cbeck88