internment icon indicating copy to clipboard operation
internment copied to clipboard

In Intern::new compute the hash before acquiring the mutex

Open stepancheg opened this issue 4 years ago • 10 comments
trafficstars

So there's smaller lock contention.

This also unlocks an option to do partitioning by hash, which should reduce contention greatly.

stepancheg avatar Nov 15 '21 07:11 stepancheg

Just a check: can you confirm that you've run benchmarks and this makes a difference? How big a difference does it make, and under what kind of workload?

droundy avatar Nov 15 '21 15:11 droundy

No I didn't do benchmark, sorry. It is not trivial to do testing with unreleased versions in our repo setup.

But there are more reasons to do this change. Another one is this: current implementation relies on Borrow. So it is possible to intern without allocation for example of String from str, but it is not possible to do interning of:

struct Pair(String, String)

from

struct PairRef(&str, &str)

because it is not possible to borrow Pair to PairRef. With switching away from HashSet to RawTable it is possible.

stepancheg avatar Nov 15 '21 16:11 stepancheg

I reimplemented the library in our project with two PRs I submitted and more changes.

This is the implementation. https://gist.github.com/stepancheg/9d2ebac23b27ff8d21a1bcf494b5ac7c

The speedup in our program (not just internment) is about 1% (edit). However, the speedup is against version 0.4.0.

stepancheg avatar Nov 15 '21 18:11 stepancheg

I'm wondering why https://docs.rs/hashbrown/0.11.2/hashbrown/hash_map/struct.HashMap.html#method.raw_entry_mut wouldn't have worked for this purpose? I don't like that the raw API describes itself as "unsafe and experimental". It seems like raw_entry_mut would give the advantages you describe. Am I missing something?

droundy avatar Nov 16 '21 01:11 droundy

raw_entry_mut should probably work. I didn't know such function exists.

stepancheg avatar Nov 16 '21 02:11 stepancheg

Okay I'm now thinking that this implementation could open users of internment up to hash collision attacks, since it uses a hash with DefaultHasher rather than the hasher for the HashMap, which means the RandomState is predictable and an attached who can generate values that are interned could generate many values with collisions. There are various use cases for internment that I could imagine where this attack path might be open. On the other hand, one might point out that since Intern leaks memory, it's not safe to let attackers generate unlimited Intern values in any case... Not sure if this matters, but I'm reluctant to sidestep the protections built into the std HashMap.

droundy avatar Nov 24 '21 13:11 droundy

I've confirmed from the docs that DefaultHasher::new() generates a SipHasher with zero for the keys, which means we would be vulnerable to DOS attacks. Probably your shouldn't let potential attackers provide data for internment anyways, but I'm also not comfortable forgoing the standard protection provided by the standard library.

I'll think about how to store a RandomState outside a Mutex so we can compute hashes securely before taking the lock.

droundy avatar Nov 25 '21 03:11 droundy

I'll think about how to store a RandomState outside a Mutex

static RANDOM_STATE: OnceCell<RandomState> = ... should not have noticeable performance overhead, no?

stepancheg avatar Nov 25 '21 06:11 stepancheg

Probably, and that's probably how I'd do this. If you want to give it a shot it might happen sooner.

David Roundy

On Wed, Nov 24, 2021, 10:20 PM Stepan Koltsov @.***> wrote:

I'll think about how to store a RandomState outside a Mutex

static RANDOM_STATE: OnceCell<RandomState> = ... should not have noticeable performance overhead, no?

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/droundy/internment/pull/28#issuecomment-978872823, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABBSKPMG27D62EUC7ER4TLUNXISNANCNFSM5IA542FA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

droundy avatar Nov 29 '21 17:11 droundy

BTW, the reason I'm picky about optimization changes is that I've been bitten in the past by "optimizations" which ended up introducing serious bugs (only discovered months later after they had corrupted many people's data), without even a demonstrated benefit. I doubt that's the case here (among other things, internment is unlikely to affect anyone's on-disk format), but want to ensure that I understand the trade-offs, and that we don't sacrifice my ability to fix any bugs either in this pull request, or that arise in the future.

droundy avatar Nov 29 '21 17:11 droundy