icu4x icon indicating copy to clipboard operation
icu4x copied to clipboard

Make a zero-copy hash map

Open sffc opened this issue 3 years ago • 1 comments

In the zerovec crate, we currently have ZeroMap, which does a binary-search lookup through the keys. This results in small code size and compact data layout, but it doesn't scale to maps with hundreds or thousands of keys. Moving forward, it would be good to have a ZeroHashMap in our toolbox as an option for when ZeroMap is a bottleneck.

Context: https://unicode-org.atlassian.net/browse/CLDR-15893

sffc avatar Sep 08 '22 16:09 sffc

Hmmm, I hadn't seen this issue before.

Hashmaps are super complicated; I'm not convinced we should be maintaining one. Do we have a strong use case yet where we have such a map?

Manishearth avatar Sep 21 '22 14:09 Manishearth

The use case that motivated me to file this issue was that CLDR wants to add hundreds of new parent locales. The current ZeroMap stores only a couple dozen, and it will choke if the CLDR issue gets accepted. There may be other models for storing this data that don't require a HashMap but I haven't thought it through thoroughly.

https://unicode-org.atlassian.net/browse/CLDR-15893

Procedurally, I should have marked this issue as "discuss" or similar; the intent was that whoever took the issue would come up with a design proposal that we discuss in the group before moving forward with an implementation. However, I'm pleasantly surprised with the approach @pdogr cooked up.

sffc avatar Sep 23 '22 09:09 sffc

Ah, that's quite reasonable. Are there other ways to do maps that might work? Perhaps a BTreeMap-like solution?

I suspect perfect hashing is the right solution here, but also mutation becomes far far worse for anything that's not a flat sorted map.

Manishearth avatar Sep 23 '22 18:09 Manishearth

It may be useful to think about this as a way to optionally annotate an existing ZeroMap. For example, if you load a ZeroMap, and it has a hash table along side it, and your ZeroMap client has hashing enabled, then you can use the hash table to speed up your lookup, and otherwise you fall back to binary search. (This doesn't work if we permute the keys.)

sffc avatar Oct 01 '22 16:10 sffc

That could work well provided we use perfect hashing which I believe is the plan.

Manishearth avatar Oct 02 '22 22:10 Manishearth

I think a really nice property of that system is that you can unwrap such a map to the underlying ZeroMap, mutate it, and then re-wrap it.

Though ZeroMap mutation is not expected to be cheap anyway, binary search based mutation will definitely be cleaner.

Manishearth avatar Oct 02 '22 22:10 Manishearth

Outstanding work tracked in #3128

robertbastian avatar Feb 17 '23 22:02 robertbastian