icu4x
icu4x copied to clipboard
Make a zero-copy hash map
In the zerovec crate, we currently have ZeroMap, which does a binary-search lookup through the keys. This results in small code size and compact data layout, but it doesn't scale to maps with hundreds or thousands of keys. Moving forward, it would be good to have a ZeroHashMap in our toolbox as an option for when ZeroMap is a bottleneck.
Context: https://unicode-org.atlassian.net/browse/CLDR-15893
Hmmm, I hadn't seen this issue before.
Hashmaps are super complicated; I'm not convinced we should be maintaining one. Do we have a strong use case yet where we have such a map?
The use case that motivated me to file this issue was that CLDR wants to add hundreds of new parent locales. The current ZeroMap stores only a couple dozen, and it will choke if the CLDR issue gets accepted. There may be other models for storing this data that don't require a HashMap but I haven't thought it through thoroughly.
https://unicode-org.atlassian.net/browse/CLDR-15893
Procedurally, I should have marked this issue as "discuss" or similar; the intent was that whoever took the issue would come up with a design proposal that we discuss in the group before moving forward with an implementation. However, I'm pleasantly surprised with the approach @pdogr cooked up.
Ah, that's quite reasonable. Are there other ways to do maps that might work? Perhaps a BTreeMap-like solution?
I suspect perfect hashing is the right solution here, but also mutation becomes far far worse for anything that's not a flat sorted map.
It may be useful to think about this as a way to optionally annotate an existing ZeroMap. For example, if you load a ZeroMap, and it has a hash table along side it, and your ZeroMap client has hashing enabled, then you can use the hash table to speed up your lookup, and otherwise you fall back to binary search. (This doesn't work if we permute the keys.)
That could work well provided we use perfect hashing which I believe is the plan.
I think a really nice property of that system is that you can unwrap such a map to the underlying ZeroMap, mutate it, and then re-wrap it.
Though ZeroMap mutation is not expected to be cheap anyway, binary search based mutation will definitely be cleaner.
Outstanding work tracked in #3128