HashTable grows even though capacity is not exceeded
The docs of `HashTable::with_capacity` say:

> The hash table will be able to hold at least `capacity` elements without reallocating.
The following test shows that this statement is incorrect:
```rust
use hashbrown::hash_table::{Entry, HashTable};

#[test]
fn foo() {
    let capacity = 100;
    let mut hash_table = HashTable::with_capacity(capacity);
    for i in 0u64..1_000 {
        dbg!(i);
        assert!(hash_table.len() < capacity);
        let hasher =
            |_: &_| unreachable!("hasher won't be called because there is capacity left");
        let eq = |j: &u64| *j == i;
        let Entry::Vacant(entry) = hash_table.entry(i, eq, hasher) else {
            unreachable!("actually unreachable");
        };
        entry.insert(i);
        if hash_table.len() == capacity {
            hash_table.find_entry(i, eq).unwrap().remove();
        }
    }
}
```
Expectation:
As the `assert!` shows, at the start of every loop iteration there is at least one slot of spare capacity. The `HashTable` therefore does not need to grow and will not call our hasher in `HashTable::entry`.

Reality: The `HashTable` attempts to grow and calls the hasher, triggering the `unreachable!`. The debug prints show that this happens at `i = 112`.

This is a problem because it prevents you from writing code that relies on `HashTable` not growing once enough capacity has been allocated. You should be able to write such code.
This is because `remove()` may leave a DELETED marker (a tombstone), which reduces the reported `capacity()`. Your assertion doesn't account for this, since it compares against the fixed initial capacity value.
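To illustrate the mechanism, here is a minimal simulation (not hashbrown's actual code, just a sketch of the accounting): a removal in the middle of a probe chain must leave a DELETED tombstone rather than an empty slot, and a tombstone's slot no longer counts toward the remaining growth budget, so the reported `capacity()` drops by one.

```rust
// Simplified model of open addressing with tombstones. capacity() is
// items + growth_left; a DELETED slot restores neither, so each tombstone
// reduces the reported capacity by one until a rehash reclaims it.
#[derive(Clone, Copy, PartialEq)]
enum Slot {
    Empty,
    Full(u64),
    Deleted,
}

struct Table {
    slots: Vec<Slot>,
    growth_left: usize,
    items: usize,
}

impl Table {
    fn with_capacity(cap: usize) -> Self {
        // Over-allocate buckets, like real tables do, to keep the load factor low.
        Table { slots: vec![Slot::Empty; cap * 2], growth_left: cap, items: 0 }
    }

    fn capacity(&self) -> usize {
        self.items + self.growth_left
    }

    fn insert(&mut self, key: u64) {
        // Linear probing for simplicity; tombstones are reused when possible.
        let start = key as usize % self.slots.len();
        for i in 0..self.slots.len() {
            let idx = (start + i) % self.slots.len();
            match self.slots[idx] {
                Slot::Full(_) => continue,
                Slot::Deleted => {
                    // Reusing a tombstone costs no growth budget.
                    self.slots[idx] = Slot::Full(key);
                    self.items += 1;
                    return;
                }
                Slot::Empty => {
                    self.slots[idx] = Slot::Full(key);
                    self.items += 1;
                    self.growth_left -= 1;
                    return;
                }
            }
        }
    }

    fn remove(&mut self, key: u64) {
        let start = key as usize % self.slots.len();
        for i in 0..self.slots.len() {
            let idx = (start + i) % self.slots.len();
            if self.slots[idx] == Slot::Full(key) {
                self.items -= 1;
                let next = (idx + 1) % self.slots.len();
                if self.slots[next] == Slot::Empty {
                    // No probe chain passes through: the slot can become Empty
                    // again and the growth budget is restored.
                    self.slots[idx] = Slot::Empty;
                    self.growth_left += 1;
                } else {
                    // A chain may continue past this slot: leave a tombstone.
                    // growth_left is NOT restored, so capacity() drops by one.
                    self.slots[idx] = Slot::Deleted;
                }
                return;
            }
        }
    }
}

fn main() {
    let mut t = Table::with_capacity(4);
    assert_eq!(t.capacity(), 4);
    // Keys 0 and 8 collide (8 % 8 == 0) and form one probe chain.
    t.insert(0);
    t.insert(8);
    // Removing the head of the chain leaves a tombstone behind:
    t.remove(0);
    assert_eq!(t.capacity(), 3); // reported capacity dropped below the initial 4
}
```

This is why the test's `hash_table.len() < capacity` assertion holds while the table's own `capacity()` has silently drifted below the initial value.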
I proposed a cleanup method in #255, but you would still need a proper hasher to use that.
There are two issues here.
- The documentation of `with_capacity` is wrong: the `HashTable` in the test never holds more than `capacity` elements, yet it reallocates.
- Inserting when capacity is left might move elements around unexpectedly.
There should be a way to initialize `HashTable` that guarantees the table will not grow unless you exceed the initially requested capacity. This is an important feature of collections. Currently `with_capacity` is buggy/misleading because it implies that guarantee but does not actually uphold it. All std collections' `with_capacity` methods do guarantee it. If `HashTable` needs more space than the given capacity to make this work, it is free to allocate that extra space up front in `with_capacity`. It is not free to allocate it later.
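For comparison, std's `Vec` upholds exactly this contract: pushing up to `capacity` elements never reallocates, which is observable because the buffer pointer stays stable.

```rust
fn main() {
    let capacity = 100;
    let mut v: Vec<u64> = Vec::with_capacity(capacity);
    let ptr_before = v.as_ptr();
    // Fill the Vec up to, but never past, the requested capacity.
    for i in 0..capacity as u64 {
        v.push(i);
    }
    // The buffer was never moved: no reallocation happened.
    assert_eq!(v.as_ptr(), ptr_before);
    assert!(v.capacity() >= capacity);
}
```

This is the behavior the test above expects from `HashTable::with_capacity`, with the added wrinkle that a hash table also has to account for load factor and (here) tombstones when sizing its allocation up front.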
I'm unsure whether there should be a guarantee that elements are not moved around while there is capacity left. I assumed `HashMap` worked this way, but I see now that this is not stated anywhere. You probably don't want to guarantee it, to keep more freedom in the implementation.
Is the only reason that capacity can go down that deletions leave unusable entries, which can only be reclaimed by rehashing the entire table?
Double insert of the same key leads to the same issue, but seemingly only when capacity is equal to length:

```rust
use std::collections::HashMap;

const PROBE_SIZE: usize = 7;

fn main() {
    let mut map = HashMap::with_capacity(PROBE_SIZE);
    for i in 0..PROBE_SIZE {
        map.insert(i, 0);
        map.insert(i, 0);
        eprintln!(
            "i = {i}, len = {}, capacity = {}",
            map.len(),
            map.capacity()
        );
    }
}
```

```text
i = 0, len = 1, capacity = 7
i = 1, len = 2, capacity = 7
i = 2, len = 3, capacity = 7
i = 3, len = 4, capacity = 7
i = 4, len = 5, capacity = 7
i = 5, len = 6, capacity = 7
i = 6, len = 7, capacity = 14
```
I think the issue is that the hash table does not check whether the key is already present before deciding to grow. The following only does the double insert on the last iteration and produces the same output as yours:

```rust
use std::collections::HashMap;

const PROBE_SIZE: usize = 7;

fn main() {
    let mut map = HashMap::with_capacity(PROBE_SIZE);
    for i in 0..PROBE_SIZE {
        map.insert(i, 0);
        if i == PROBE_SIZE - 1 {
            map.insert(i, 0);
        }
        eprintln!(
            "i = {i}, len = {}, capacity = {}",
            map.len(),
            map.capacity()
        );
    }
}
```
Right, it happens here:
- https://github.com/rust-lang/hashbrown/blob/v0.15.1/src/raw/mod.rs#L1212
I'm not really sure how to make it work out of the box without incurring performance penalties, but it seems like a good behavior to uphold.
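The control flow can be sketched roughly like this (a hypothetical simplification, not hashbrown's actual code): the table reserves room for one more element *before* probing for the key, so a duplicate insert at `len == capacity` triggers a grow even though nothing new would be stored.

```rust
// Hypothetical model of why a duplicate insert at len == capacity grows the
// table: the capacity check runs before the key lookup.
struct Table {
    len: usize,
    capacity: usize,
    grows: usize,
}

impl Table {
    fn insert(&mut self, key_already_present: bool) {
        // Step 1: ensure room for one more element, pessimistically assuming
        // the key is new.
        if self.len + 1 > self.capacity {
            self.capacity *= 2; // grow + rehash
            self.grows += 1;
        }
        // Step 2: only now probe for the key; a duplicate overwrites in place.
        if !key_already_present {
            self.len += 1;
        }
    }
}

fn main() {
    let mut t = Table { len: 0, capacity: 7, grows: 0 };
    for _ in 0..7 {
        t.insert(false); // fresh key
        t.insert(true);  // duplicate of the key just inserted
    }
    // The seventh fresh insert filled the table, so the duplicate insert that
    // followed grew it even though len never exceeded the capacity:
    assert_eq!(t.len, 7);
    assert_eq!(t.grows, 1);
}
```

Checking for the key first would avoid the spurious grow, but only at the cost of potentially probing twice (once before and once after reserving), which is presumably the performance penalty being alluded to.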
Like Abseil's original design, hashbrown uses tombstones and an anti-drift mechanism. The idea is to avoid performance degradation by forcing a grow + rehash once too many tombstones have accumulated. In practice, the map tries very hard to minimize tombstone usage and maximize their reuse. But when one is set, the map reduces its capacity by 1 to eventually trigger that rehash (on the next insert).

This is a bit counter-intuitive for a standard container and goes against the principle of least surprise, I'd say. There are other SIMD-based designs that don't use tombstones, but they are usually a little slower (or use 2 bytes of metadata per entry instead of 1).
This is also an issue with `shrink_to`, possibly because it uses `with_capacity` internally here:
- https://github.com/rust-lang/hashbrown/blob/64aa7d5343478efa9f4bd4b0f7de56110f6da860/src/raw/mod.rs#L918

After calling `shrink_to`, the hash map can grow even when the length never exceeds the `min_capacity` parameter.