
feat: recognize and use oversized allocations

morrisonlevi opened this issue 1 year ago

Allocators are allowed to return a larger memory chunk than was asked for. If the extra amount is large enough, the hash map can use the extra space. The Global allocator will not hit this path, because it won't oversize enough to matter, but custom allocators may. An example of an allocator which allocates full system pages is included in the test suite (Unix only, because it uses mmap).

This implements #489. This relies on PR #524 to increase the minimum number of buckets for certain small types, which in turn constrains the domain of maximum_buckets_in so that the alignment can be ignored.

I haven't done any performance testing. Since this sits on the already-slow path of making a new allocation, the feature should be feasible without much concern about overhead.

This is my first contribution to the project, and I am definitely not an expert in swiss tables. Feedback is very welcome, even nitpicking.

morrisonlevi avatar May 04 '24 21:05 morrisonlevi

Edit: this comment is out-of-date with the current implementation, left because it's interesting.


I've figured out the rough equation to do it without a loop:

fn maximum_buckets_in(allocation_size: usize, table_layout: TableLayout) -> usize {
    // Given an equation like:
    //   z >= x * y + x
    // x can be maximized by doing:
    //   x = z / (y + 1)
    // If you squint:
    //   z is the size of the allocation
    //   y is the table_layout.size
    //   x is the number of buckets
    // But there are details like x needing to be a power of 2,
    // and there are some extra bytes mixed in (a possible
    // rounding up for table_layout.align, and Group::WIDTH).
    // TODO: how do I factor in the ctrl_align?
    let z = allocation_size - Group::WIDTH;
    let y_plus_1 = table_layout.size + 1;
    prev_pow2(z / y_plus_1)
}

I'm not quite sure about the table_layout.ctrl_align. I need to think about that more. It seems like it can be ignored, but I haven't quite figured out why or the proof for it.

Edit: I tried to find a case where ignoring the ctrl_align caused a problem programmatically:

type T = (bool, ());
let table_layout = TableLayout::new::<T>();

let begin = {
    // there are never fewer than 4 buckets
    let (layout, _) = table_layout.calculate_layout_for(4).unwrap();
    layout.size()
};

use rayon::prelude::*;
(begin..=(1 << 47))
    .into_par_iter()
    .for_each(|allocation_size| {
        let buckets = maximum_buckets_in(allocation_size, table_layout).unwrap();
        let (layout, _) = table_layout.calculate_layout_for(buckets).unwrap();
        let size = layout.size();
        assert!(
            size <= allocation_size,
            "failed {size} <= {allocation_size}"
        );
    });

I ran it for quite some time with different TableLayouts. No issues on any of them. I think it has to do with the relationship between rounding down to the previous power of two and the fact that the rounding for align is always very small, in the range 0..ctrl_align.

morrisonlevi avatar May 05 '24 20:05 morrisonlevi

~~I found an issue in the previous implementation with very small TableLayout sizes and large ctrl_aligns. The optimization in #524 corrects the domain of maximum_buckets_in because buckets * table_layout.size (or x * y above) will now be at least ctrl_align, so we can safely ignore it in the equation.~~

Relevant PR has merged.

morrisonlevi avatar May 08 '24 04:05 morrisonlevi

I have rebased on the latest master and got the tests working.

morrisonlevi avatar Apr 21 '25 23:04 morrisonlevi

Could you write a quick benchmark to measure the added cost of allocating a new HashMap? Specifically just measuring HashMap::with_capacity(X) for different X values to see how much of an impact this has.

Amanieu avatar Apr 29 '25 14:04 Amanieu

FYI, I intend to work on this again the week of August 25-29, 2025.

morrisonlevi avatar Aug 05 '25 15:08 morrisonlevi

@Amanieu, I rebased my work, added a benchmark, and squashed my changes. This way anyone can check out the branch and compare to master by running:

# test this branch
cargo +nightly bench --bench with_capacity 

# checkout the benchmark commit which doesn't have any feature
# work on it yet.
git checkout --detach c1ddda229df96f7b8171609f3490128d8928e041

# test (pretty much master)
cargo +nightly bench --bench with_capacity 

# switch back to this branch
git switch -c oversized-allocations

This doesn't actually exercise the oversized path, since it's just using the system allocator. So the overhead being measured here is "the code performs the oversized check, but the check doesn't result in a larger hash table." Here are the formatted side-by-side results on my macOS M1:

| Capacity | With feature (ns/iter) | Reverted (master) (ns/iter) | Overhead vs master |
| ---: | ---: | ---: | ---: |
| 0 | 3.50 ± 0.23 | 3.54 ± 0.05 | -1.1% |
| 1 | 24.51 ± 0.60 | 21.18 ± 2.01 | +15.7% |
| 3 | 24.62 ± 0.45 | 21.38 ± 1.67 | +15.2% |
| 7 | 24.29 ± 0.71 | 21.83 ± 1.03 | +11.3% |
| 8 | 51.60 ± 3.30 | 48.44 ± 3.45 | +6.5% |
| 16 | 53.69 ± 2.72 | 51.28 ± 2.11 | +4.7% |
| 32 | 20.60 ± 0.96 | 18.62 ± 0.68 | +10.6% |
| 64 | 20.43 ± 0.44 | 19.18 ± 0.35 | +6.5% |
| 128 | 21.88 ± 0.40 | 20.57 ± 0.37 | +6.4% |
| 256 | 24.51 ± 1.71 | 25.11 ± 0.47 | -2.4% |
| 512 | 31.11 ± 0.44 | 31.26 ± 0.47 | -0.5% |
| 1,024 | 44.89 ± 4.70 | 44.21 ± 0.60 | +1.5% |
| 4,096 | 156.25 ± 36.71 | 143.55 ± 40.47 | +8.9% |
| 16,384 | 688.40 ± 30.35 | 685.02 ± 34.83 | +0.5% |
| 65,536 | 14,748.30 ± 1,930.86 | 14,053.07 ± 1,588.73 | +5.0% |

You can see results range from -2.4% to +15.7%. The results jibe with what I would have suspected: the overhead is proportionally higher at smaller sizes because the cost of the oversized check is roughly constant regardless of size, while at larger sizes the other work dominates.

Note that for the allocators I personally have in mind, these sizes are fairly realistic. For example:

  1. Use a fixed-size region of pre-allocated memory, such as static memory (e.g. 64 KiB).
  2. Round up to whole memory pages, which range from 4 KiB to 16 KiB depending on the OS, not counting large pages.

I can check more sizes if you are interested, particularly non-power-of-two sizes.

morrisonlevi avatar Sep 20 '25 16:09 morrisonlevi

Nice! Do you think it's worth having a fast path if the returned size is exactly the requested size? At least in the case of the global allocator this would be const-folded away due to inlining.

Amanieu avatar Sep 21 '25 23:09 Amanieu

I think so, especially as most hashbrown users in 2025 will not be using custom allocators. I added the fast path, and now on macOS you can't really tell any difference between the master branch and this one; it falls within the noise.

morrisonlevi avatar Sep 22 '25 21:09 morrisonlevi

@Amanieu Anything else you'd like to see from me?

morrisonlevi avatar Sep 25 '25 20:09 morrisonlevi

@Amanieu Thanks for your reviews. I've responded, please let me know if you have more questions or suggestions.

morrisonlevi avatar Oct 02 '25 03:10 morrisonlevi