
std HashMap with aHash significantly slower than hashbrown HashMap

Open hniksic opened this issue 3 years ago • 4 comments

Not sure where this should be reported, but I figured this crate might be a good starting point. Please let me know if it doesn't fit.

While looking into an unrelated task, I found a significant difference in performance between AHashMap and hashbrown::HashMap. I used this code to measure it:

fn main() {
    let hm: ahash::AHashMap<i32, i32> = (0..1_000_000).map(|i| (i, i)).collect();
    let t0 = std::time::Instant::now();
    let mut sum = 0;
    for i in 0..1_000_000 {
        if let Some(x) = hm.get(&i) {
            sum += x;
        }
    }
    let elapsed = t0.elapsed().as_secs_f64();
    println!("{} - The sum is: {}", elapsed, sum);
}

On my machine it reports taking 66 ms or so. Now change ahash::AHashMap to hashbrown::HashMap and the reported runtime drops to 35 ms, i.e. hashbrown is almost 2x faster.

This would normally not be a big deal, except that the two should, as I understand it, be the same thing. AHashMap<i32, i32> is a thin wrapper over std::collections::HashMap<i32, i32, ahash::RandomState>. And since stdlib uses hashbrown by default, and hashbrown defaults to aHash, hashbrown::HashMap<i32, i32> should be the same thing under the hood. So why is there a two-fold difference in performance?

More importantly, is there a way to get the performance of hashbrown::HashMap without pulling in the whole crate? (Since the implementation is the same, it feels like it should be enough to only pull this crate to get the appropriate hash function.)

Note that the performance of the std HashMap/ahash combo doesn't change if I replace the AHashMap wrapper with direct use of std::collections::HashMap<i32, i32, ahash::RandomState>. Likewise, the performance of hashbrown doesn't change when I explicitly specify hashbrown::HashMap<i32, i32, ahash::RandomState>.

hniksic avatar Jan 03 '22 22:01 hniksic

Given that the performance doesn't change when explicitly initializing the hasher, it seems like the perf difference is not coming from the hasher but from a difference between the HashMap implementations. I can think of two possible reasons for this:

  • Hashbrown has made improvements. Because std tends to lag behind what's in the hashbrown repo by quite a bit, it could be that improvements have been made that just aren't in std yet.
  • std and your hashbrown crates were compiled for different CPU targets. Hashbrown uses SIMD instructions to do lookups. It may be that the version of std you are using was compiled for a more generic CPU architecture, and as a result is using less optimal instructions than you get when building hashbrown directly.

If either or both of those are the case, there isn't really much ahash can do about it.
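
[Editor's note: a diagnostic sketch for the second theory above. The flags shown are standard rustc/cargo options; -Z build-std requires a nightly toolchain, and <host-triple> is a placeholder for your platform's target triple. The thread itself did not run these commands.]

```shell
# Compile the benchmark crate (and hashbrown, which is built from source
# as a dependency) for the host CPU rather than the generic baseline:
RUSTFLAGS="-C target-cpu=native" cargo build --release

# std itself ships precompiled for a generic target, so RUSTFLAGS alone
# won't touch it; on nightly, std can be rebuilt from source as well:
#   cargo +nightly build --release -Z build-std --target <host-triple>

# Inspect which target features a native build would enable:
rustc --print cfg -C target-cpu=native | grep target_feature
```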

tkaitchuck avatar Feb 11 '22 03:02 tkaitchuck

Hashbrown has made improvements

That occurred to me, but it is my understanding that the stdlib HashMap simply uses the one from hashbrown, i.e. that their implementations can't diverge because one uses the other.

std and your hashbrown crates were compiled for different CPU targets

I'd expect crates that involve generics to be compiled only when used, due to monomorphization. But maybe something in how std configures hashbrown differs from its default configuration when I use it directly. Not sure how to check this.

hniksic avatar Feb 13 '22 18:02 hniksic

Hashbrown has made improvements

That occurred to me, but it is my understanding that the stdlib HashMap simply uses the one from hashbrown, i.e. that their implementations can't diverge because one uses the other.

Might the stdlib specify a version constraint for hashbrown that does not include the latest version?

mpfaff avatar Feb 18 '22 16:02 mpfaff

Testing this locally I don't see the same effect that you do. I went through several ways to create a HashMap with aHash and observed a significant performance difference, but on my computer the AHashMap wrapper was the fastest. I suspect the issue has something to do with the particulars of the benchmark. For example, I noticed that when I changed the number of items in the map to "1_000" but left the loop counter at "1_000_000", so that it would be mostly testing access speed rather than insertion speed, a couple of times it ran impossibly fast. I think somehow the whole of the benchmark was getting inlined and the compiler was optimizing out all the lookups.
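
[Editor's note: one way to guard against the optimizer eliding the lookups, as described above, is to route the key and the result through std::hint::black_box (stable since Rust 1.66). A sketch of the hardened benchmark using std's HashMap; the same pattern would apply to AHashMap or hashbrown::HashMap. This is an illustration, not code from the thread.]

```rust
use std::collections::HashMap;
use std::hint::black_box;
use std::time::Instant;

// Perform n_lookups gets against keys 0..n_keys and sum the values.
// black_box hides the key from the optimizer so it cannot prove the
// access pattern and elide the hash-table lookups entirely.
fn sum_lookups(hm: &HashMap<i32, i32>, n_lookups: i32, n_keys: i32) -> i64 {
    let mut sum: i64 = 0;
    for i in 0..n_lookups {
        if let Some(x) = hm.get(black_box(&(i % n_keys))) {
            sum += i64::from(*x);
        }
    }
    // black_box on the result keeps the whole loop from being discarded
    black_box(sum)
}

fn main() {
    let hm: HashMap<i32, i32> = (0..1_000).map(|i| (i, i)).collect();
    let t0 = Instant::now();
    let sum = sum_lookups(&hm, 1_000_000, 1_000);
    println!("{} - The sum is: {}", t0.elapsed().as_secs_f64(), sum);
}
```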

tkaitchuck avatar Feb 27 '22 05:02 tkaitchuck