Performance discrepancy between `get!` and `haskey` + `setindex!`
MWE (tested on Julia 1.12.2):
julia> function foo1(vs::AbstractVector{Int}, ks::AbstractVector{Int})
d = Dict{Int, Int}()
for (k, v) in zip(ks, vs)
get!(d, k, v)
end
d
end
foo1 (generic function with 1 method)
julia> function foo2(vs::AbstractVector{Int}, ks::AbstractVector{Int})
d = Dict{Int, Int}()
for (k, v) in zip(ks, vs)
haskey(d, k) || setindex!(d, v, k)
end
d
end
foo2 (generic function with 1 method)
julia> ks = rand(1:200, 1000);
julia> vs = rand(1:500, 1000);
julia> using BenchmarkTools
julia> @benchmark foo1($vs, $ks)
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
Range (min … max): 4.733 μs … 1.276 ms ┊ GC (min … max): 0.00% … 99.29%
Time (median): 6.750 μs ┊ GC (median): 0.00%
Time (mean ± σ): 8.594 μs ± 26.948 μs ┊ GC (mean ± σ): 15.13% ± 8.56%
█▅▄▁ ▃▆▅▄▂▂▂▁▁▁▄▄▃▂▂▃▃▂▂▁▁▁ ▁▁ ▂
█████████████████████████████████▇▇█▆▇▆▅▇▆▆▅▅▄▅▄▅▄▁▄▄▄▄▄▁▃ █
4.73 μs Histogram: log(frequency) by time 23.5 μs <
Memory estimate: 23.15 KiB, allocs estimate: 17.
julia> @benchmark foo2($vs, $ks)
BenchmarkTools.Trial: 10000 samples with 8 evaluations per sample.
Range (min … max): 3.200 μs … 1.442 ms ┊ GC (min … max): 0.00% … 99.50%
Time (median): 3.775 μs ┊ GC (median): 0.00%
Time (mean ± σ): 6.035 μs ± 23.060 μs ┊ GC (mean ± σ): 14.87% ± 8.71%
█▅▄▄▃▁▁▅▅▂ ▁▄▄▄▃▁▁▁▁ ▁ ▁▁▁▁ ▂
██████████████████████████▇████████▇▇▇▆▅▅▄▅▃▁▄▅▃▄▅▄▃▃▁▄▄▃▄ █
3.2 μs Histogram: log(frequency) by time 19 μs <
Memory estimate: 23.17 KiB, allocs estimate: 17.
~Yes that's expected, and one of the motivations for get!. It allows to hash the key (and retrieve its position in the underlying vector) only once.~
EDIT: Oups, read to fast on the phone!
The idea with get! is that hashing should only be done once, but the benchmark says get! is slower than haskey || setindex!.
And by closer inspection, get! is defined as:
get!(t::AbstractDict, key, default) = get!(() -> default, t, key)
function get!(default::Callable, t::AbstractDict{K,V}, key) where K where V
key = convert(K, key)
if haskey(t, key)
return t[key]
else
return t[key] = convert(V, default())
end
end
So, in this case the hash is computed twice anyway.
No, that's not correct - that method is the fallback definition for AbstractDict, but Dict has an optimised implementation for get! which only hashes once.
From what I can see, the issue is that get! uses ht_keyindex2_shorthash!, which does more work than the ht_keyindex used by haskey, in order to not have to repeat the hashing etc. in the subsequent setindex!. This makes it slower, and also makes it not inline. In your example case, most integers are already present, so you're comparing essentially comparing ht_keyindex directly to ht_keyindex2_shorthash!.
Also worth noting that integer hashing is very fast, and unlikely to be a large fraction of the runtime of the two foo functions.
That would have been nice, however
julia> d = Dict{Int,Int}()
Dict{Int64, Int64}()
julia> @which get!(d, 2, 3)
get!(t::AbstractDict, key, default)
@ Base abstractdict.jl:562
julia> VERSION
v"1.12.2"
Which calls get!(() -> 3, d, 2) which calls into the dict.jl implementation.
julia> @which get!(() -> 3, d, 2)
get!(default::Union{Function, Type}, h::Dict{K, V}, key::K) where {K, V}
@ Base dict.jl:452
Ah, I misread the output from @less
Possible actionable points:
- Remove the double hashing in the
get!method. Double hashing occurs if the dict is concurrently mutated - instead, throw an exception since we don't support that. - Refactor
ht_keyindex2_shorthash!to not unconditionally include rehashing code. This can then be invoked atsetindex!if necessary - right now, the rehash code is included twice. - Improve effects on
ht_keyindex2_shorthash!using@assume_effects... not sure what this would do, actually.
The goal would be to reduce the code size of ht_keyindex2_shorthash! to make it inline, and to defer any mutating work to _setindex, since it may even need to be called.
I'm not convinced it's really worth it but if someone who really wants to review dict performance it may be of inferest.
Remove the double hashing in the get! method. Double hashing occurs if the dict is concurrently mutated - instead, throw an exception since we don't support that.
I don't think that is true, see https://github.com/JuliaLang/julia/issues/7944, https://github.com/JuliaLang/julia/issues/9573.