julia icon indicating copy to clipboard operation
julia copied to clipboard

Performance discrepancy between `get!` and `haskey` + `setindex!`

Open frankwswang opened this issue 2 weeks ago • 8 comments

MWE (tested on Julia 1.12.2):

julia> function foo1(vs::AbstractVector{Int}, ks::AbstractVector{Int})
           d = Dict{Int, Int}()
           for (k, v) in zip(ks, vs)
               get!(d, k, v)
           end
           d
       end
foo1 (generic function with 1 method)

julia> function foo2(vs::AbstractVector{Int}, ks::AbstractVector{Int})
           d = Dict{Int, Int}()
           for (k, v) in zip(ks, vs)
               haskey(d, k) || setindex!(d, v, k)
           end
           d
       end
foo2 (generic function with 1 method)

julia> ks = rand(1:200, 1000);

julia> vs = rand(1:500, 1000);

julia> using BenchmarkTools

julia> @benchmark foo1($vs, $ks)
BenchmarkTools.Trial: 10000 samples with 6 evaluations per sample.
 Range (min … max):  4.733 μs …  1.276 ms  ┊ GC (min … max):  0.00% … 99.29%
 Time  (median):     6.750 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   8.594 μs ± 26.948 μs  ┊ GC (mean ± σ):  15.13% ±  8.56%

  █▅▄▁ ▃▆▅▄▂▂▂▁▁▁▄▄▃▂▂▃▃▂▂▁▁▁   ▁▁                           ▂
  █████████████████████████████████▇▇█▆▇▆▅▇▆▆▅▅▄▅▄▅▄▁▄▄▄▄▄▁▃ █
  4.73 μs      Histogram: log(frequency) by time     23.5 μs <

 Memory estimate: 23.15 KiB, allocs estimate: 17.

julia> @benchmark foo2($vs, $ks)
BenchmarkTools.Trial: 10000 samples with 8 evaluations per sample.
 Range (min … max):  3.200 μs …  1.442 ms  ┊ GC (min … max):  0.00% … 99.50%
 Time  (median):     3.775 μs              ┊ GC (median):     0.00%
 Time  (mean ± σ):   6.035 μs ± 23.060 μs  ┊ GC (mean ± σ):  14.87% ±  8.71%

  █▅▄▄▃▁▁▅▅▂    ▁▄▄▄▃▁▁▁▁ ▁   ▁▁▁▁                           ▂
  ██████████████████████████▇████████▇▇▇▆▅▅▄▅▃▁▄▅▃▄▅▄▃▃▁▄▄▃▄ █
  3.2 μs       Histogram: log(frequency) by time       19 μs <

 Memory estimate: 23.17 KiB, allocs estimate: 17.

frankwswang avatar Dec 04 '25 03:12 frankwswang

~Yes that's expected, and one of the motivations for get!. It allows to hash the key (and retrieve its position in the underlying vector) only once.~ EDIT: Oups, read to fast on the phone!

rfourquet avatar Dec 04 '25 06:12 rfourquet

The idea with get! is that hashing should only be done once, but the benchmark says get! is slower than haskey || setindex!.

And by closer inspection, get! is defined as:

get!(t::AbstractDict, key, default) = get!(() -> default, t, key)
function get!(default::Callable, t::AbstractDict{K,V}, key) where K where V
    key = convert(K, key)
    if haskey(t, key)
        return t[key]
    else
        return t[key] = convert(V, default())
    end
end

So, in this case the hash is computed twice anyway.

sgaure avatar Dec 04 '25 08:12 sgaure

No, that's not correct - that method is the fallback definition for AbstractDict, but Dict has an optimised implementation for get! which only hashes once.

From what I can see, the issue is that get! uses ht_keyindex2_shorthash!, which does more work than the ht_keyindex used by haskey, in order to not have to repeat the hashing etc. in the subsequent setindex!. This makes it slower, and also makes it not inline. In your example case, most integers are already present, so you're comparing essentially comparing ht_keyindex directly to ht_keyindex2_shorthash!.

Also worth noting that integer hashing is very fast, and unlikely to be a large fraction of the runtime of the two foo functions.

jakobnissen avatar Dec 04 '25 11:12 jakobnissen

That would have been nice, however

julia> d = Dict{Int,Int}()
Dict{Int64, Int64}()

julia> @which get!(d, 2, 3)
get!(t::AbstractDict, key, default)
     @ Base abstractdict.jl:562

julia> VERSION
v"1.12.2"

sgaure avatar Dec 04 '25 12:12 sgaure

Which calls get!(() -> 3, d, 2) which calls into the dict.jl implementation.

julia> @which get!(() -> 3, d, 2)
get!(default::Union{Function, Type}, h::Dict{K, V}, key::K) where {K, V}
     @ Base dict.jl:452

KristofferC avatar Dec 04 '25 12:12 KristofferC

Ah, I misread the output from @less

sgaure avatar Dec 04 '25 12:12 sgaure

Possible actionable points:

  • Remove the double hashing in the get! method. Double hashing occurs if the dict is concurrently mutated - instead, throw an exception since we don't support that.
  • Refactor ht_keyindex2_shorthash! to not unconditionally include rehashing code. This can then be invoked at setindex! if necessary - right now, the rehash code is included twice.
  • Improve effects on ht_keyindex2_shorthash! using @assume_effects... not sure what this would do, actually.

The goal would be to reduce the code size of ht_keyindex2_shorthash! to make it inline, and to defer any mutating work to _setindex, since it may even need to be called.

I'm not convinced it's really worth it but if someone who really wants to review dict performance it may be of inferest.

jakobnissen avatar Dec 04 '25 12:12 jakobnissen

Remove the double hashing in the get! method. Double hashing occurs if the dict is concurrently mutated - instead, throw an exception since we don't support that.

I don't think that is true, see https://github.com/JuliaLang/julia/issues/7944, https://github.com/JuliaLang/julia/issues/9573.

KristofferC avatar Dec 04 '25 12:12 KristofferC