Countmap is 100x slower than it should be in some cases
```julia
julia> @btime countmap(x) setup=(x=rand(Int16.(1:10), 30));
  41.375 μs (6 allocations: 512.47 KiB)

julia> @btime countmap(x) setup=(x=rand(1:10, 30));
  266.611 ns (5 allocations: 848 bytes)
```
See https://github.com/JuliaStats/StatsBase.jl/pull/339. There is a specialized method for small integer types which is faster when there are a lot of elements.
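For intuition, here is a minimal sketch of the two strategies being traded off; the names `dict_count` and `array_count` are made up for illustration and this is not StatsBase's actual implementation:

```julia
# Generic strategy: one Dict lookup per element. Small fixed overhead,
# but hashing makes the per-element cost relatively high.
function dict_count(x::AbstractVector{T}) where {T}
    d = Dict{T,Int}()
    for v in x
        d[v] = get(d, v, 0) + 1
    end
    d
end

# Small-integer strategy: allocate one counter per possible value up front
# (65536 slots for Int16 -- hence the 512 KiB above), then index directly.
# Per-element cost is tiny, but zeroing the counts array is a large fixed cost.
function array_count(x::AbstractVector{T}) where {T<:Union{Int8,Int16}}
    offset = Int(typemin(T))
    counts = zeros(Int, Int(typemax(T)) - offset + 1)
    for v in x
        counts[Int(v) - offset + 1] += 1
    end
    Dict(T(i + offset - 1) => c for (i, c) in pairs(counts) if c > 0)
end
```

The fixed allocation cost dominates for short inputs, which is why the specialized method loses badly on 30 elements.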
Yeah, the runtime of both methods is O(n). The alternate method has a better multiplicative constant factor but a much worse additive constant factor, so it wins when there are many elements and loses when there are few. The figure below is a rough approximation of when each method is better. There are some methodological issues, but I think it's accurate to within an order of magnitude in all cases.
Code for figure
```julia
using StatsBase, Plots

function linear_regression(X, Y, x)
    X0 = mean(X); Y0 = mean(Y)
    m = sum((X .- X0) .* (Y .- Y0)) / sum((X .- X0).^2)
    (x - X0)*m + Y0
end

t(T, a, b) = (x = rand(T.(1:a), b); minimum(@elapsed(countmap(x)) for _ in 1:100))
f(a, b) = t(Int16, a, b) / t(Int64, a, b)

function g(a, target=1)
    b = 10
    while f(a, b) > target
        b *= 10
    end
    input_sizes = round.(Int, exp.(range(log(b/10), log(b), length=10)))
    time_ratios = f.(a, input_sizes)
    exp(linear_regression(log.(time_ratios), log.(input_sizes), log(target)))
end

x = floor.(exp.(2:.2:log(typemax(Int16))))
@time y_even = g.(x)
@time y_slowdown = g.(x, 3)
x_speedup = vcat(x[5:5:end-3], last(x))
@time y_speedup = g.(x_speedup, 1/3)

plot(x_speedup, y_speedup, label="3x speedup")
plot!(x, y_even, xlabel="range of values", ylabel="number of elements",
    xaxis=:log10, yaxis=:log10, xticks=7, yticks=7, label="even",
    title="When is the method added in #339 good?")
plot!(x, y_slowdown, label="3x slowdown")
ylims!(10^2, 10^7)
savefig("figure.png")
```
Thanks for benchmarking. It could make sense to use that method only when there are more than e.g. 10_000 or 100_000 entries. Is the threshold different for `Int16`, `Int8` and `Bool`?
`length(x) > Base.max_values(eltype(x))` seems like it would be a reasonable threshold (though we should avoid calling `Base.max_values` directly, because it is internal).
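A sketch of how that threshold could look without relying on internals; `range_size` and `use_counting_method` are hypothetical names, not part of StatsBase:

```julia
# Number of distinct values a small integer type can take
# (reimplemented so we don't depend on the internal Base.max_values).
range_size(::Type{T}) where {T<:Integer} = Int(typemax(T)) - Int(typemin(T)) + 1

# Only pay the fixed cost of the counting-array method when the input
# is longer than the number of possible values.
use_counting_method(x::AbstractVector{T}) where {T<:Integer} =
    length(x) > range_size(T)
```

For `Int8` this switches over at 256 elements, for `Int16` at 65536.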
Why not. So you mean that for Int8 the specialized method is faster for arrays longer than 256 elements (roughly)? Would you make a pull request?
I'm not eager to add heuristics and additional code complexity without robust performance analytics and unfortunately I don't have the bandwidth right now to conduct extensive benchmarking.
Given what you showed, I would say that a rough heuristic is better than the current situation, but as you prefer.