StatsBase.jl
StatsBase.jl copied to clipboard
document and update default choice between :dict and :radixsort
Add potential reasons to choose :dict
over :radixsort
as a counting algorithm to documentation, and update the default choice for small inputs.
Addresses #517
Hold pending https://github.com/JuliaCollections/SortingAlgorithms.jl/pull/51
After https://github.com/JuliaLang/julia/pull/44230 it makes sense to change the default behavior to use default sort, which also changes this PR, so now hold pending Julia-1.9.0.
Do you want to update this now that https://github.com/JuliaLang/julia/pull/44230 has been merged?
Here are some benchmarks that informed my commentary:
A value V > 1 indicates that :dict is V times faster than :radixsort. A value -V > 1 indicates that :radixsort is V times faster than :dict.
Benchmarking code:
julia> using BenchmarkTools, StatsBase, Plots
julia> function f(rng,len,alg)
evals = max(1, 10^4÷(len+5))
samples = max(1, 10^7÷(evals*(len+5)))
@belapsed countmap(x; alg=$alg) setup=(x=rand(1:$rng, $len)) evals=evals samples=samples gcsample=false gctrial=false
end
f (generic function with 1 method)
julia> function f(rng, len)
len >= 10^8 && rng >= 10^8 && return NaN # it would take too long
log(f(rng,len,:radixsort)/f(rng,len,:dict))
end
f (generic function with 2 methods)
julia> lengths = 10 .^ (0:8); ranges = 10 .^ (0:10);
julia> @time data = [f(r,l) for l in lengths, r in ranges];
78.848502 seconds (263.01 M allocations: 135.172 GiB, 7.51% gc time, 2.73% compilation time)
julia> data2 = [copysign(exp(abs(x)), x) for x in data];
julia> heatmap(ranges, lengths, data2, xscale=:log10, yscale=:log10, xlabel="range", ylabel="length")
Separately, the use of SortingAlgorithms.RadixSort causes increased complexity because we have to work around its bugginess and lack of generality. It would be better to switch to Base's default sorting algorithm and also not call the algorithm :radixsort, but simply :sort. That's for another day, though.
Thanks. So why did you revert the check that the length is higher than 100? Isn't :dict
faster in that case?
Separately, the use of SortingAlgorithms.RadixSort causes increased complexity because we have to work around its bugginess and lack of generality. It would be better to switch to Base's default sorting algorithm and also not call the algorithm :radixsort, but simply :sort. That's for another day, though.
Actually we should probably do this soon as it's needed to move these functions to Statistics (https://github.com/JuliaStats/Statistics.jl/issues/87#issuecomment-1703095224).
it used to be a 100x performance loss to use radixsort on small sizes, now it's only about 1.5x, not worth the added complexity and runtime variability imo.
Ah OK, you mean that thanks to https://github.com/JuliaCollections/SortingAlgorithms.jl/pull/63 we use the radix sort implementation from Base now.