StatsBase.jl icon indicating copy to clipboard operation
StatsBase.jl copied to clipboard

document and update default choice between :dict and :radixsort

Open LilithHafner opened this issue 2 years ago • 2 comments

Add potential reasons to choose :dict over :radixsort as a counting algorithm to documentation, and update the default choice for small inputs.

Addresses #517

LilithHafner avatar Oct 01 '21 03:10 LilithHafner

Hold pending https://github.com/JuliaCollections/SortingAlgorithms.jl/pull/51

LilithHafner avatar Oct 22 '21 21:10 LilithHafner

After https://github.com/JuliaLang/julia/pull/44230 it makes sense to change the default behavior to use default sort, which also changes this PR, so now hold pending Julia-1.9.0.

LilithHafner avatar Apr 05 '22 03:04 LilithHafner

Do you want to update this now that https://github.com/JuliaLang/julia/pull/44230 has been merged?

nalimilan avatar Sep 02 '23 16:09 nalimilan

Here are some benchmarks that informed my commentary:

Screenshot 2023-09-02 at 2 54 57 PM

A value V > 1 indicates that :dict is V times faster than :radixsort. A value -V > 1 indicates that :radixsort is V times faster than :dict.

Benchmarking code:
julia> using BenchmarkTools, StatsBase, Plots

julia> function f(rng,len,alg)
           evals = max(1, 10^4÷(len+5))
           samples = max(1, 10^7÷(evals*(len+5)))
           @belapsed countmap(x; alg=$alg) setup=(x=rand(1:$rng, $len)) evals=evals samples=samples gcsample=false gctrial=false
       end
f (generic function with 1 method)

julia> function f(rng, len) 
           len >= 10^8 && rng >= 10^8 && return NaN # it would take too long
           log(f(rng,len,:radixsort)/f(rng,len,:dict))
       end
f (generic function with 2 methods)

julia> lengths = 10 .^ (0:8); ranges = 10 .^ (0:10);

julia> @time data = [f(r,l) for l in lengths, r in ranges];
 78.848502 seconds (263.01 M allocations: 135.172 GiB, 7.51% gc time, 2.73% compilation time)

julia> data2 = [copysign(exp(abs(x)), x) for x in data];

julia> heatmap(ranges, lengths, data2, xscale=:log10, yscale=:log10, xlabel="range", ylabel="length")

Separately, the use of SortingAlgorithms.RadixSort causes increased complexity because we have to work around its bugginess and lack of generality. It would be better to switch to Base's default sorting algorithm and also not call the algorithm :radixsort, but simply :sort. That's for another day, though.

LilithHafner avatar Sep 02 '23 19:09 LilithHafner

Thanks. So why did you revert the check that the length is higher than 100? Isn't :dict faster in that case?

Separately, the use of SortingAlgorithms.RadixSort causes increased complexity because we have to work around its bugginess and lack of generality. It would be better to switch to Base's default sorting algorithm and also not call the algorithm :radixsort, but simply :sort. That's for another day, though.

Actually we should probably do this soon as it's needed to move these functions to Statistics (https://github.com/JuliaStats/Statistics.jl/issues/87#issuecomment-1703095224).

nalimilan avatar Sep 03 '23 14:09 nalimilan

it used to be a 100x performance loss to use radixsort on small sizes, now it's only about 1.5x, not worth the added complexity and runtime variability imo.

LilithHafner avatar Sep 03 '23 14:09 LilithHafner

Ah OK, you mean that thanks to https://github.com/JuliaCollections/SortingAlgorithms.jl/pull/63 we use the radix sort implementation from Base now.

nalimilan avatar Sep 03 '23 15:09 nalimilan