StatsBase.jl icon indicating copy to clipboard operation
StatsBase.jl copied to clipboard

optimized `sample!(replace=false)`for ranges?

Open stevengj opened this issue 3 years ago • 0 comments

As discussed on discourse, you can do much better than StatsBase.sample! for sampling ranges without replacement if you are only drawing a small number of samples.

In particular, the following is about 5x faster than sample!(1:3500, x, ordered=true, replace=false) on my machine for x = zeros(Int, 27):

function seqsample_trivial!(rng::AbstractRNG, a::AbstractRange, x::AbstractArray)
    k = length(x)
    while true
        sort!(rand!(rng, x, a), rev=step(a)<0)
        allunique = true
        for i = 2:k
            if x[i] == x[i-1]
                allunique = false
                continue
            end
        end
        allunique && return x
    end
end

The crossover point to use the trivial algorithm should be based on a birthday problem: k < sqrt(-2n*log(p)) where k=length(x), n=length(a), and p is some desired probability of succeeding on the single iteration of the while true loop above. Say p=0.5 from the benchmarks below, corresponding roughly to 10k^2 < 14n.

~~For the problem above of sampling 1:3500, the crossover point on my machine is about k ≈ 200, which corresponds to p ≈ 0.003 — it is faster even though it typically has to re-sample a over 200 times! This was pretty surprising to me — should sample! be so inefficient?~~ Oh, nevermind, the problem is that @btime reports the minimum time by default. When I look correctly at the mean time, the crossover point is at k≈70, which is about p ≈ 0.5, and that seems quite reasonable.

stevengj avatar Mar 23 '21 15:03 stevengj