StatsBase.jl
StatsBase.jl copied to clipboard
optimized `sample!(replace=false)`for ranges?
As discussed on discourse, you can do much better than StatsBase.sample!
for sampling ranges without replacement if you are only drawing a small number of samples.
In particular, the following is about 5x faster than sample!(1:3500, x, ordered=true, replace=false)
on my machine for x = zeros(Int, 27)
:
function seqsample_trivial!(rng::AbstractRNG, a::AbstractRange, x::AbstractArray)
k = length(x)
while true
sort!(rand!(rng, x, a), rev=step(a)<0)
allunique = true
for i = 2:k
if x[i] == x[i-1]
allunique = false
continue
end
end
allunique && return x
end
end
The crossover point to use the trivial algorithm should be based on a birthday problem: k < sqrt(-2n*log(p))
where k=length(x)
, n=length(a)
, and p
is some desired probability of succeeding on the single iteration of the while true
loop above. Say p=0.5
from the benchmarks below, corresponding roughly to 10k^2 < 14n
.
~~For the problem above of sampling 1:3500
, the crossover point on my machine is about k ≈ 200
, which corresponds to p ≈ 0.003
— it is faster even though it typically has to re-sample a
over 200 times! This was pretty surprising to me — should sample!
be so inefficient?~~ Oh, nevermind, the problem is that @btime
reports the minimum time by default. When I look correctly at the mean time, the crossover point is at k≈70
, which is about p ≈ 0.5
, and that seems quite reasonable.