Performance of samplers
Thanks to @simonbyrne's recent efforts, we already have some in-house samplers.
I just ran a systematic benchmark of the samplers implemented in this package (using BenchmarkLite.jl).
Here are the results. They are measured in MPS, that is, million samples per second; a larger number indicates higher throughput.
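For reference, MPS is just samples drawn divided by elapsed microseconds. A minimal illustration of the metric (a hypothetical helper, not code from the benchmark script):

# Hypothetical helper illustrating the reported metric, not part of the
# benchmark script: MPS = n / (elapsed seconds * 1e6).
function mps(f, n::Int)
    t = @elapsed for _ in 1:n
        f()
    end
    return n / (t * 1e6)
end

# e.g. mps(() -> rand(), 10_000_000)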
Categorical:

    K  |  direct    alias
--------------------------
    2  | 88.4558  13.4906
    4  | 16.1752  13.0807
    8  | 15.3739  13.3384
   16  | 14.2149  13.3299
   32  | 12.3867  13.2733
   64  | 10.0650  13.2782
  128  |  7.3457  13.3439
  256  |  4.7721  13.3459
  512  |  2.8074  13.4493
 1024  |  1.5340  13.4501
 2048  |  0.8097  13.3803
 4096  |  0.4103  13.3838
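The crossover around K ≈ 32 makes sense: direct sampling scans the probability vector, costing O(K) per draw, while the alias table answers in O(1) per draw after an O(K) setup. A minimal sketch of the direct method (illustrative, not the package's exact code):

# Direct categorical sampling by CDF inversion: scan the probability
# vector until the running sum exceeds a uniform draw. O(K) per sample.
function direct_categorical(p::AbstractVector{Float64})
    u = rand()
    c = 0.0
    for k in eachindex(p)
        c += p[k]
        u <= c && return k
    end
    return lastindex(p)  # guard against round-off when u ≈ 1
end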
Binomial:

(n, p)     |   rmath    alias     Geom     BTPE     Poly
---------------------------------------------------------
(2,0.3)    | 11.8487  12.7639   6.4979   3.7020   6.4290
(2,0.5)    | 11.9356  12.8119   4.5564   5.8752   4.5401
(2,0.9)    | 12.6947  12.6949  10.7287   3.7423  10.9793
(4,0.3)    | 11.4277  12.8164   4.1940   2.7137   4.1481
(4,0.5)    | 11.3381  12.8352   2.8394   5.9129   2.8481
(4,0.9)    | 12.4037  12.9807   8.3579   3.7877   8.3253
(8,0.3)    | 10.6993  12.9467   2.5408   5.9532   2.5291
(8,0.5)    | 10.3141  12.7722   1.6785   1.1590   1.6710
(8,0.9)    | 11.9544  12.9081   5.6124   3.6505   5.5750
(16,0.3)   |  9.7718  12.9083   1.4662   0.4141   1.4124
(16,0.5)   |  9.0507  12.9732   0.9400   2.1002   0.9370
(16,0.9)   | 11.2148  12.9855   3.5116   4.4885   3.3525
(32,0.3)   |  8.3341  13.0976   0.8067   2.1582   0.8043
(32,0.5)   |  7.0369  13.0287   0.5004   2.4892   0.4992
(32,0.9)   | 10.4527  13.1709   2.0597   5.9590   2.0518
(64,0.3)   |  6.3946  13.2282   0.4253   2.5309   0.4250
(64,0.5)   |  7.0507  13.1909   0.2590   2.7349   2.7035
(64,0.9)   |  9.3033  13.2035   1.1612   1.7338   1.1599
(128,0.3)  |  6.9509  13.2026   0.2180   2.7592   2.7444
(128,0.5)  |  7.3655  12.7911   0.1310   2.9881   2.9361
(128,0.9)  |  7.6165  13.2572   0.6239   2.2733   0.6288
(256,0.3)  |  7.2336  13.2470   0.1104   3.1682   3.1921
(256,0.5)  |  7.6335  13.2319   0.0651   3.2939   3.2998
(256,0.9)  |  5.5145  13.2989   0.3247   2.6314   2.6450
(512,0.3)  |  7.6033  13.3139   0.0558   3.3523   3.3676
(512,0.5)  |  8.1060  13.3286   0.0328   3.5462   3.5645
(512,0.9)  |  6.9150  13.3382   0.1652   2.9492   2.9943
(1024,0.3) |  8.2251  13.2799   0.0281   3.6457   3.6760
(1024,0.5) |  8.5567  13.2236   0.0170   3.7714   3.7975
(1024,0.9) |  7.2475  13.3995   0.0830   3.2097   3.2585
(2048,0.3) |  8.6872  13.3920   0.0137   3.8372   3.6625
(2048,0.5) |  8.9633  13.3610   0.0084   3.9388   3.8654
(2048,0.9) |  7.9270  13.3193   0.0420   3.5115   3.4960
(4096,0.3) |  8.8841  13.3026   0.0071   4.0173   3.9788
(4096,0.5) |  9.3080  13.3894   0.0043   4.1169   4.0903
(4096,0.9) |  8.3456  13.4372   0.0214   3.7510   3.7052
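The Geom column decays with n for a structural reason: geometric-gap sampling draws the waiting time between successes, so it needs roughly n*p + 1 iterations per sample. A rough sketch of the idea (not the package's exact implementation; assumes 0 < p < 1):

# Binomial sampling via geometric gaps between successes.
function binomial_geom(n::Int, p::Float64)
    x = 0  # number of successes so far
    y = 0  # index of the trial at which the last success occurred
    while true
        # geometric gap to the next success, drawn by inversion
        y += floor(Int, log(rand()) / log1p(-p)) + 1
        y > n && return x
        x += 1
    end
end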
Poisson:

  μ    |   rmath    count       AD
-----------------------------------
  0.2  | 15.3078  13.3549      NaN
  0.5  | 14.1964  10.4140      NaN
  1.0  | 13.8194   7.8061      NaN
  2.0  | 13.8571   5.2174      NaN
  5.0  | 13.4399   2.7030  20.6806
 10.0  | 11.9644   1.4928  20.6382
 15.0  | 12.1198   1.0314  21.1652
 20.0  | 12.0475   0.7907  22.0989
 30.0  | 12.2999   0.5376  22.6501
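The roughly 1/μ decay of the count column fits the classic counting method, which multiplies uniforms until the product drops below exp(-μ) and therefore consumes about μ + 1 uniforms per sample. A sketch (presumably the idea behind count, though not its exact code; note exp(-μ) underflows for very large μ):

# Knuth-style counting sampler for Poisson(μ).
function poisson_count(μ::Float64)
    L = exp(-μ)
    k = 0
    w = rand()
    while w > L
        k += 1
        w *= rand()
    end
    return k
end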
Exponential:

scale |    base    logu
------------------------
 1.0  | 17.0627  8.0938
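Presumably logu is plain inversion and base wraps Base's ziggurat-based randexp; the inversion variant is a one-liner (sketch):

# Inversion: if U ~ Uniform(0,1), then -log(U) ~ Exponential(1).
sample_exp(scale) = -scale * log(rand())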
Gamma:

(α, scale) |   rmath      MT
-----------------------------
(0.5,1.0)  |  7.8765  3.6338
(1.0,1.0)  |  9.8628  5.7221
(2.0,1.0)  | 10.0352  5.8673
(5.0,1.0)  | 12.2446  5.8372
(20.0,1.0) | 13.0004  5.9519
We are still falling behind Rmath (over 2x slower) for Binomial and Gamma distributions.
The benchmark script is perf/samplers.jl.
That's interesting, I get different results when using the standard rand methods:
using Distributions
using RmathDist
then after warmup
julia> @time rand(Gamma(5.0,1.0),10_000_000)
elapsed time: 0.415310414 seconds (80000152 bytes allocated)
julia> @time rand(Rmath(Gamma(5.0,1.0)),10_000_000)
elapsed time: 0.4185569 seconds (80000176 bytes allocated)
The gamma distribution still uses Rmath.
I temporarily turned off the use of GammaMTSampler in a recent refactoring and haven't turned it back on yet.
Ah, I see. But now I get:
julia> @time rand(Distributions.GammaMTSampler(5.0,1.0),10_000_000)
elapsed time: 0.501402219 seconds (80000168 bytes allocated)
which is only 20% worse.
That's probably because memory allocation and memory access factor in, so the difference is not so big.
What I did in the benchmark is basically the following:
# Not exactly the same code, but it does things the same way as the benchmark:
# draw n samples and discard them, so only the sampling cost is measured.
function batch_sample!(sampler, n::Int)
    for _ in 1:n
        rand(sampler)
    end
    return nothing
end
It purely does sampling, so no memory-related overhead is counted in the benchmark.
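To make the comparison apples-to-apples, one can time the array-returning call against a discard loop on the same sampler object (a hypothetical comparison; loop_only is just an illustrative helper, and the constructor form follows the usage earlier in this thread):

using Distributions

function loop_only(s, n::Int)
    for _ in 1:n
        rand(s)   # draw and discard: pure sampling cost
    end
end

s = Distributions.GammaMTSampler(5.0, 1.0)  # constructor as used above
@time rand(s, 10_000_000)       # includes allocating the ~80 MB result vector
@time loop_only(s, 10_000_000)  # pure sampling, no allocation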
This one is even stranger: your tests suggest the Poisson AD sampler is twice as fast, yet:
julia> a = Distributions.PoissonADSampler(20.0)
PoissonADSampler(20.0,4.47213595499958,2400.0,18)
julia> r = Distributions.PoissonRmathSampler(20.0)
PoissonRmathSampler(20.0)
(then after warmup)
julia> @time rand(a,10_000_000);
elapsed time: 0.634922107 seconds (80000128 bytes allocated)
julia> @time rand(r,10_000_000);
elapsed time: 0.440410505 seconds (80000128 bytes allocated)
I've added some more functionality to perf/samplers.jl:
The batch method just tests the performance of the iteration (old behaviour):
$ julia perf/samplers.jl batch gamma_hi
BenchmarkTable [unit = mps]
Dist                             |   rmath       MT       GD
-------------------------------------------------------------
(Gamma(shape=1.5, scale=1.0),)   | 17.0605  20.6627  14.9219
(Gamma(shape=2.0, scale=1.0),)   | 16.1565  20.0422  14.9253
(Gamma(shape=3.0, scale=1.0),)   | 16.4016  20.1534  14.3530
(Gamma(shape=5.0, scale=1.0),)   | 25.0876  21.1654  27.6697
(Gamma(shape=20.0, scale=1.0),)  | 29.6898  21.9201  34.6155
The indiv method tests the performance of construction plus iteration (this may be misleading for Rmath, as it uses static variables):
$ julia perf/samplers.jl indiv gamma_hi
BenchmarkTable [unit = mps]
Dist                             |   rmath       MT       GD
-------------------------------------------------------------
(Gamma(shape=1.5, scale=1.0),)   | 15.7482  19.1207   7.5798
(Gamma(shape=2.0, scale=1.0),)   | 15.7333  19.5103   7.7571
(Gamma(shape=3.0, scale=1.0),)   | 16.1120  19.2441   7.7500
(Gamma(shape=5.0, scale=1.0),)   | 24.3544  19.8842   9.8533
(Gamma(shape=20.0, scale=1.0),)  | 27.0493  20.6376  11.4383
There are obvious differences here. If the RNG is initialised to the same state, do we want both rand(d,N) and [rand(d) for i=1:N] to give identical results?
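The question is easy to test empirically (a sketch; seeding via the Random stdlib):

using Distributions, Random

d = Gamma(5.0, 1.0)

Random.seed!(1234)
a = rand(d, 1000)

Random.seed!(1234)
b = [rand(d) for _ in 1:1000]

a == b  # true only if both code paths consume the RNG in the same order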
Is this an issue worth keeping around? It likely can't be "fixed" directly. cc @simonbyrne @lindahua
Likely related to this: the perf folder is in a deprecated state and should either be removed or updated.
I'd support / help with work on this issue. I keep running into this when figuring out the best way to sample from Categorical distributions. It looks like the poly-algorithm in StatsBase.jl,
https://github.com/JuliaStats/StatsBase.jl/blob/41669cd8dfeadce35db0c4e07ac7afe5d10fb957/src/sampling.jl#L837
performs better as it has a nice rule of thumb (presumably based on the benchmarks from @lindahua at the beginning of this issue) for choosing the alias table vs direct sampling.
It would be nice to get the same performance no matter how you sample from a categorical 😄
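For illustration, the decision rule could look roughly like this (hypothetical; the actual thresholds live in the StatsBase source linked above, and the K ≈ 32 cutoff here is just read off the Categorical table at the top of this issue):

# Pick a categorical sampling strategy from K (number of categories)
# and the number of samples requested.
function choose_categorical_sampler(K::Int, n_samples::Int)
    if K < 32 || n_samples < K
        return :direct  # O(K) per draw, zero setup cost
    else
        return :alias   # O(K) setup, then O(1) per draw
    end
end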
This is a very old issue, but I just discovered that the cheap acceptance check in the GammaMTSampler seems to be incorrect: https://github.com/JuliaStats/Distributions.jl/pull/1617#discussion_r970098297
Fixing it improves performance quite significantly:
$ env gamma=true julia --project=. samplers.jl
[ Info: Gamma
[ Info: Low
[ Info: GammaGSSampler
[ Info: α: 0.1, result: Trial(30.570 ns)
[ Info: α: 0.5, result: Trial(45.151 ns)
[ Info: α: 0.9, result: Trial(49.529 ns)
[ Info: GammaIPSampler
[ Info: α: 0.1, result: Trial(24.658 ns)
[ Info: α: 0.5, result: Trial(25.022 ns)
[ Info: α: 0.9, result: Trial(24.698 ns)
[ Info: High
[ Info: GammaMTSampler
[ Info: α: 1.5, result: Trial(13.292 ns)
[ Info: α: 2.0, result: Trial(13.209 ns)
[ Info: α: 3.0, result: Trial(12.950 ns)
[ Info: α: 5.0, result: Trial(13.127 ns)
[ Info: α: 20.0, result: Trial(13.135 ns)
[ Info: GammaGDSampler
[ Info: α: 1.5, result: Trial(29.132 ns)
[ Info: α: 2.0, result: Trial(27.114 ns)
[ Info: α: 3.0, result: Trial(24.347 ns)
[ Info: α: 5.0, result: Trial(18.196 ns)
[ Info: α: 20.0, result: Trial(16.562 ns)
With the fix, GammaMTSampler seems to be faster than GammaGDSampler for all tested parameter values (both here and in https://github.com/JuliaStats/Distributions.jl/pull/1617).
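For context, the Marsaglia–Tsang method with its cheap "squeeze" check looks roughly like this for Gamma(α, 1) with α ≥ 1 (textbook form, not a copy of the package's code; the corrected expression used by Distributions is in the linked PR):

# Marsaglia–Tsang (2000): accept-reject around a transformed normal proposal.
function gamma_mt(α::Float64)
    d = α - 1/3
    c = 1 / sqrt(9d)
    while true
        x = randn()
        v = (1 + c * x)^3
        v <= 0 && continue                 # proposal outside the support; retry
        u = rand()
        # cheap (squeeze) check: accepts most proposals without any log calls
        u < 1 - 0.0331 * x^4 && return d * v
        # full acceptance check
        log(u) < x^2 / 2 + d * (1 - v + log(v)) && return d * v
    end
end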