Implement weighted sampling
- ~w/o replacement is currently implemented in R~
- w/ replacement uses either probabilistic sampling or the alias method
fixes #18 fixes #45 fixes #52
- [x] (more) checks / assertions
- [x] documentation
- [x] implement w/o replacement in C++
- [x] special case n = 2 (weighted coin)
- [ ] validate the
n < 1000 * sizecut-over point between bitset and hashset
Problematic benchmark from #52 looks much better now:
library(dqrng)
m <- 1e6
n <- 1e4
prob <- dqrunif(m)
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
dqsample.int(m, n, replace = TRUE, prob = prob),
check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob) 22.42ms 25.5ms 38.3
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob) 7.96ms 8.78ms 114.
m <- 1e1
prob <- dqrunif(m)
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
dqsample.int(m, n, replace = TRUE, prob = prob),
check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob) 227µs 245µs 3976.
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob) 113µs 125µs 7508.
Created on 2023-10-07 with reprex v2.0.2
However, there is still some potential for improvement in the case of uneven weight distribution:
library(dqrng)
m <- 1e6
n <- 1e4
prob <- dqsample(m)
prob[which.max(prob)] <- m * m
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
dqsample.int(m, n, replace = TRUE, prob = prob),
check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob) 18.3ms 20.5ms 47.5
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob) 21.7ms 22.5ms 43.0
m <- 1e1
prob <- dqsample(m)
prob[which.max(prob)] <- m * m
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
dqsample.int(m, n, replace = TRUE, prob = prob),
check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#> expression min median `itr/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob) 161µs 189µs 4914.
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob) 122µs 135µs 7011.
Created on 2023-10-07 with reprex v2.0.2
Interestingly the methods doing set-based rejection sampling from the last commit have better performance than the exponential rank. At least when size < n/2. Similar to the w/ replacement case it is a bit messy whether to use stochastic acceptance or the alias method.
For unweighted sampling the n < 1000 * size cut-over point between bitset and hashset is (still) valid, c.f. https://stubner.me/2023/10/algorithms-for-unweighted-sampling-without-replacement/
This is how benchmark results would change (along with a 95% confidence interval in relative change) if 43b718ddc15cf35cc21b5d7847d4a9cc869947ec is merged into main: Further explanation regarding interpretation and methodology can be found in the documentation.
This is how benchmark results would change (along with a 95% confidence interval in relative change) if 43b718ddc15cf35cc21b5d7847d4a9cc869947ec is merged into main: Further explanation regarding interpretation and methodology can be found in the documentation.
This is how benchmark results would change (along with a 95% confidence interval in relative change) if 128a3cddfb61908631968567f75acdba6525e3c1 is merged into main: Further explanation regarding interpretation and methodology can be found in the documentation.
This is how benchmark results would change (along with a 95% confidence interval in relative change) if c5c07e5fe0bed2853bfa11a0c1921200d74c8c94 is merged into main: Further explanation regarding interpretation and methodology can be found in the documentation.
Something to consider here as well: https://notstatschat.rbind.io/2024/08/26/another-way-to-not-sample-with-replacement/