dqrng icon indicating copy to clipboard operation
dqrng copied to clipboard

Implement weighted sampling

Open rstub opened this issue 2 years ago • 8 comments

  • ~w/o replacement is currently implemented in R~
  • w/ replacement uses either probabilistic sampling or the alias method

fixes #18 fixes #45 fixes #52

  • [x] (more) checks / assertions
  • [x] documentation
  • [x] implement w/o replacement in C++
  • [x] special case n = 2 (weighted coin)
  • [ ] validate the n < 1000 * size cut-over point between bitset and hashset

rstub avatar Oct 07 '23 20:10 rstub

Problematic benchmark from #52 looks much better now:

library(dqrng)

m <- 1e6
n <- 1e4
prob <- dqrunif(m)
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
                  dqsample.int(m, n, replace = TRUE, prob = prob),
                  check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#>   expression                                           min   median `itr/sec`
#>   <bch:expr>                                      <bch:tm> <bch:tm>     <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob)    22.42ms   25.5ms      38.3
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob)   7.96ms   8.78ms     114.


m <- 1e1
prob <- dqrunif(m)
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
                  dqsample.int(m, n, replace = TRUE, prob = prob),
                  check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#>   expression                                           min   median `itr/sec`
#>   <bch:expr>                                      <bch:tm> <bch:tm>     <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob)      227µs    245µs     3976.
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob)    113µs    125µs     7508.

Created on 2023-10-07 with reprex v2.0.2

However, there is still some potential for improvement in the case of uneven weight distribution:

library(dqrng)

m <- 1e6
n <- 1e4
prob <- dqsample(m)
prob[which.max(prob)] <- m * m
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
                  dqsample.int(m, n, replace = TRUE, prob = prob),
                  check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#>   expression                                           min   median `itr/sec`
#>   <bch:expr>                                      <bch:tm> <bch:tm>     <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob)     18.3ms   20.5ms      47.5
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob)   21.7ms   22.5ms      43.0


m <- 1e1
prob <- dqsample(m)
prob[which.max(prob)] <- m * m
bm <- bench::mark(sample.int(m, n, replace = TRUE, prob = prob),
                  dqsample.int(m, n, replace = TRUE, prob = prob),
                  check = FALSE)
bm[, 1:4]
#> # A tibble: 2 × 4
#>   expression                                           min   median `itr/sec`
#>   <bch:expr>                                      <bch:tm> <bch:tm>     <dbl>
#> 1 sample.int(m, n, replace = TRUE, prob = prob)      161µs    189µs     4914.
#> 2 dqsample.int(m, n, replace = TRUE, prob = prob)    122µs    135µs     7011.

Created on 2023-10-07 with reprex v2.0.2

rstub avatar Oct 07 '23 20:10 rstub

Interestingly the methods doing set-based rejection sampling from the last commit have better performance than the exponential rank. At least when size < n/2. Similar to the w/ replacement case it is a bit messy whether to use stochastic acceptance or the alias method.

rstub avatar Oct 12 '23 14:10 rstub

For unweighted sampling the n < 1000 * size cut-over point between bitset and hashset is (still) valid, c.f. https://stubner.me/2023/10/algorithms-for-unweighted-sampling-without-replacement/

rstub avatar Oct 24 '23 12:10 rstub

This is how benchmark results would change (along with a 95% confidence interval in relative change) if 43b718ddc15cf35cc21b5d7847d4a9cc869947ec is merged into main: Further explanation regarding interpretation and methodology can be found in the documentation.

github-actions[bot] avatar Apr 22 '24 20:04 github-actions[bot]

This is how benchmark results would change (along with a 95% confidence interval in relative change) if 43b718ddc15cf35cc21b5d7847d4a9cc869947ec is merged into main: Further explanation regarding interpretation and methodology can be found in the documentation.

github-actions[bot] avatar Apr 23 '24 17:04 github-actions[bot]

This is how benchmark results would change (along with a 95% confidence interval in relative change) if 128a3cddfb61908631968567f75acdba6525e3c1 is merged into main: Further explanation regarding interpretation and methodology can be found in the documentation.

github-actions[bot] avatar Apr 23 '24 17:04 github-actions[bot]

This is how benchmark results would change (along with a 95% confidence interval in relative change) if c5c07e5fe0bed2853bfa11a0c1921200d74c8c94 is merged into main: Further explanation regarding interpretation and methodology can be found in the documentation.

github-actions[bot] avatar Apr 27 '24 19:04 github-actions[bot]

Something to consider here as well: https://notstatschat.rbind.io/2024/08/26/another-way-to-not-sample-with-replacement/

rstub avatar Aug 26 '24 05:08 rstub