Optimal set sizes for ResamplingHoldout
Hi, I noticed that ResamplingHoldout can sometimes return train/test set sizes that are sub-optimal in the sense of the binomial likelihood. For example, with a set size of N = 2 (trivial, I know) and a train ratio of 0.7, the most likely train set size is 2 (with 0 test observations), which has probability 0.49:
```r
> dbinom(0:2, 2, 0.7)
[1] 0.09 0.42 0.49
```
However, current mlr3 master gives me 1 train and 1 test sample, which has probability 0.42, i.e. sub-optimal. I have added this simple example as a test case.
I know this is not an issue for big/real data, but it is an easy fix to make the code optimal: just use the formula for the mode of the binomial distribution, https://en.wikipedia.org/wiki/Binomial_distribution#Mode
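To illustrate, here is a minimal sketch of the mode-based computation (the function name `holdout_train_size` is hypothetical, not mlr3's actual internal name):

```r
# Most likely train set size when drawing n observations with
# probability `ratio` each: the mode of Binomial(n, ratio),
# which is floor((n + 1) * ratio).
# (If (n + 1) * ratio is an integer, both that value and
# that value minus 1 are modes; floor() picks the larger one.)
holdout_train_size <- function(n, ratio) {
  floor((n + 1) * ratio)
}

holdout_train_size(2, 0.7)  # 2, matching the 0.49-probability outcome above
```

For the N = 2, ratio = 0.7 example this returns 2 train observations rather than the 1 currently produced.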
Codecov Report
Merging #493 into master will increase coverage by 0.00%. The diff coverage is 100.00%.
```diff
@@            Coverage Diff            @@
##           master     #493    +/-   ##
=======================================
  Coverage   92.65%   92.66%
=======================================
  Files          75       75
  Lines        1934     1936      +2
=======================================
+ Hits         1792     1794      +2
  Misses        142      142
```
| Impacted Files | Coverage Δ |
|---|---|
| R/ResamplingHoldout.R | 100.00% <100.00%> (ø) |
Last update 6b86b8e...f838376.