
Optimal set sizes for ResamplingHoldout

tdhock opened this issue 5 years ago · 1 comment

Hi, I noticed that ResamplingHoldout can sometimes return train/test set sizes that are sub-optimal in the sense of the binomial likelihood. For example, for a set size of N = 2 (trivial, I know) and a train ratio of 0.7, the most likely train set size is 2 (with 0 test observations), which has probability 0.49:

> dbinom(0:2, 2, 0.7)
[1] 0.09 0.42 0.49

However, current mlr3 master gives me 1 train and 1 test sample, which has probability only 0.42, i.e. it is sub-optimal. I added this simple example as a test case.

I know this is not an issue for big/real data, but the fix to make the code optimal is easy: just use the formula for the mode of the binomial distribution, https://en.wikipedia.org/wiki/Binomial_distribution#Mode
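To illustrate, here is a minimal sketch of that fix; `binomial_mode_train_size` is a hypothetical helper name, not an mlr3 function. The mode of Binomial(n, p) is floor((n + 1) * p); when (n + 1) * p is an integer, both (n + 1) * p and (n + 1) * p - 1 are modes, and this sketch simply takes the larger one.

```r
# Hypothetical helper: most likely train-set size when each of the n
# observations lands in the train set independently with probability `ratio`.
# Mode of Binomial(n, p) is floor((n + 1) * p); for integer (n + 1) * p
# there are two modes and floor() picks the larger of the two.
binomial_mode_train_size <- function(n, ratio) {
  floor((n + 1) * ratio)
}

# N = 2, ratio = 0.7: the mode is 2 train samples (probability 0.49),
# whereas rounding 0.7 * 2 gives 1 train sample (probability only 0.42).
n_train <- binomial_mode_train_size(2, 0.7)
n_train                      # 2
dbinom(n_train, 2, 0.7)      # 0.49
dbinom(1, 2, 0.7)            # 0.42
```

The point is just that the mode formula dominates rounding for every n and ratio, since it maximizes the binomial probability by construction.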

tdhock avatar Apr 24 '20 21:04 tdhock

Codecov Report

Merging #493 into master will increase coverage by 0.00%. The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #493   +/-   ##
=======================================
  Coverage   92.65%   92.66%           
=======================================
  Files          75       75           
  Lines        1934     1936    +2     
=======================================
+ Hits         1792     1794    +2     
  Misses        142      142           
Impacted Files              Coverage Δ
R/ResamplingHoldout.R       100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 6b86b8e...f838376.

codecov-io avatar Oct 31 '20 02:10 codecov-io