model-implementation-principles
model-implementation-principles copied to clipboard
Check for integer values for arguments
I don't know if this fits within this repo so please bear with me.
I have been thinking about what happens when we check the type of the arguments we pass into our models, and for the most part that task is fairly easy. But I feel we have overlooked the case when we are concerned with argument such as times , epochs and mtry that require integers. Often the approach is to round down the double silently which I think is a little dangerous.
Examples
library(ranger)
ranger(
mpg ~ .,
data = mtcars,
mtry = 3.8,
num.trees = 20.2,
importance = 'impurity'
)
#> Ranger result
#>
#> Call:
#> ranger(mpg ~ ., data = mtcars, mtry = 3.8, num.trees = 20.2, importance = "impurity")
#>
#> Type: Regression
#> Number of trees: 20
#> Sample size: 32
#> Number of independent variables: 10
#> Mtry: 3
#> Target node size: 5
#> Variable importance mode: impurity
#> Splitrule: variance
#> OOB prediction error (MSE): 5.142068
#> R squared (OOB): 0.8584392
library(parsnip)
rand_forest(mode = "regression", mtry = 2.3) %>%
fit(mpg ~ .,
data = mtcars,
engine = "ranger")
#> parsnip model object
#>
#> Ranger result
#>
#> Call:
#> ranger::ranger(formula = formula, data = data, mtry = 2.3, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
#>
#> Type: Regression
#> Number of trees: 500
#> Sample size: 32
#> Number of independent variables: 10
#> Mtry: 2
#> Target node size: 5
#> Variable importance mode: none
#> Splitrule: variance
#> OOB prediction error (MSE): 6.186759
#> R squared (OOB): 0.829679
I feel like this problem is already being looked at in the vctrs package with vec_cast but might be something you should consider when implementing models in R.
I like this as a general principle for writing functions, and vctrs's lossy cast warning functions are pretty nice for this. What's nice about it is that integerish things like the numeric 2 don't trigger the warning but 2.2 would. I personally don't think this is specific enough to be a model principle, but could see it living as a general principle in the functions chapter of the next version of advanced R (but I have no control over that). Maybe Max has a different opinion though because it is a useful point overall.
Certainly I think we should have a section about input validation, potentially listing some easy tests to do to avoid mistakes like this.