Training a Linear Regression seems much slower with caret than with lm
Hey! Thanks for the great package! I am using caret to be able to use a wide range of models directly, and this is really easy, thanks to caret. However, I realized that fitting a Linear Regression using caret's train function was slower than fitting stats::lm directly (see the benchmark reported below). Is there anything I'm missing in how I am doing the training using caret? Here I don't want to tune any parameter nor perform any splitting of my data.
Thank you for your help!
Minimal, runnable code:
library(microbenchmark)
data("iris")
X <- iris[, -5]
base_lm <- function() {
  stats::lm(Petal.Width ~ ., data = X)
}
caret_lm <- function() {
  caret::train(Petal.Width ~ .,
    data = X,
    method = "lm",
    trControl = caret::trainControl(method = "none")
  )
}
res <- microbenchmark(NULL, base_lm(), caret_lm(), times = 50L)
print(res, unit = "ms")
#> Unit: milliseconds
#> expr min lq mean median uq
#> NULL 0.000009 0.000012 0.00003386 0.000030 0.000048
#> base_lm() 0.764566 0.854957 0.96419726 0.891547 0.945221
#> caret_lm() 154.067533 162.821262 196.87554234 164.597034 166.491618
#> max neval
#> 0.000077 50
#> 3.186349 50
#> 1766.592617 50
Session Info:
sessionInfo()
#> R version 4.1.2 (2021-11-01)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.04.3 LTS
packageVersion("caret")
#> [1] '6.0.91'
As far as I know, caret's train() function performs resampling by default while training the model (25 bootstrap resamples unless trainControl says otherwise). The lm() function of course doesn't, which is why the latter is much faster.
It seems that the caret::train function calls stats::lm only once here, since trainControl(method = "none") disables resampling. I wonder if the additional time is due to all the checks performed. I will try to dive deeper into this problem.
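One way to dig into where the time goes is R's sampling profiler. The snippet below is only a rough sketch: the number of repetitions and the sampling interval are arbitrary choices, and the exact functions that show up will depend on the caret version installed.

```r
library(caret)

data("iris")
X <- iris[, -5]

# Write profiler samples to a temporary file (interval chosen arbitrarily).
prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.005)

# Repeat the fit a few times so the profiler collects enough samples.
for (i in seq_len(20)) {
  fit <- train(Petal.Width ~ .,
    data = X,
    method = "lm",
    trControl = trainControl(method = "none")
  )
}

Rprof(NULL)  # stop profiling

# Functions ranked by total time spent in them, including their callees.
prof <- summaryRprof(prof_file)
head(prof$by.total, 15)
```

Functions near the top of by.total that are not lm itself are the candidates for overhead.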
I looked into this problem and found two reasons responsible for this performance issue:
- In my case, the bottleneck of the caret::train function is the call to system.time. Replacing it with two calls to proc.time divides the computation time by 10.
- Once the first bottleneck is removed, a second one appears: the call to getModelInfo. Hence, if the model is fitted many times, this causes some overhead. To pay this cost only once, getModelInfo can be called outside of the function, and the method argument can be filled directly with the result of getModelInfo. Doing that speeds up my simple example by a further factor of 10.
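The second point can be tried without modifying caret. The following is a minimal sketch, assuming that train accepts the list returned by getModelInfo as its method argument (the same mechanism used for custom models); the helper names fit_by_name and fit_by_info are made up for illustration.

```r
library(caret)

data("iris")
X <- iris[, -5]

# Look up the model definition once, outside the function that is called often.
lm_info <- getModelInfo("lm", regex = FALSE)[[1]]
ctrl <- trainControl(method = "none")

# Hypothetical helper: method given by name, so getModelInfo runs on every call.
fit_by_name <- function() {
  train(Petal.Width ~ ., data = X, method = "lm", trControl = ctrl)
}

# Hypothetical helper: method given as the pre-fetched model definition.
fit_by_info <- function() {
  train(Petal.Width ~ ., data = X, method = lm_info, trControl = ctrl)
}

# Both routes should produce the same fitted coefficients.
same_fit <- all.equal(
  coef(fit_by_name()$finalModel),
  coef(fit_by_info()$finalModel)
)

# Time repeated fits using two calls to proc.time().
t0 <- proc.time()
for (i in seq_len(20)) fit_by_info()
elapsed <- (proc.time() - t0)[["elapsed"]]
elapsed
```

This leaves the system.time call inside train untouched, so it only addresses the getModelInfo overhead; the measured gain will vary by machine and caret version.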
I would be glad to make a PR to fix the first point.