Large datasets
I'm not sure how common it is for folks to use train on large datasets, but my use case for caretEnsemble involves models with a minimum of 40k training observations. At that scale, we might need some additional checks built into the caretEnsemble methods so they stay sensible and performant with these large models.
Primarily this is currently an issue for autoplot.caretEnsemble, which makes a great-looking residual plot, but plotting 40k residuals takes far too long and isn't useful. It needs something like:
if (nrow(plotdf) > 1000) {
  plotdf <- plotdf[sample(nrow(plotdf), 1000), ]
}
I'll build this in and submit a PR.
This makes a lot of sense; I also usually use this with 40k+ observations, and sampling large datasets before plotting seems like the right approach.
Great. I'll drop this into a PR as well.
We can revisit later whether we should also do this for the summary methods in some cases. I'll profile the performance hit and look for speed improvements while we build out the model.trim functions.
Sounds good to me!
#81 might help with this too. I think do.call ends up serializing and copying a lot of data, so I'll investigate other ways of capturing and modifying the arguments headed to train (namely, trControl).
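As a rough illustration of the do.call concern (this is not caretEnsemble's actual internals; fake_train, big_args, X, and Y are made-up names), compare passing fully evaluated arguments through do.call with building an unevaluated call and editing just the trControl argument:

```r
# Hypothetical stand-in for caret::train, returning a tiny summary
fake_train <- function(x, y, trControl = list(method = "boot")) {
  list(n = length(y), method = trControl$method)
}

big_args <- list(x = matrix(rnorm(200), 100), y = rnorm(100))

# do.call embeds the fully evaluated x and y into the constructed call;
# with 40k+ rows this is where copying/serialization overhead can show up.
fit1 <- do.call(fake_train, c(big_args, list(trControl = list(method = "cv"))))

# Alternative: build an unevaluated call and modify only trControl,
# leaving the data arguments as symbols that are resolved at eval time
# rather than copied into the call object.
cl <- quote(fake_train(x = X, y = Y))
cl$trControl <- quote(list(method = "cv"))
X <- big_args$x
Y <- big_args$y
fit2 <- eval(cl)
```

Both paths end up fitting the same model with the overridden trControl; the difference is only in how much data gets baked into the call object along the way.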