
Large datasets

Open jknowles opened this issue 11 years ago • 4 comments

I'm not sure how commonly folks use train on large datasets, but my use case for caretEnsemble involves models with a minimum of 40k training observations. At this scale, we might need some additional checks built into the caretEnsemble methods so that they remain sensible and perform well with these large models.

Currently this is primarily an issue for autoplot.caretEnsemble, which makes a great-looking residual plot, but plotting 40k residuals takes far too long and is not useful. It needs a guard like:

if (nrow(plotdf) > 1000) {
  # downsample rows so the plot stays fast; sample() on a data frame
  # would sample columns, so index rows explicitly
  plotdf <- plotdf[sample.int(nrow(plotdf), 1000), ]
}

I'll build this in and submit a PR.
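A slightly fuller sketch of that guard as a reusable helper (the function name and the default of 1000 are hypothetical, not actual caretEnsemble code):

```r
# Hypothetical helper: cap how many rows feed a diagnostic plot.
# `max_points` is an illustrative default, not a caretEnsemble setting.
downsamplePlotData <- function(plotdf, max_points = 1000) {
  if (nrow(plotdf) > max_points) {
    # sample row indices, keeping the data frame shape intact
    plotdf <- plotdf[sample.int(nrow(plotdf), max_points), , drop = FALSE]
  }
  plotdf
}
```

Callers who want reproducible plots could call set.seed() before plotting, since the row subset is random.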

jknowles avatar Nov 17 '14 17:11 jknowles

This makes a lot of sense. I also usually use this with 40k+ observations, and sampling large datasets for plotting seems like the right approach.

zachmayer avatar Nov 17 '14 17:11 zachmayer

Great. I'll drop this into a PR as well.

We can revisit later whether we should do this for the summary methods as well in some cases. I'll profile the performance hit and look for speed improvements while we build out the model.trim functions.

jknowles avatar Nov 17 '14 22:11 jknowles

Sounds good to me!

zachmayer avatar Nov 17 '14 22:11 zachmayer

#81 might help with this too. I think do.call ends up serializing and copying a lot of data. I'll investigate other methods of capturing and modifying arguments headed to train (namely, trControl).
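One possible way to sidestep that (a sketch only, assuming the copying comes from do.call() substituting the full data objects into the call that train() records) is to build the call with quoted symbols instead of the data values; trainX and trainY are placeholder names:

```r
# Sketch: construct the call with symbols so the large data objects are
# not embedded verbatim in the recorded call (assumption about the cause
# of the copying; this is not the actual caretEnsemble approach).
trainX <- data.frame(x = rnorm(100))
trainY <- rnorm(100)
fitCall <- as.call(list(quote(caret::train), x = quote(trainX), y = quote(trainY)))
# model <- eval(fitCall)  # evaluates the call in the current environment
```

The recorded call then contains the symbols `trainX`/`trainY` rather than a deparsed copy of the data.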

zachmayer avatar Dec 12 '14 15:12 zachmayer