Reduce model storage size
Many model fit objects store copies of the training data and other large objects. See the discussion of trimming glm fits here: http://www.win-vector.com/blog/2014/05/trimming-the-fat-from-glm-models-in-r/. Our model fit objects should store only what's necessary for prediction. We should build a toolset similar to what's discussed in that blog post, and use it to validate that all learners store the minimal amount necessary.
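For reference, here is a rough sketch of the kind of trimming that blog post describes, applied to a plain glm fit. This is not the proposed sl3 toolset, just an illustration of which components can typically be dropped while predict() keeps working:

```r
# Illustrative only: strip heavy components from a glm fit while keeping
# enough to call predict() on new data (in the spirit of the linked post).
strip_glm <- function(fit) {
  fit$data <- NULL
  fit$y <- NULL
  fit$model <- NULL
  fit$residuals <- NULL
  fit$fitted.values <- NULL
  fit$effects <- NULL
  fit$linear.predictors <- NULL
  fit$weights <- NULL
  fit$prior.weights <- NULL
  fit$qr$qr <- NULL
  # formulas/terms capture their enclosing environment, which can be large;
  # resetting it is safe as long as the formula doesn't depend on that environment
  environment(fit$terms) <- globalenv()
  environment(fit$formula) <- globalenv()
  fit
}

fit <- glm(mpg ~ wt + hp, data = mtcars, family = gaussian())
small <- strip_glm(fit)
print(object.size(fit))
print(object.size(small))
predict(small, newdata = mtcars[1:3, ])
```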
Are there any updates or additional guidance on reducing model storage size while retaining the ability to predict and/or refit the algorithm? I'm running a fairly large Lrnr_sl() and would like to get cross-validated risks for the ensemble super learner using CV_lrnr_sl(). As far as I can tell, I need the sl fit object to do so, but it's about 15 GB. With various analyses/subanalyses, I have multiple such fit objects I need to save.
Unfortunately, we haven't prioritized this task, although I agree we should follow up on it.
If you're not already doing so, setting keepExtra=FALSE for Lrnr_sl will save a lot of memory.
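As a minimal sketch (using the argument name as referenced in this thread; double-check it against your installed sl3 version), that might look like:

```r
# Sketch only: build the ensemble without keeping extra fit material.
# The learners here are placeholders for your own library.
library(sl3)

sl <- Lrnr_sl$new(
  learners  = list(make_learner(Lrnr_glm), make_learner(Lrnr_mean)),
  keepExtra = FALSE  # flag referenced above; verify the exact name in your sl3 version
)

sl_fit <- sl$train(task)  # `task` is your existing sl3 task
```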
For the particular case of getting cross-validated risks for the ensemble, CV_lrnr_sl as currently implemented is particularly memory-inefficient, because it (temporarily) stores a separate SL fit for each external CV fold. If your underlying sl fit is 15 GB and you're doing 10-fold external cross-validation, that's roughly 165 GB (the original fit plus ten fold-specific fits). You could instead use origami to return an SE (or other risk metric) for each fold without storing the corresponding fit, which would use only marginally more than 15 GB. This would work similarly to the vignette here: https://tlverse.org/origami/articles/generalizedCV.html.
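A rough sketch of that approach, assuming squared-error loss; `my_data`, `covars`, `outcome`, and `my_sl` (a Lrnr_sl learner) are placeholders for your own data and ensemble:

```r
# Sketch: per-fold CV risk for the ensemble without accumulating fold fits.
library(origami)
library(sl3)

cv_sl_risk <- function(fold, data, covars, outcome, sl_learner) {
  train_data <- training(data)    # fold-specific training rows
  valid_data <- validation(data)  # fold-specific validation rows

  train_task <- sl3_Task$new(train_data, covariates = covars, outcome = outcome)
  valid_task <- sl3_Task$new(valid_data, covariates = covars, outcome = outcome)

  fit <- sl_learner$train(train_task)
  preds <- fit$predict(valid_task)
  risk <- mean((valid_data[[outcome]] - preds)^2)  # squared-error loss as an example

  rm(fit)  # the fold-specific fit is dropped here rather than stored
  list(risk = risk)
}

folds <- make_folds(nrow(my_data), V = 10)
results <- cross_validate(cv_sl_risk, folds,
                          data = my_data, covars = covars,
                          outcome = outcome, sl_learner = my_sl)
mean(results$risk)  # cross-validated risk for the ensemble
```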
Hope this helps, and sorry we haven't made much progress on the memory management stuff.
Thank you for the tips. I'll take a look at the generalizedCV example. Pretty sure I have keepExtra=FALSE, but I will double-check.
You've probably already considered this, or my reasoning may not make sense in the context of sl3, but I thought I would at least add the idea to this thread for reference in case it's helpful.
Recently, I ran into some processing time issues running a large glmnet job. Using a sparse matrix (as created by the makeX() function) vastly reduced both the size of the input data and the run time; I suspect the latter is a byproduct of having to copy less data to each core during parallel processing.
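To illustrate what I mean (placeholder data; only the glmnet calls are the point here):

```r
# Sketch: one-hot encode a data.frame into a sparse matrix for glmnet.
# `dat`, `covars`, and `y` are placeholders for your own data.
library(glmnet)

x_dense  <- makeX(dat[, covars])                 # standard dense model matrix
x_sparse <- makeX(dat[, covars], sparse = TRUE)  # sparse dgCMatrix instead
print(object.size(x_dense))
print(object.size(x_sparse))

fit <- cv.glmnet(x_sparse, y, family = "binomial")
```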
I'm not sure which of the other learners support sparse matrix input, or whether it would be practical to generalize storage of predictor data in this format throughout sl3, but it's perhaps an avenue to explore, with data.table remaining the output format for data sets returned directly to the user (as currently implemented).
@jrgant, finally getting to this. If you want to help, you can inspect some model fits for me:
- Install this branch:
devtools::install_github("tlverse/sl3@downsize-fit-objects")
- For individual learner fits for learners you use (i.e., not on a full Lrnr_sl), do print(sl3:::check_fit_sizes(fit)) to see which components of a fit_object are the worst offenders (as sketched below). Let me know what you're seeing for learners of interest, and I'll try to implement fit reduction for them.
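A minimal sketch of that step, with Lrnr_glm and the task as placeholders for whichever learners and data you're actually using:

```r
# Sketch only: inspect component sizes of one learner's fit_object.
# `my_data` and `covars` are placeholders.
devtools::install_github("tlverse/sl3@downsize-fit-objects")
library(sl3)

task <- sl3_Task$new(my_data, covariates = covars, outcome = "y")
fit <- make_learner(Lrnr_glm)$train(task)

# per-component sizes of fit$fit_object, per the branch above
print(sl3:::check_fit_sizes(fit))
```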
Thanks, I will try to run some tests during the next week or two and post the results here.
Sorry it took me so long to get to this.
I went ahead and made two files:
- learner_sizes.csv - A data.table storing the sizes of all component learners, as output by serialize() (see the sketch after this list). My understanding is that these values are reported in bytes.
- learner_element_props.rds - A named list containing the output of sl3:::check_fit_sizes() from the branch you provided.
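A rough sketch of how such a per-learner size table could be tabulated; `fits` here is a hypothetical named list of trained component learners, not sl3's actual internal layout:

```r
# Hypothetical example: tabulate serialized sizes (in bytes) of trained learners.
# `fit_glm` and `fit_ranger` stand in for whichever component fits you have.
library(data.table)

fits <- list(glm = fit_glm, ranger = fit_ranger)

learner_sizes <- data.table(
  learner    = names(fits),
  size_bytes = vapply(fits, function(f) as.numeric(length(serialize(f, NULL))),
                      numeric(1))
)
fwrite(learner_sizes, "learner_sizes.csv")
```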
The full Lrnr_sl object that contains these component fits is 19 GB, in case that helps make sense of these numbers.
Let me know if there's other info I can provide.