
Reduce model storage size

Open jeremyrcoyle opened this issue 7 years ago • 7 comments

A lot of models store copies of the data and other large objects; see the discussion here about trimming glm objects: http://www.win-vector.com/blog/2014/05/trimming-the-fat-from-glm-models-in-r/. Our model fit objects should store only what's necessary for prediction. We should build a toolset similar to what's discussed in that blog post, and use it to validate that all learners store only the minimal amount necessary.
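For reference, the trimming approach from that blog post looks roughly like the sketch below. This is illustrative, not an sl3 API: `strip_glm` is a name I made up, and exactly which components are safe to drop depends on how you call `predict()` afterwards (e.g. `summary()` will no longer work on the stripped fit).

```r
# Sketch of the glm-trimming idea: NULL out components that
# predict.glm() does not need for point predictions.
strip_glm <- function(fit) {
  fit$y <- NULL
  fit$model <- NULL
  fit$residuals <- NULL
  fit$fitted.values <- NULL
  fit$effects <- NULL
  fit$qr$qr <- NULL
  fit$linear.predictors <- NULL
  fit$weights <- NULL
  fit$prior.weights <- NULL
  fit$data <- NULL
  # formulas capture their enclosing environment, which can be huge
  attr(fit$terms, ".Environment") <- globalenv()
  attr(fit$formula, ".Environment") <- globalenv()
  fit
}

fit <- glm(mpg > 20 ~ wt + hp, data = mtcars, family = binomial)
small <- strip_glm(fit)
object.size(small) < object.size(fit)  # stripped fit is smaller
predict(small, newdata = mtcars[1:3, ], type = "response")  # still works
```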

jeremyrcoyle avatar Aug 09 '17 01:08 jeremyrcoyle

Are there any updates or additional guidance on reducing model storage size while retaining the ability to predict and/or refit the algorithm?

I'm running a fairly large Lrnr_sl() and would like to get cross-validated risks for the ensemble super learner using CV_lrnr_sl(). As far as I can tell, I need the sl fit object to do so, but it's about 15 GB. With various analyses/subanalyses, I have multiple such fit objects to save.

jrgant avatar Apr 07 '21 14:04 jrgant

Unfortunately, we haven't prioritized this task, although I agree we should follow up on it.

If you're not already doing so, setting keepExtra=FALSE for Lrnr_sl will save a lot of memory.

For the particular case of getting cross-validated risks for the ensemble, CV_lrnr_sl as currently implemented is particularly memory-inefficient, because it (temporarily) stores a separate SL fit for each external CV fold: if your underlying sl fit is 15 GB and you're doing 10-fold external cross-validation, you'd use 165 GB (the original fit plus ten fold-specific fits). You could instead use origami to return an SE (or other risk metric) for each fold without storing the corresponding fit, which would use only marginally more than 15 GB. This would work similarly to the vignette here: https://tlverse.org/origami/articles/generalizedCV.html
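The origami pattern I have in mind looks something like this sketch. The function name `cv_sl_risk`, the simulated data, and the particular learner stack are all illustrative, not part of sl3; the key point is that each fold function returns only a number, so the fold-specific SL fit is discarded as soon as the fold finishes.

```r
library(origami)
library(sl3)

# Simulated stand-in data; swap in your own analysis dataset.
set.seed(1)
my_data <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
my_data$y <- my_data$x1 + rnorm(200)

# One external fold: train an SL on the training split, return only
# the fold's risk (here MSE), not the fitted SL itself.
cv_sl_risk <- function(fold, data, covariates, outcome) {
  train_task <- make_sl3_Task(training(data),
                              covariates = covariates, outcome = outcome)
  valid_data <- validation(data)
  valid_task <- make_sl3_Task(valid_data,
                              covariates = covariates, outcome = outcome)
  sl <- Lrnr_sl$new(learners = list(Lrnr_glm$new(), Lrnr_mean$new()))
  fit <- sl$train(train_task)
  preds <- fit$predict(valid_task)
  list(risk = mean((valid_data[[outcome]] - preds)^2))
}

folds <- make_folds(n = nrow(my_data), V = 10)
results <- cross_validate(cv_sl_risk, folds, my_data,
                          covariates = c("x1", "x2"), outcome = "y")
results$risk  # one risk estimate per external fold, no stored fits
```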

Hope this helps, and sorry we haven't made much progress on the memory management stuff.

jeremyrcoyle avatar Apr 07 '21 14:04 jeremyrcoyle

Thank you for the tips. I'll take a look at the generalizedCV example. Pretty sure I have keepExtra=FALSE, but I will double-check.

jrgant avatar Apr 20 '21 15:04 jrgant

You've probably already considered this, or my reasoning may not make sense in the context of sl3, but I thought I would at least add the idea to this thread for reference in case it's helpful.

Recently, I ran into some processing-time issues running a large glmnet job. Using a sparse matrix (as created by the makeX function) vastly reduced both the size of the input data and the run time; I suspect the latter is a byproduct of having to copy less data to each core during parallel processing.

I'm not sure which of the other learners support sparse matrix input, or whether it would be practical to generalize storage of predictor data in this format throughout sl3, but it's perhaps an avenue to explore, with data.table remaining the output format for data sets returned directly to the user (as currently implemented).
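For concreteness, the sparse-input pattern I mean looks like this (the simulated data below is just for illustration; the savings show up when the design matrix has many zero-heavy columns, e.g. dummy-expanded factors):

```r
library(glmnet)

# Simulated data: a 20-level factor expands to many mostly-zero
# dummy columns, so sparse storage pays off.
set.seed(1)
df <- data.frame(g = factor(sample(letters[1:20], 5000, replace = TRUE)),
                 x = rnorm(5000))
y <- rnorm(5000)

x_dense  <- makeX(df)                 # ordinary dense matrix
x_sparse <- makeX(df, sparse = TRUE)  # sparse Matrix::dgCMatrix

object.size(x_sparse) < object.size(x_dense)  # sparse input is much smaller
fit <- glmnet(x_sparse, y)  # glmnet accepts the sparse matrix directly
```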

jrgant avatar Sep 03 '21 13:09 jrgant

@jrgant, finally getting to this. If you want to help, you can inspect some model fits for me:

  1. Install this branch: devtools::install_github("tlverse/sl3@downsize-fit-objects")
  2. For individual learner fits for learners you use (i.e. not on a full Lrnr_sl), run print(sl3:::check_fit_sizes(fit)) to see which components of a fit_object are the worst offenders. Let me know what you're seeing for learners of interest and I'll try to implement fit reduction for them.
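For anyone else following along, a minimal inspection session might look like the sketch below. Note that check_fit_sizes is internal to the branch above (hence the `:::`), so its exact output format may change; the task and learner here are just examples.

```r
# Requires the branch:
# devtools::install_github("tlverse/sl3@downsize-fit-objects")
library(sl3)

# Train a single learner on a small example task
task <- make_sl3_Task(mtcars, covariates = c("wt", "hp"), outcome = "mpg")
fit <- Lrnr_glm$new()$train(task)

# Report the size of each component of the learner's fit_object
print(sl3:::check_fit_sizes(fit))
```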

jeremyrcoyle avatar Oct 01 '21 18:10 jeremyrcoyle

Thanks, I will try to run some tests during the next week or two and post the results here.

jrgant avatar Oct 05 '21 21:10 jrgant

Sorry it took me so long to get to this.

I went ahead and made two files:

  1. learner_sizes.csv - A data.table storing the sizes of all component learners, as output by serialize(). My understanding is that these values are reported in bytes.
  2. learner_element_props.rds - A named list containing the output of sl3:::check_fit_sizes() from the branch you provided.
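For clarity on the units in learner_sizes.csv: serializing to a NULL connection returns a raw vector, so its length is the serialized size in bytes. This is roughly how I computed the sizes (the helper name is mine):

```r
# Serialized size, in bytes, of an arbitrary R object
serialized_bytes <- function(obj) {
  length(serialize(obj, connection = NULL))
}

serialized_bytes(mtcars)  # bytes needed to store mtcars serialized
```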

The full Lrnr_sl object that contains these component fits is 19 GB, in case that helps make sense of these numbers.

Let me know if there's other info I can provide.

jrgant avatar Feb 09 '22 22:02 jrgant