
Make step_bagimpute more scalable

Open saadaslam opened this issue 4 years ago • 1 comments

I'm finding step_bagimpute not particularly scalable for large datasets. This is not ideal given that bagged imputations are supposed to be the scalable alternative to imputing via KNN.

To illustrate, I'll use the credit_data dataset from modeldata. The prepped recipe that uses step_bagimpute has a memory footprint roughly 38x larger than the input dataset (9,688,976 B versus 254,336 B).

I believe this is due to the rpart objects nested inside the ipred bagged model objects. Each one retains its call and terms objects, which account for most of the increased object size.

A possible solution in my opinion is to butcher the individual rpart objects inside the bagged model objects. This reduces the memory footprint of the bagged tree (and eventually the prepped recipe).
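A minimal sketch of what that could look like when applied to a prepped recipe rather than to a standalone bagged model. The helper name `butcher_bag_steps()` is hypothetical and not part of recipes. It also assumes the bagged imputation step keeps its fitted ipred models in a `models` element, which is an internal detail that may differ between recipes versions.

```r
library(butcher)

# Hypothetical helper (not part of recipes): butcher every rpart tree stored
# inside the bagged imputation steps of a prepped recipe.
# ASSUMPTION: the step stores its ipred models in a list element `models`,
# and each model follows the mtrees -> btree layout used by ipred.
butcher_bag_steps <- function(prepped) {
  prepped$steps <- lapply(prepped$steps, function(step) {
    # The step class is "step_bagimpute" in the recipes version used here;
    # later releases renamed the step to step_impute_bag().
    if (inherits(step, c("step_bagimpute", "step_impute_bag"))) {
      step$models <- lapply(step$models, function(bag) {
        bag$mtrees <- lapply(bag$mtrees, function(tree) {
          tree$btree <- butcher(tree$btree)  # drop call, terms, etc.
          tree
        })
        bag
      })
    }
    step
  })
  prepped
}
```

With the `prepped` object from the reprex, `lighter_prepped <- butcher_bag_steps(prepped)` followed by `lobstr::obj_size(lighter_prepped)` would show whether the saving carries over. Comparing `bake(prepped, new_data = credit_data)` with `bake(lighter_prepped, new_data = credit_data)` is a reasonable check that the imputed values are unchanged.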

Another way to reduce the memory footprint might be an option for the prepped recipe to store a parsed model via tidypredict. I haven't tried that, and it may well be a worse solution than a butchered rpart object. In addition, tidypredict doesn't support bagged trees created via ipred. From what I understand, random forest packages cannot be used instead because they require missing values to be removed.

In any case, I tried the same experiment with credit_data sampled with replacement to get a dataset of a million rows, and the memory footprint of the prepped recipe is still roughly 38x larger than the input dataset.
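The resampling experiment is not part of the reprex, so here is a rough sketch of how it could be reproduced. The helper `footprint_ratio()` and its arguments are hypothetical names introduced for illustration. The recipe mirrors the one in the reprex and uses `step_bagimpute()` as in recipes 0.1.15 (later releases renamed it to `step_impute_bag()`).

```r
library(recipes)
library(modeldata)
library(lobstr)

data(credit_data)
credit_data <- tibble::as_tibble(credit_data)

# Hypothetical helper: resample credit_data with replacement to `n` rows,
# prep the same recipe as in the reprex, and return the size of the prepped
# recipe divided by the size of the resampled input.
footprint_ratio <- function(n, seed = 22) {
  set.seed(seed)
  idx <- sample(nrow(credit_data), size = n, replace = TRUE)
  big <- credit_data[idx, ]

  rec <- recipe(Status ~ ., data = big) %>%
    step_string2factor(all_nominal()) %>%
    step_bagimpute(Income, seed_val = seed)

  prepped <- prep(rec, training = big)

  as.numeric(obj_size(prepped)) / as.numeric(obj_size(big))
}
```

Calling `footprint_ratio(1e6)` corresponds to the million-row case described above and can take a while to run. A smaller value such as `footprint_ratio(1e4)` is a quicker way to see how the ratio behaves as the data grows.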

I'm including a reprex below.

library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 0.1.2 ──
#> ✓ broom     0.7.2      ✓ recipes   0.1.15
#> ✓ dials     0.0.9      ✓ rsample   0.0.8 
#> ✓ dplyr     1.0.2      ✓ tibble    3.0.4 
#> ✓ ggplot2   3.3.2      ✓ tidyr     1.1.2 
#> ✓ infer     0.5.3      ✓ tune      0.1.2 
#> ✓ modeldata 0.1.0      ✓ workflows 0.2.1 
#> ✓ parsnip   0.1.4      ✓ yardstick 0.0.7 
#> ✓ purrr     0.3.4
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter()  masks stats::filter()
#> x dplyr::lag()     masks stats::lag()
#> x recipes::step()  masks stats::step()
library(modeldata)
library(ipred)
library(lobstr)
library(butcher)

data(credit_data)
credit_data <- as_tibble(credit_data)

map_dfc(credit_data, ~ sum(is.na(.x)))
#> # A tibble: 1 x 14
#>   Status Seniority  Home  Time   Age Marital Records   Job Expenses Income
#>    <int>     <int> <int> <int> <int>   <int>   <int> <int>    <int>  <int>
#> 1      0         0     6     0     0       1       0     2        0    381
#> # … with 4 more variables: Assets <int>, Debt <int>, Amount <int>, Price <int>

rec_steps <- recipe(Status ~ ., data = credit_data) %>% 
  step_string2factor(all_nominal()) %>% 
  step_bagimpute(Income, seed_val = 22) 

prepped <- prep(rec_steps, training = credit_data)

obj_size(credit_data)
#> 254,336 B
obj_size(prepped)
#> 9,688,976 B

bake(prepped, new_data = NULL)
#> # A tibble: 4,454 x 14
#>    Seniority Home   Time   Age Marital Records Job   Expenses Income Assets
#>        <int> <fct> <int> <int> <fct>   <fct>   <fct>    <int>  <int>  <int>
#>  1         9 rent     60    30 married no      free…       73    129      0
#>  2        17 rent     60    58 widow   no      fixed       48    131      0
#>  3        10 owner    36    46 married yes     free…       90    200   3000
#>  4         0 rent     60    24 single  no      fixed       63    182   2500
#>  5         0 rent     36    26 single  no      fixed       46    107      0
#>  6         1 owner    60    36 married no      fixed       75    214   3500
#>  7        29 owner    60    44 married no      fixed       75    125  10000
#>  8         9 pare…    12    27 single  no      fixed       35     80      0
#>  9         0 owner    60    32 married no      free…       90    107  15000
#> 10         0 pare…    48    41 married no      part…       90     80      0
#> # … with 4,444 more rows, and 4 more variables: Debt <int>, Amount <int>,
#> #   Price <int>, Status <fct>

y_var <- credit_data$Income
X_frame <- credit_data %>% 
  select(-c(Income, Status))

bagged_2 <- ipredbagg(y_var, X = X_frame, keepX = FALSE)

obj_size(bagged_2)
#> 9,528,920 B

map(bagged_2, obj_size)
#> $y
#> 17,864 B
#> 
#> $X
#> 0 B
#> 
#> $mtrees
#> 18,338,712 B
#> 
#> $OOB
#> 56 B
#> 
#> $comb
#> 56 B

pluck(bagged_2, "mtrees") %>% 
  map(obj_size)
#> [[1]]
#> 9,900,408 B
#> 
#> [[2]]
#> 9,901,248 B
#> 
#> [[3]]
#> 9,899,928 B
#> 
#> [[4]]
#> 9,901,512 B
#> 
#> [[5]]
#> 9,900,024 B
#> 
#> [[6]]
#> 9,900,256 B
#> 
#> [[7]]
#> 9,901,488 B
#> 
#> [[8]]
#> 9,901,752 B
#> 
#> [[9]]
#> 9,899,960 B
#> 
#> [[10]]
#> 9,900,264 B
#> 
#> [[11]]
#> 9,901,360 B
#> 
#> [[12]]
#> 9,900,264 B
#> 
#> [[13]]
#> 9,900,344 B
#> 
#> [[14]]
#> 9,901,240 B
#> 
#> [[15]]
#> 9,899,360 B
#> 
#> [[16]]
#> 9,900,120 B
#> 
#> [[17]]
#> 9,902,624 B
#> 
#> [[18]]
#> 9,901,032 B
#> 
#> [[19]]
#> 9,902,072 B
#> 
#> [[20]]
#> 9,900,432 B
#> 
#> [[21]]
#> 9,903,784 B
#> 
#> [[22]]
#> 9,902,048 B
#> 
#> [[23]]
#> 9,899,784 B
#> 
#> [[24]]
#> 9,901,128 B
#> 
#> [[25]]
#> 9,899,904 B

pluck(bagged_2, "mtrees", 1) %>% 
  map(obj_size)
#> $bindx
#> 17,864 B
#> 
#> $btree
#> 9,882,136 B

rpart_objs <- map_depth(bagged_2$mtrees, 1, "btree")

weigh(rpart_objs[[1]])
#> # A tibble: 30 x 2
#>    object                 size
#>    <chr>                 <dbl>
#>  1 call                9.83   
#>  2 terms               9.56   
#>  3 where               0.0185 
#>  4 y                   0.0185 
#>  5 splits              0.00409
#>  6 functions.text      0.00213
#>  7 functions.summary   0.00202
#>  8 variable.importance 0.00175
#>  9 cptable             0.00132
#> 10 ordered             0.00107
#> # … with 20 more rows

lighter <- butcher(rpart_objs[[1]], verbose = TRUE)
#> ✓ Memory released: '9,508,312 B'
#> x Disabled: `summary()`, `printcp()`, `xpred.rpart()`

lighter_rparts <- map(rpart_objs, butcher)

bagged_3 <- bagged_2

bagged_3$mtrees <- map(
  bagged_2$mtrees,
  ~ {
    .x$btree <- butcher(.x$btree)
    .x
  }
)

obj_size(bagged_3)  
#> 1,304,040 B
obj_size(credit_data)  
#> 254,336 B

pred_3 <- predict(bagged_3, newdata = credit_data)
pred_2 <- predict(bagged_2, newdata = credit_data)

all.equal(pred_2, pred_3)
#> [1] TRUE

Created on 2021-01-25 by the reprex package (v0.3.0)

saadaslam, Jan 25 '21 23:01

Hello @saadaslam 👋

Sorry for taking so long to get back to you. This is definitely a good idea, and would likely be a big improvement.

EmilHvitfeldt, Mar 30 '23 21:03