Make step_bagimpute more scalable
I'm finding step_bagimpute not particularly scalable for large datasets. This is not ideal given that bagged imputations are supposed to be the scalable alternative to imputing via KNN.
To illustrate, I'll use the credit_data dataset from modeldata. What I'm finding is that the prepped recipe using step_bagimpute has a memory footprint roughly 38x larger than the input dataset.
I believe this is due to the rpart objects nested inside the ipred bagged model objects. They retain the call and terms objects, which contribute greatly to the increased object size.
A possible solution, in my opinion, is to butcher the individual rpart objects inside the bagged model objects. This reduces the memory footprint of each bagged tree (and ultimately of the prepped recipe).
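To make this concrete, here's a rough sketch of the kind of post-prep workaround I have in mind. I'm assuming the fitted bagged models live in prepped$steps[[2]]$models, which is a recipes internal and may differ between versions:

# Hypothetical workaround: butcher the rpart fit inside every bagged model
# stored by the imputation step (the steps[[2]]$models location is an
# assumption about recipes internals).
slim_bagged_models <- function(models) {
  purrr::map(models, function(bag) {
    bag$mtrees <- purrr::map(bag$mtrees, function(m) {
      m$btree <- butcher::butcher(m$btree)  # drops call, terms, etc.
      m
    })
    bag
  })
}
# prepped$steps[[2]]$models <- slim_bagged_models(prepped$steps[[2]]$models)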
Another way to reduce the memory footprint would be the ability to specify that the prepped recipe stores a parsed model via tidypredict, although I haven't tried that, and it may well be a worse solution than a butchered rpart object. Additionally, tidypredict doesn't support bagged trees created via ipred, and, from what I understand, random forest packages can't be used here because they require missing values to be removed.
In any case, I tried the same experiment with credit_data, but sampled with replacement to get a dataset of a million rows, and the memory footprint was still roughly 38x larger than the input dataset.
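For reference, the scaled-up dataset was built along these lines (a sketch; the exact seed and sampling call may differ):

library(dplyr)

set.seed(2021)  # hypothetical seed, not necessarily the one used
credit_data_big <- credit_data %>%
  slice_sample(n = 1e6, replace = TRUE)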
I'm including a reprex below.
library(tidymodels)
#> ── Attaching packages ────────────────────────────────────── tidymodels 0.1.2 ──
#> ✓ broom 0.7.2 ✓ recipes 0.1.15
#> ✓ dials 0.0.9 ✓ rsample 0.0.8
#> ✓ dplyr 1.0.2 ✓ tibble 3.0.4
#> ✓ ggplot2 3.3.2 ✓ tidyr 1.1.2
#> ✓ infer 0.5.3 ✓ tune 0.1.2
#> ✓ modeldata 0.1.0 ✓ workflows 0.2.1
#> ✓ parsnip 0.1.4 ✓ yardstick 0.0.7
#> ✓ purrr 0.3.4
#> ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter() masks stats::filter()
#> x dplyr::lag() masks stats::lag()
#> x recipes::step() masks stats::step()
library(modeldata)
library(ipred)
library(lobstr)
library(butcher)
data(credit_data)
credit_data <- as_tibble(credit_data)
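# Count missing values per column (Income has the most)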
map_dfc(credit_data, ~ sum(is.na(.x)))
#> # A tibble: 1 x 14
#> Status Seniority Home Time Age Marital Records Job Expenses Income
#> <int> <int> <int> <int> <int> <int> <int> <int> <int> <int>
#> 1 0 0 6 0 0 1 0 2 0 381
#> # … with 4 more variables: Assets <int>, Debt <int>, Amount <int>, Price <int>
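# Recipe that converts strings to factors and bag-imputes Income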
rec_steps <- recipe(Status ~ ., data = credit_data) %>%
  step_string2factor(all_nominal()) %>%
  step_bagimpute(Income, seed_val = 22)
prepped <- prep(rec_steps, training = credit_data)
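# Compare memory footprints: the prepped recipe is ~38x larger than the data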
obj_size(credit_data)
#> 254,336 B
obj_size(prepped)
#> 9,688,976 B
bake(prepped, new_data = NULL)
#> # A tibble: 4,454 x 14
#> Seniority Home Time Age Marital Records Job Expenses Income Assets
#> <int> <fct> <int> <int> <fct> <fct> <fct> <int> <int> <int>
#> 1 9 rent 60 30 married no free… 73 129 0
#> 2 17 rent 60 58 widow no fixed 48 131 0
#> 3 10 owner 36 46 married yes free… 90 200 3000
#> 4 0 rent 60 24 single no fixed 63 182 2500
#> 5 0 rent 36 26 single no fixed 46 107 0
#> 6 1 owner 60 36 married no fixed 75 214 3500
#> 7 29 owner 60 44 married no fixed 75 125 10000
#> 8 9 pare… 12 27 single no fixed 35 80 0
#> 9 0 owner 60 32 married no free… 90 107 15000
#> 10 0 pare… 48 41 married no part… 90 80 0
#> # … with 4,444 more rows, and 4 more variables: Debt <int>, Amount <int>,
#> # Price <int>, Status <fct>
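# Fit a standalone ipred bag to see where the memory goes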
y_var <- credit_data$Income
X_frame <- credit_data %>%
  select(-c(Income, Status))
bagged_2 <- ipredbagg(y_var, X = X_frame, keepX = FALSE)
obj_size(bagged_2)
#> 9,528,920 B
map(bagged_2, obj_size)
#> $y
#> 17,864 B
#>
#> $X
#> 0 B
#>
#> $mtrees
#> 18,338,712 B
#>
#> $OOB
#> 56 B
#>
#> $comb
#> 56 B
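# Each of the 25 bagged trees is roughly 10 MB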
pluck(bagged_2, "mtrees") %>%
map(obj_size)
#> [[1]]
#> 9,900,408 B
#>
#> [[2]]
#> 9,901,248 B
#>
#> [[3]]
#> 9,899,928 B
#>
#> [[4]]
#> 9,901,512 B
#>
#> [[5]]
#> 9,900,024 B
#>
#> [[6]]
#> 9,900,256 B
#>
#> [[7]]
#> 9,901,488 B
#>
#> [[8]]
#> 9,901,752 B
#>
#> [[9]]
#> 9,899,960 B
#>
#> [[10]]
#> 9,900,264 B
#>
#> [[11]]
#> 9,901,360 B
#>
#> [[12]]
#> 9,900,264 B
#>
#> [[13]]
#> 9,900,344 B
#>
#> [[14]]
#> 9,901,240 B
#>
#> [[15]]
#> 9,899,360 B
#>
#> [[16]]
#> 9,900,120 B
#>
#> [[17]]
#> 9,902,624 B
#>
#> [[18]]
#> 9,901,032 B
#>
#> [[19]]
#> 9,902,072 B
#>
#> [[20]]
#> 9,900,432 B
#>
#> [[21]]
#> 9,903,784 B
#>
#> [[22]]
#> 9,902,048 B
#>
#> [[23]]
#> 9,899,784 B
#>
#> [[24]]
#> 9,901,128 B
#>
#> [[25]]
#> 9,899,904 B
pluck(bagged_2, "mtrees", 1) %>%
map(obj_size)
#> $bindx
#> 17,864 B
#>
#> $btree
#> 9,882,136 B
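# Extract the rpart fits and weigh their components; call and terms dominate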
rpart_objs <- map_depth(bagged_2$mtrees, 1, "btree")
weigh(rpart_objs[[1]])
#> # A tibble: 30 x 2
#> object size
#> <chr> <dbl>
#> 1 call 9.83
#> 2 terms 9.56
#> 3 where 0.0185
#> 4 y 0.0185
#> 5 splits 0.00409
#> 6 functions.text 0.00213
#> 7 functions.summary 0.00202
#> 8 variable.importance 0.00175
#> 9 cptable 0.00132
#> 10 ordered 0.00107
#> # … with 20 more rows
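# Butchering a single rpart fit releases ~9.5 MB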
lighter <- butcher(rpart_objs[[1]], verbose = TRUE)
#> ✓ Memory released: '9,508,312 B'
#> x Disabled: `summary()`, `printcp()`, `xpred.rpart()`
lighter_rparts <- map(rpart_objs, butcher)
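# Rebuild the bagged model with butchered trees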
bagged_3 <- bagged_2
bagged_3$mtrees <- map(
  bagged_2$mtrees,
  ~ {
    .x$btree <- butcher(.x$btree)
    .x
  }
)
obj_size(bagged_3)
#> 1,304,040 B
obj_size(credit_data)
#> 254,336 B
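# Predictions from the butchered bag match the original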
pred_3 <- predict(bagged_3, newdata = credit_data)
pred_2 <- predict(bagged_2, newdata = credit_data)
all.equal(pred_2, pred_3)
#> [1] TRUE
Created on 2021-01-25 by the reprex package (v0.3.0)
Hello @saadaslam 👋
Sorry for taking so long to get back to you. This is definitely a good idea, and would likely be a big improvement.