Bagging with tidymodels and #TidyTuesday astronaut missions | Julia Silge
Learn how to use bootstrap aggregating to predict the duration of astronaut missions.
QQ Julia - do you have any resources or examples of tuning bagged trees with fit_resamples()? Like how do you optimize some of the hyperparameters? THANK YOU
@mwilson19 You can tune the hyperparameters pretty much like you do any other model in tidymodels; here is a small example to get you started.
Great, thank you for sharing!! So does the tuning optimization still work? I.e., in your small example, it's not bootstrapping the decision trees (times = 25) within each tuning bootstrap, or is it? Seems like that could take a long time. In your case, 5 x 25 bootstraps, or 125, then multiplied by however many candidates are in the parameter tuning grid?
I thought I remembered that tune uses the OOB samples from the bootstraps for optimization with something like random forest.
Thank you!
Oh, it is fitting the bagged tree to each bootstrap resample, which does take a little while! It certainly would for a more realistic data set. Often a bagged tree can do better than, say, an xgboost or similar model even without tuning (here is an example where that happened), but if you want to tune those hyperparameters, you do need to try them out on different resamples. You can of course use the "normal" tricks to tune faster.
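To make that concrete, here is a minimal sketch of tuning a bagged tree's hyperparameters with tune_grid() on bootstrap resamples (using mtcars as a stand-in data set, not the astronaut data from the post):
library(tidymodels)
library(baguette)
# stand-in data: predict mpg from the other mtcars columns
set.seed(123)
car_folds <- bootstraps(mtcars, times = 5)
# bagged decision tree with two tunable hyperparameters
bag_spec <-
  bag_tree(cost_complexity = tune(), min_n = tune()) %>%
  set_engine("rpart", times = 25) %>%
  set_mode("regression")
bag_wf <-
  workflow() %>%
  add_formula(mpg ~ .) %>%
  add_model(bag_spec)
# each candidate from the grid is fit to every resample,
# i.e. 5 resamples x 25 bagged trees per candidate
set.seed(234)
bag_res <-
  tune_grid(
    bag_wf,
    resamples = car_folds,
    grid = 5
  )
show_best(bag_res, metric = "rmse")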
In the code you link to (two comments up) you get "folds" from bootstrapping, and in your tune_grid() you use those bootstrap folds. Why not tune_grid() using OOB? Is there anything in tidymodels that will let you tune parameters using OOB? Can I hack some of the caret functionality to do some OOB work? It seems like double the (computational) work to do extra bootstrapping instead of letting the free OOB values provide model information. Thank you!!!
@hardin47 We don't support fluently getting those OOB samples out, because we believe it is better practice to tune using a nested scheme, but if you want to see whether it works out OK in your particular setting, you might want to check out this article for handling some of the objects/approaches involved.
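For reference, rsample can set up that kind of nested scheme directly with nested_cv(); here is a minimal sketch, again using mtcars only as a stand-in data set:
library(rsample)
# nested resampling: a set of inner bootstraps inside each outer fold
set.seed(123)
nested_folds <- nested_cv(
  mtcars,
  outside = vfold_cv(v = 5),
  inside = bootstraps(times = 25)
)
nested_folds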
@juliasilge The nested stuff is very cool, indeed. One might even say that it has advantages over simple cross validation, too. But it is going to be hard to understand the nested mechanism without first understanding cross validation (and, dare I say, OOB errors). Not having any OOB error analysis in tidymodels will make the package less useful in the classroom, and I worry that the disconnect will have negative fallout in a variety of ways. Just my two cents... although I'm going to make a feature request as a tidymodels issue. :)
@hardin47 I'm so glad you posted the issue; thank you 🙌
Hi Julia, even though you explained in the video why we don't use step_log() for the outcome variable, I still feel confused here: does it make any difference if we use step_log() rather than log() the outcome beforehand?
@conlelevn It's not different in the sense that you are log-transforming the outcome either way. It does make a difference in that you can run into problems when predicting on new data or tuning if you preprocess the outcome using feature engineering that is suited/designed for processing predictors.
@juliasilge Hmmm, it still sounds weird to me, since in the Modeling GDPR violations screencast you also use step_log() on the outcome variable rather than log()-ing it beforehand.
@conlelevn Yes, we have realized that it is a bad idea to recommend that folks use a recipe to process the outcome. We have this in some early blogs and teaching materials but have realized it causes problems for many users; we no longer recommend this. You can read more about this topic here.
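For anyone reading along, a minimal sketch of the pattern we now recommend (with a made-up data frame, sim_df) is to transform the outcome up front, before splitting, and keep the recipe for predictors only:
library(tidymodels)
# made-up example data
set.seed(123)
sim_df <- tibble(
  outcome   = rlnorm(500, meanlog = 2),
  predictor = rnorm(500)
)
# log-transform the outcome outside of any recipe
sim_df <- sim_df %>% mutate(outcome = log(outcome))
sim_split <- initial_split(sim_df)
sim_train <- training(sim_split)
# the recipe then only handles the predictors
sim_rec <-
  recipe(outcome ~ ., data = sim_train) %>%
  step_normalize(all_numeric_predictors())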
I'm having issues with bag_mars() and dummy variables. I was under the impression that bag_mars() requires categorical variables to be converted to dummy variables using a recipe(), but when I try to run it, all models fail due to an error that not all variables are present in the supplied training set. An example with fake data:
library(tidymodels)
library(baguette)
# set up fake data frame
length <- 1000
outcome1_prob <- 0.8
weight <- c("heavy", "light")
data_outcome_1 <- tibble(
  outcome = rnorm(n = length / 2, mean = 3),
  weight = sample(weight, size = length / 2, replace = TRUE, prob = c(outcome1_prob, 1 - outcome1_prob)),
  length = rnorm(n = length / 2, mean = 10)
)
data_outcome_2 <- tibble(
  outcome = rnorm(n = length / 2, mean = 1),
  weight = sample(weight, size = length / 2, replace = TRUE, prob = c(1 - outcome1_prob, outcome1_prob)),
  length = rnorm(n = length / 2, mean = 6)
)
data <- data_outcome_1 %>% bind_rows(data_outcome_2)
# train/test split
split <- initial_split(data, prop = 0.8)
training_data <- training(split)
testing_data <- testing(split)
# recipe for data
rec <- recipe(outcome ~ ., data = training_data) %>%
  step_dummy(all_nominal())

juiced <-
  rec %>%
  prep() %>%
  juice()

folds <-
  juiced %>%
  vfold_cv(v = 10)

# model specification
mars_spec <-
  bag_mars() %>%
  set_engine("earth", times = 25) %>%
  set_mode("regression")

# workflow
tune_wf <-
  workflow() %>%
  add_recipe(rec) %>%
  add_model(mars_spec)

# fit model with cross validation
res <- tune_wf %>%
  fit_resamples(
    resamples = folds,
    control = control_resamples(save_pred = TRUE, verbose = TRUE)
  )
It seems to run fine without step_dummy() but the documentation indicates that I should still use it. Any advice? Thank you.
@mrguyperson I think the problem is that you created folds from preprocessed data, not raw data. In tidymodels, you want to include both preprocessing (feature engineering) and model estimation together in a workflow(), and then apply it to your raw data, like folds <- vfold_cv(training_data). There is no need to use prep() because the workflow() takes care of it for you. You may want to look at:
- https://www.tmwr.org/workflows.html
- https://www.tmwr.org/resampling.html
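Putting that together with the objects from your reprex, a minimal sketch of the fix would be to build the folds from the raw training data and let fit_resamples() handle the recipe:
# resample the raw training data, not the juiced data
set.seed(123)
folds <- vfold_cv(training_data, v = 10)
# the workflow bundles the recipe (with step_dummy()) and the model,
# so prep()/juice() are not needed here
tune_wf <-
  workflow() %>%
  add_recipe(rec) %>%
  add_model(mars_spec)
res <- tune_wf %>%
  fit_resamples(
    resamples = folds,
    control = control_resamples(save_pred = TRUE, verbose = TRUE)
  )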
@juliasilge oh my gosh, that was it. Thank you so much!