

Bagging with tidymodels and #TidyTuesday astronaut missions | Julia Silge

Learn how to use bootstrap aggregating to predict the duration of astronaut missions.

https://juliasilge.com/blog/astronaut-missions-bagging/

utterances-bot · Apr 24 '21

QQ Julia: do you have any resources or examples of tuning bagged trees with fit_resamples()? How do you optimize some of the hyperparameters? Thank you!

mwilson19 · Apr 24 '21

@mwilson19 You can tune the hyperparameters pretty much like you do any other model in tidymodels; here is a small example to get you started.
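
For anyone reading along, here is a minimal sketch of what that kind of tuning setup can look like. The names astro_rec and astro_train are hypothetical stand-ins for a recipe and training data (not the exact code from the linked example), and the grid/bootstrap sizes are arbitrary:

library(tidymodels)
library(baguette)

# mark the bagged-tree hyperparameters for tuning
bag_spec <- bag_tree(cost_complexity = tune(), min_n = tune()) %>%
  set_engine("rpart", times = 25) %>%
  set_mode("regression")

# astro_rec and astro_train are hypothetical stand-ins
bag_wf <- workflow() %>%
  add_recipe(astro_rec) %>%
  add_model(bag_spec)

astro_folds <- bootstraps(astro_train, times = 5)

# try a grid of candidate hyperparameter values on each resample
bag_res <- tune_grid(bag_wf, resamples = astro_folds, grid = 10)
show_best(bag_res, metric = "rmse")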

juliasilge · Apr 26 '21

Great, thank you for sharing! So does the tuning optimization still work, i.e., in your small example, it's not bootstrapping the decision trees (times = 25) within each tuning bootstrap, or is it? It seems like that could take a long time. In your case, 5 x 25 = 125 bootstraps, then multiplied by however many candidates are in the tuning grid?

I thought I remembered that tune uses some OOB-based optimization with bootstraps for something like random forest.

Thank you!

mwilson19 · Apr 26 '21

Oh, it is fitting the bagged tree to each bootstrap resample, which does take a little while! It certainly would for a more realistic data set. Often a bagged tree can do better than, say, an xgboost or similar model even without tuning (here is an example where that happened), but if you want to tune those hyperparameters, you do need to try them out on different resamples. You can of course use the "normal" tricks to tune faster.
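
For reference, one of those "normal" tricks is registering a parallel backend so the resample-by-candidate model fits run concurrently. A minimal sketch, assuming the doParallel package and the hypothetical bag_wf/astro_folds objects from the sketch above:

library(doParallel)

# register a parallel backend; tune_grid() can distribute the
# resample x candidate fits across the workers
cl <- makePSOCKcluster(parallel::detectCores() - 1)
registerDoParallel(cl)

bag_res <- tune_grid(bag_wf, resamples = astro_folds, grid = 10)

stopCluster(cl)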

juliasilge · Apr 26 '21

In the code you link to (two comments up), you get "folds" from bootstrapping, and in your tune_grid() you use those bootstrap folds.

Why not tune_grid() using OOB? Is there anything in tidymodels that will let you tune parameters using OOB? Can I hack some of the caret functionality to do some OOB work?

It seems like double the (computational) work to do extra bootstrapping instead of letting the free OOB values provide model information.

Thank you!!!

hardin47 · Nov 03 '21

@hardin47 We don't support fluently getting those OOB samples out, because we believe it is better practice to tune using a nested scheme; but if you want to see whether it works out OK in your particular setting, you might want to check out this article for handling some of the objects/approaches involved.
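
For anyone who wants to inspect those engine-level OOB values anyway, here is a rough sketch of the object handling involved, using a ranger random forest (where the OOB error is stored on the engine fit) and mtcars purely as a placeholder data set:

library(tidymodels)

rf_wf <- workflow() %>%
  add_formula(mpg ~ .) %>%
  add_model(rand_forest(mode = "regression") %>% set_engine("ranger"))

rf_fit <- fit(rf_wf, data = mtcars)

# pull the underlying ranger object out of the fitted workflow;
# its OOB mean squared error is stored in $prediction.error
extract_fit_engine(rf_fit)$prediction.error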

juliasilge · Nov 04 '21

@juliasilge The nested stuff is very cool, indeed. One might even say that it has advantages over simple cross validation, too. But it is going to be hard to understand the nested mechanism without first understanding cross validation (and, dare I say, OOB errors). Not having any OOB error analysis in tidymodels will make the package less useful in the classroom, and I worry that the disconnect will have negative fallout in a variety of ways. Just my two cents... although I'm going to make a feature request as a tidymodels issue. :)

hardin47 · Feb 21 '22

@hardin47 I'm so glad you posted the issue; thank you 🙌

juliasilge · Feb 22 '22

Hi Julia, even though you explained in the video why we don't use step_log() for the outcome variable, I still feel confused here: does it make any difference if we use step_log() rather than log() the outcome beforehand?

conlelevn · May 20 '22

@conlelevn It's not different in the sense that you are log-transforming the outcome either way. It does make a difference in that you can run into problems when predicting on new data or tuning if you preprocess the outcome using feature engineering that is suited/designed for processing predictors.

juliasilge · May 20 '22

@juliasilge Hmmm, it still sounds weird to me, since in the Modeling GDPR violations screencast you also step_log() the outcome variable rather than log() it beforehand.

conlelevn · May 20 '22

@conlelevn Yes, we have realized that it is a bad idea to recommend that folks use a recipe to process the outcome. We have this in some early blogs and teaching materials but have realized it causes problems for many users; we no longer recommend this. You can read more about this topic here.
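
In code, the pattern we now recommend is to transform the outcome once, up front, rather than inside a recipe. A minimal sketch (astronauts_df and mission_hours are hypothetical names):

library(tidymodels)

# log the outcome before any splitting, resampling, or recipes,
# instead of using step_log() on it inside a recipe
astronauts_df <- astronauts_df %>%
  mutate(mission_hours = log(mission_hours))

split <- initial_split(astronauts_df)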

juliasilge · May 22 '22

I'm having issues with bag_mars() and dummy variables. I was under the impression that bag_mars() requires categorical variables to be converted to dummy variables using a recipe(), but when I try to run it, all models fail due to an error that not all variables are present in the supplied training set. An example with fake data:

library(tidymodels)
library(baguette)   # bag_mars() lives in baguette

# set up fake data frame
length <- 1000
outcome1_prob <- 0.8
weight <- c("heavy", "light")

data_outcome_1 <- tibble(
  outcome = rnorm(n = length/2, mean = 3),
  weight = sample(weight, size = length/2, replace = TRUE, prob = c(outcome1_prob, 1 - outcome1_prob)),
  length = rnorm(n = length / 2, mean = 10)
)

data_outcome_2 <- tibble(
  outcome = rnorm(n = length/2, mean = 1),
  weight = sample(weight, size = length/2, replace = TRUE, prob = c(1 - outcome1_prob, outcome1_prob)),
  length = rnorm(n = length / 2, mean = 6)
)

data <- data_outcome_1 %>% bind_rows(data_outcome_2)

# train/test split
split <- initial_split(data, prop = 0.8)

training_data <- training(split)
testing_data <- testing(split)

# recipe for data
rec <- recipe(outcome ~ ., data = training_data) %>%
  step_dummy(all_nominal())

juiced <-
  rec %>%
  prep() %>%
  juice()

folds <-
  juiced %>%
  vfold_cv(v = 10)

# model specification
mars_spec <-
  bag_mars() %>%
  set_engine("earth", times = 25) %>%
  set_mode("regression")

# workflow
tune_wf <-
  workflow() %>%
  add_recipe(rec) %>%
  add_model(mars_spec)

# fit model with cross validation
res <- tune_wf %>%
  fit_resamples(
    resamples = folds,
    control = control_resamples(save_pred = TRUE, verbose = TRUE)
  )

It seems to run fine without step_dummy() but the documentation indicates that I should still use it. Any advice? Thank you.

mrguyperson · Aug 24 '22

@mrguyperson I think the problem is that you created your folds from preprocessed data, not raw data. In tidymodels, you want to include both preprocessing (feature engineering) and model estimation together in a workflow(), and then apply it to your raw data, like folds <- vfold_cv(training_data). There is no need to use prep() because the workflow() takes care of it for you (see the corrected sketch after the links below). You may want to look at:

  • https://www.tmwr.org/workflows.html
  • https://www.tmwr.org/resampling.html
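
A minimal corrected sketch of that part of your example (same workflow as above, but resampling the raw training data):

# make folds from the raw training data; during resampling, the
# workflow will prep() and apply the recipe inside each fold for you
folds <- vfold_cv(training_data, v = 10)

res <- tune_wf %>%
  fit_resamples(
    resamples = folds,
    control = control_resamples(save_pred = TRUE, verbose = TRUE)
  )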

juliasilge · Aug 25 '22

@juliasilge oh my gosh, that was it. Thank you so much!

mrguyperson · Aug 25 '22