Tuning random forest hyperparameters with #TidyTuesday trees data | Julia Silge
I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. Today, I’m using a #TidyTuesday dataset from earlier this year on trees around San Francisco to show how to tune the hyperparameters of a random forest model and then use the final best model.
Hi, is there a way to tune a model using the testing data as well? I mean, should we train the model with the training data and then pick the best model using the test set, is that correct?
You can read more about spending your data budget in this chapter. The purpose of the testing data is to estimate performance on new data. To tune a model or pick the best model, you can use resampled data or a validation set, which we like to think of as a single resample.
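As a minimal sketch of that data budget (assuming a data frame like trees_df with the legal_status outcome from this post), you split once, hold the test set back, and create resamples only from the training set:

library(tidymodels)

set.seed(123)
trees_split <- initial_split(trees_df, strata = legal_status)
trees_train <- training(trees_split)
trees_test  <- testing(trees_split)

## tune and compare models on resamples of the training set;
## the test set is kept back for one final estimate of performance
trees_folds <- vfold_cv(trees_train)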
Hi Julia, from the last step, how can I get the confusion matrix? I can't figure it out!
@Chihui8199 You should be able to do this to create a confusion matrix for the test set:
final_res %>%
  collect_predictions() %>%
  conf_mat(legal_status, .pred_class)
Thank you so much for these great blog articles which have really helped me working with tidymodels! One question, once I've done my 'last_fit', how best to save the model and use it at a later date for predictions on new data? I can't seem to find any good resources on deploying fitted models. Thanks!
Hey Julia! I never expected that you would respond!!! That was immensely helpful! Enjoyed the guide a lot :)
@michael-hainke The output of last_fit() contains a workflow that you can use for prediction on new data. I show how to do that in this post and this post.
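As a quick, hedged sketch (the new_trees data frame here is hypothetical), you can pull the fitted workflow out of the last_fit() result and predict with it:

final_fitted <- extract_workflow(final_res)    ## fitted workflow from last_fit()
predict(final_fitted, new_data = new_trees)    ## predictions for new observations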
@juliasilge Thanks for the quick reply, this is great!
@juliasilge What does grid = 20 under tune_grid() mean, exactly? After reading the documentation I still don't quite understand. Thank you in advance :)
@Chihui8199 Setting grid = 20 says to choose 20 parameter sets automatically for the random forest model, based on what we know about random forest models and such. If you want to dig deeper into what's going on, I recommend this chapter of TMwR.
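If you would rather control the candidates yourself, you can pass a data frame of parameter combinations instead of a number. A minimal sketch (the ranges and levels here are only illustrative, not recommendations):

rf_grid <- grid_regular(
  mtry(range = c(5, 30)),
  min_n(range = c(2, 40)),
  levels = 5
)

tune_res <- tune_grid(tune_wf, resamples = trees_folds, grid = rf_grid)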
Hi Julia, I love your content, it is very helpful.
I am running this code but with my data. I have followed this tutorial, but when I run these lines
set.seed(345)
tune_res <- tune_grid(
  tune_wf,
  resamples = trees_folds,
  grid = 20
)
I get this error message: Error: To tune a model spec, you must preprocess with a formula or recipe. I tried to apply the prep() function but it doesn't work. Could you help me with this, please?
@yerazo3599 Take a look at your tune_wf object; it sounds like it does not have a formula or recipe added.
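For example, a hedged sketch (assuming a tunable model spec called tune_spec as in this post; your formula or recipe will differ) of a workflow that tune_grid() can use:

tune_wf <- workflow() %>%
  add_formula(legal_status ~ .) %>%    ## or add_recipe(your_recipe)
  add_model(tune_spec)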
To get some more detailed help, I recommend laying out a reprex and posting on RStudio Community. Good luck! 🙌
Hey Julia, loved this video very much! I just started learning R; I am doing a master's in data science and am also considering getting a PhD when I finish. My career aspiration is not to teach but just to be good at machine learning and data science. As someone who has been there, would you recommend getting a PhD or just self-learning with books and resources written by the experts? Also, getting a job in tech that pays bank wouldn't hurt :)
Generally, if your goal is to work in data science as a practitioner, I don't think a PhD is the way to go. You might check out https://shouldigetaphd.com/ for some more perspective on this!
thanks julia
Hello Julia, thanks a lot for the post and the video. Do you recommend adding the variable importance argument for training in the initial tune_spec, or updating the workflow at the final_wf step?
@guberney I believe (you can check this for yourself with your data to see if it makes a difference) that training can be slower when importance is being computed, so you may not want to include it for all tuning iterations.
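One hedged way to do that (a sketch assuming best_auc came from select_best() and tune_spec is the tunable spec from earlier) is to leave importance off while tuning and only add it to the finalized model:

final_spec <- finalize_model(tune_spec, best_auc) %>%
  set_engine("ranger", importance = "permutation")   ## compute importance only for the final fit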
Hello Julia, again thank you for your amazing work. How can I use weights in this model?
@canlikala tidymodels doesn't currently support case weights, but we are tracking interest in this issue and ideas on implementation here.
Hi Julia and everyone,
Hope you had a good weekend. I am working with large geo-referenced point data with biophysical variables (e.g. soil pH, precipitation, etc.) and would like to test for spatial autocorrelation in my data. Is there any tidy-friendly way to test for this particular kind of autocorrelation? Since it is an important element for my further machine learning regression analysis, I would need to take it into account so as not to produce algorithms/models that are erroneous in predicting the outcome variable.
Thank you
@kamaulindhardt A good place to ask a question like this is on RStudio Community. Be sure to create a reprex to show folks what kind of data you are dealing with.
Is there a specific reason you used roc_auc as a metric for tuning and not accuracy?
@nvelden I think it's generally more rare for overall accuracy to be the most useful/appropriate metric for real-world classification problems. Making a metric choice is super connected to your specific problem in its real context. You can check out metric options in tidymodels here.
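If you do want to track several metrics during tuning, tune_grid() accepts a metric set; a minimal sketch:

tune_res <- tune_grid(
  tune_wf,
  resamples = trees_folds,
  grid = 20,
  metrics = metric_set(roc_auc, accuracy, sens, spec)   ## evaluate all of these on each resample
)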
Hi Julia, your posts are helping a lot!!! I would like to know, if I have to sample a big dataset to get a representative sample, whether there is any option available in tidymodels? I had thought the rsample package would be a choice, but I do not know much about it. Thanks!
@data-datum If you want to subsample your data as part of feature engineering to balance classes, take a look at themis. If you just want to sample down overall, I'd probably use slice_sample() from dplyr.
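A tiny sketch (trees_df and the sizes here are placeholders), with either a fixed number of rows or a proportion; group first if you want the sample to respect the outcome classes:

library(dplyr)

trees_small <- trees_df %>% slice_sample(n = 10000)    ## keep 10,000 random rows
trees_small <- trees_df %>% slice_sample(prop = 0.1)   ## or keep 10% of rows

## sample within each class so the outcome proportions stay roughly the same
trees_small <- trees_df %>%
  group_by(legal_status) %>%
  slice_sample(prop = 0.1) %>%
  ungroup()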
I asked the same question at https://juliasilge.com/blog/astronaut-missions-bagging/ (so apologies for bugging you twice), but it feels so much more relevant to this blog.
Why tune mtry using cross-validation instead of out-of-bag information? It seems like OOB tuning could be very useful and helpful!
Thank you for all that you do; your screencasts are amazing!
@hardin47 I think the main reason is that the performance estimates you get if tuning on OOB samples don't always turn out well, and maybe even mtry doesn't get chosen well.
Dear Julia, thank you very much for these screencasts and useful information. I am currently working on a 2-class classification problem with a high number of predictors (~300) and a limited number of samples (150 class 1 / 300 class 2). In line with @hardin47's question, I was considering optimising the tuning parameters using the OOB errors instead of CV errors. The paper you refer to seems to support this to a certain degree, especially when the sample size is not extremely small and when using stratified subsampling (to avoid severe class imbalances in the in-bag/out-of-bag samples). Of course, tuning parameters using the OOB errors would be beneficial, as I could use more data to build the model. Also, in the literature this seems like a quite well-supported approach, mostly noting that OOB may be overly pessimistic. I know that, on the other hand, {tidymodels} focuses on 'empirical validation' (= CV). Do you have any additional thoughts on this? Would you consider tuning based on OOB errors (is that even possible in {tidymodels}?) when the number of samples is limited?
@wsteenhu We don't super fluently support getting those OOB estimates out, because we believe it is generally better practice to tune using a nested resampling scheme, but if you want to see if it works out OK in your particular setting, you might want to check out this article for how you might manually handle some of the objects/approaches involved. This article might also help you extract the bits you want to manually get at.
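If all you want is the OOB error of a single ranger fit, one hedged way to peek at it (a sketch; trees_train and the parameter values here are placeholders) is to fit the model through parsnip and read the engine object directly:

rf_fit <- rand_forest(mtry = 10, trees = 1000, min_n = 5) %>%
  set_mode("classification") %>%
  set_engine("ranger") %>%
  fit(legal_status ~ ., data = trees_train)

extract_fit_engine(rf_fit)$prediction.error   ## ranger's out-of-bag prediction error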
Hi Julia,
I am working on a multi-class classification problem. In the variable importance step, I would like to plot variable importance for each class to find out whether a variable is more important for discriminating one class from another. I used local.importance = TRUE, but it didn't work.
Thank you