Tuning random forest hyperparameters with #TidyTuesday trees data | Julia Silge
I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. Today, I’m using a #TidyTuesday dataset from earlier this year on trees around San Francisco to show how to tune the hyperparameters of a random forest model and then use the final best model.
Hi, is there a way to tune a model using the testing data as well? I mean, should we train the model with the training data and then pick the best model using the test set, is that correct?
You can read more about spending your data budget in this chapter. The purpose of the testing data is to estimate performance on new data. To tune a model or pick the best model, you can use resampled data or a validation set, which we like to think of as a single resample.
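As a minimal sketch of that data budget (assuming a data frame like trees_df with the legal_status outcome from this post), you split once, hold the test set back, and create resamples only from the training set:

library(tidymodels)

set.seed(123)
trees_split <- initial_split(trees_df, strata = legal_status)
trees_train <- training(trees_split)
trees_test  <- testing(trees_split)

## tune and compare models on resamples of the training set;
## the test set is kept back for one final estimate of performance
trees_folds <- vfold_cv(trees_train)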
Hi Julia, from the last step, how can I get the confusion matrix? I can't figure it out!
@Chihui8199 You should be able to do this to create a confusion matrix for the test set:
final_res %>%
  collect_predictions() %>%
  conf_mat(legal_status, .pred_class)
Thank you so much for these great blog articles which have really helped me working with tidymodels! One question, once I've done my 'last_fit', how best to save the model and use it at a later date for predictions on new data? I can't seem to find any good resources on deploying fitted models. Thanks!
Hey Julia! I never expected that you would respond!!! That was immensely helpful! Enjoyed the guide a lot :)
@michael-hainke The output of last_fit() contains a workflow that you can use for prediction on new data. I show how to do that in this post and this post.
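As a quick, hedged sketch (the new_trees data frame here is hypothetical), you can pull the fitted workflow out of the last_fit() result and predict with it:

final_fitted <- extract_workflow(final_res)    ## fitted workflow from last_fit()
predict(final_fitted, new_data = new_trees)    ## predictions for new observations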
@juliasilge Thanks for the quick reply, this is great!
@juliasilge What does grid = 20 under tune_grid() mean, exactly? After reading the documentation I still don't quite understand. Thank you in advance :)
@Chihui8199 Setting grid = 20 says to choose 20 parameter sets automatically for the random forest model, based on what we know about random forest models and such. If you want to dig deeper into what's going on, I recommend this chapter of TMwR.
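If you would rather control the candidates yourself, you can pass a data frame of parameter combinations instead of a number. A minimal sketch (the ranges and levels here are only illustrative, not recommendations):

rf_grid <- grid_regular(
  mtry(range = c(5, 30)),
  min_n(range = c(2, 40)),
  levels = 5
)

tune_res <- tune_grid(tune_wf, resamples = trees_folds, grid = rf_grid)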
Hi Julia, I love your content, it is very helpful.
I am running this code but with my data. I have followed this tutorial, but when I run these lines
set.seed(345)
tune_res <- tune_grid(
  tune_wf,
  resamples = trees_folds,
  grid = 20
)
I get this error message: Error: To tune a model spec, you must preprocess with a formula or recipe. I tried to apply the prep() function but it doesn't work. Could you help me with this, please?
@yerazo3599 Take a look at your tune_wf object; it sounds like it does not have a formula or recipe added.
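For example, a hedged sketch (assuming a tunable model spec called tune_spec as in this post; your formula or recipe will differ) of a workflow that tune_grid() can use:

tune_wf <- workflow() %>%
  add_formula(legal_status ~ .) %>%    ## or add_recipe(your_recipe)
  add_model(tune_spec)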
To get some more detailed help, I recommend laying out a reprex and posting on RStudio Community. Good luck! 🙌
Hey Julia, loved this video very much! I just started learning R; I am doing a master's in data science and am also considering getting a PhD when I finish. My career aspiration is not to teach but just to be good at machine learning and data science. As someone who has been there, would you recommend getting a PhD or just self-learning with books and resources written by the experts? Also, getting a job in tech that pays bank wouldn't hurt :)
Generally, if your goal is to work in data science as a practitioner, I don't think a PhD is the way to go. You might check out https://shouldigetaphd.com/ for some more perspective on this!
thanks julia
Hello Julia, thanks a lot for the post and the video. Do you recommend adding the variable importance argument for training in the initial tune_spec, or updating the workflow at the final_wf step?
@guberney I believe (you can check this for yourself with your data to see if it makes a difference) that training can be slower when importance is being computed, so you may not want to include it for all tuning iterations.
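One hedged way to do that (a sketch assuming best_auc came from select_best() and tune_spec is the tunable spec from earlier) is to leave importance off while tuning and only add it to the finalized model:

final_spec <- finalize_model(tune_spec, best_auc) %>%
  set_engine("ranger", importance = "permutation")   ## compute importance only for the final fit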
Hello Julia, again thank you for your amazing work. How can I use weights in this model?
@canlikala tidymodels doesn't currently support case weights, but we are tracking interest in this issue and ideas on implementation here.
Hi Julia and everyone,
Hope you had a good weekend. I am working with large geo-referenced point data with biophysical variables (e.g. soil pH, precipitation, etc.) and would like to test for spatial autocorrelation in my data. Is there any tidy-friendly way to test for this particular kind of autocorrelation? Since it is an important element for my further machine learning regression analysis, I would need to take it into account so as not to produce algorithms/models that are erroneous in predicting the outcome variable.
Thank you
@kamaulindhardt A good place to ask a question like this is on RStudio Community. Be sure to create a reprex to show folks what kind of data you are dealing with.
Is there a specific reason you used roc_auc as a metric for tuning and not accuracy?
@nvelden I think it's generally more rare for overall accuracy to be the most useful/appropriate metric for real-world classification problems. Making a metric choice is super connected to your specific problem in its real context. You can check out metric options in tidymodels here.
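If you do want to track several metrics during tuning, tune_grid() accepts a metric set; a minimal sketch:

tune_res <- tune_grid(
  tune_wf,
  resamples = trees_folds,
  grid = 20,
  metrics = metric_set(roc_auc, accuracy, sens, spec)   ## evaluate all of these on each resample
)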
Hi Julia, your posts are helping a lot!!! I would like to know, if I have to sample a big dataset to get a representative sample, whether there is any option available in tidymodels? I had thought the rsample package would be a choice, but I do not know much about it. Thanks!
@data-datum If you want to subsample your data as part of feature engineering to balance classes, take a look at themis. If you just want to sample down overall, I'd probably use slice_sample() from dplyr.
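A tiny sketch (trees_df and the sizes here are placeholders), with either a fixed number of rows or a proportion; group first if you want the sample to respect the outcome classes:

library(dplyr)

trees_small <- trees_df %>% slice_sample(n = 10000)    ## keep 10,000 random rows
trees_small <- trees_df %>% slice_sample(prop = 0.1)   ## or keep 10% of rows

## sample within each class so the outcome proportions stay roughly the same
trees_small <- trees_df %>%
  group_by(legal_status) %>%
  slice_sample(prop = 0.1) %>%
  ungroup()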
I asked the same question at https://juliasilge.com/blog/astronaut-missions-bagging/ (so apologies for bugging you twice), but it feels so much more relevant to this blog.
Why tune mtry using cross-validation instead of out-of-bag information? It seems like OOB tuning could be very useful and helpful!
Thank you for all that you do; your screencasts are amazing!
@hardin47 I think the main reason is that the performance estimates you get if tuning on OOB samples don't always turn out well, and maybe even mtry doesn't get chosen well.
Dear Julia, thank you very much for these screencasts and useful information. I am currently working on a 2-class classification problem with a high number of predictors (~300) and a limited number of samples (150 class 1 / 300 class 2). In line with @hardin47's question, I was considering optimising the tuning parameters using the OOB errors instead of CV errors. The paper you refer to seems to support this to a certain degree, especially when the sample size is not extremely small and when using stratified subsampling (to avoid severe class imbalances in the in-bag/out-of-bag samples). Of course, tuning parameters using the OOB errors would be beneficial, as I could use more data to build the model. Also, in the literature this seems like a quite well-supported approach, mostly noting that OOB may be overly pessimistic. I know that, on the other hand, {tidymodels} focuses on 'empirical validation' (= CV). Do you have any additional thoughts on this? Would you consider tuning based on OOB errors (is that even possible in {tidymodels}?) when the number of samples is limited?
@wsteenhu We don't super fluently support getting those OOB estimates out, because we believe it is generally better practice to tune using a nested resampling scheme, but if you want to see if it works out OK in your particular setting, you might want to check out this article for how you might manually handle some of the objects/approaches involved. This article might also help you extract the bits you want to manually get at.
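If all you want is the OOB error of a single ranger fit, one hedged way to peek at it (a sketch; trees_train and the parameter values here are placeholders) is to fit the model through parsnip and read the engine object directly:

rf_fit <- rand_forest(mtry = 10, trees = 1000, min_n = 5) %>%
  set_mode("classification") %>%
  set_engine("ranger") %>%
  fit(legal_status ~ ., data = trees_train)

extract_fit_engine(rf_fit)$prediction.error   ## ranger's out-of-bag prediction error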
Hi Julia,
I am working on a multi-class classification problem. In the variable importance step, I would like to plot variable importance for each class to find out whether a variable is more important for discriminating one class from another. I used local.importance = TRUE, but it didn't work.
Thank you