LASSO regression using tidymodels and #TidyTuesday data for The Office | Julia Silge
I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to tune more complex models. Today, I’m using this week’s #TidyTuesday dataset on The Office to show how to build a lasso regression model and choose regularization parameters!
Hello, thank you for this nice tutorial. It's very clear and useful.
One question: you define office_prep but then it is not used. Where would you use a variable produced by the function prep()?
@duccioa You typically don't need to use prep() or bake() if you are bundling together a model and recipe in a workflow, because the workflow takes care of it under the hood, but it is good to know how to use prep() as an exploration/debugging tool. Data preprocessing doesn't always go the way you expect, so being able to do it yourself and see the output is helpful. You can read more about using recipes here.
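For example, here is a minimal sketch of that kind of exploration, assuming the recipe from the post is called office_rec:
library(tidymodels)
## prep() estimates the preprocessing steps from the training data
office_prep <- prep(office_rec)
## bake() applies the prepped recipe; new_data = NULL returns the processed training set
bake(office_prep, new_data = NULL) %>%
  glimpse()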
@juliasilge Thank you for your answer!
Hi Julia, I'm a little bit lost about tuning the penalty. Based on glmnet and this GitHub issue https://github.com/tidymodels/parsnip/issues/195, it seems that glmnet will automatically identify the optimal lambda. So I'm wondering if I need to tune the penalty?
You can read more about the underlying glmnet model here; notice that it does fit a whole range of regularization penalties when you fit() to data, but you need to use something like cross-validation to pick the best value. So yes ✅ you do need to tune the regularization penalty for glmnet.
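As a rough sketch of what that tuning looks like (the object names office_rec, office_train, and folds are assumptions, not exactly the objects in the post):
library(tidymodels)

## mixture = 1 is a pure lasso; the penalty is left to be tuned
lasso_spec <- linear_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet")

lasso_wf <- workflow() %>%
  add_recipe(office_rec) %>%
  add_model(lasso_spec)

folds <- vfold_cv(office_train)

lasso_grid <- tune_grid(
  lasso_wf,
  resamples = folds,
  grid = grid_regular(penalty(), levels = 30)
)

## pick the penalty with the best cross-validated RMSE
select_best(lasso_grid, metric = "rmse")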
Hi Julia. Thank you for this example it is helping me a lot. I have been looking into the benefits of the relaxed lasso approach and was wondering if you think it is appropriate to maintain a tidymodels workflow, with the first lasso identifying predictors to be removed, for the second lasso's recipe? So effectively just updating the recipe for the workflow and repeating the fitting with another lasso...
@mkrasmus I have not worked through this myself, but I do think this would work. I'm more familiar with this definition of relaxed lasso, where you use the lasso to do feature selection and then refit those features with no penalization (so penalty = 0 in tidymodels). I'd be super interested in seeing how you set this up!
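A minimal sketch of that idea, where final_lasso, office_train, and the outcome name rating are hypothetical, and where it's assumed the recipe keeps predictors as-is (no dummy steps), so the coefficient names match the column names:
library(tidymodels)

## step 1: which predictors did the tuned lasso keep? (non-zero coefficients)
selected_vars <- final_lasso %>%
  pull_workflow_fit() %>%
  tidy() %>%
  filter(estimate != 0, term != "(Intercept)") %>%
  pull(term)

## step 2: refit only those columns with no penalization (penalty = 0)
relaxed_data <- select(office_train, all_of(selected_vars), rating)

relaxed_fit <- workflow() %>%
  add_recipe(recipe(rating ~ ., data = relaxed_data)) %>%
  add_model(linear_reg(penalty = 0, mixture = 1) %>% set_engine("glmnet")) %>%
  fit(data = relaxed_data)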
Hi Ma'am,
I was trying to replicate the code and encountered a problem with rf_res:
Error: The resamples argument should be an 'rset' object, such as the type produced by vfold_cv() or other 'rsample' functions.
In addition: Warning message:
The ... are not used in this function but one or more objects were passed:
So do I need to change the argument, or is something else the issue?
@mandarpriya Hmmmm, this blog post does not create anything called rf_res; maybe you are looking at the wrong post?
Hi, Julia. I'm in the super early stages of trying to grasp tidymodels, and this post feels like a good introduction. Thanks for all the work you do on tidymodels and on your blog.
The block of code that plots the lasso_grid metrics crashes my RStudio session reliably. I don't see anything strange about that code snippet, but I am having trouble diagnosing the problem since the session just crashes.
Do you spot anything that might have changed syntactically since you posted this?
Thanks for any help, and my sessionInfo is below ...
sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] glmnet_4.1-1 Matrix_1.2-18 forcats_0.5.1 stringr_1.4.0 readr_1.4.0
[6] tidyverse_1.3.1 yardstick_0.0.8 workflowsets_0.0.2 workflows_0.2.2 tune_0.1.4
[11] tidyr_1.1.3 tibble_3.1.0 rsample_0.0.9 recipes_0.1.16 purrr_0.3.4
[16] parsnip_0.1.5 modeldata_0.1.0 infer_0.5.4 ggplot2_3.3.3 dplyr_1.0.5
[21] dials_0.0.9 scales_1.1.1 broom_0.7.6 tidymodels_0.1.3
loaded via a namespace (and not attached):
[1] fs_1.5.0 lubridate_1.7.10 doParallel_1.0.16 DiceDesign_1.9 httr_1.4.2
[6] tools_4.0.3 backports_1.2.1 utf8_1.2.1 R6_2.5.0 rpart_4.1-15
[11] DBI_1.1.1 colorspace_2.0-1 nnet_7.3-14 withr_2.4.2 tidyselect_1.1.0
[16] curl_4.3 compiler_4.0.3 cli_2.5.0 rvest_1.0.0 xml2_1.3.2
[21] labeling_0.4.2 digest_0.6.27 pkgconfig_2.0.3 parallelly_1.24.0 lhs_1.1.1
[26] dbplyr_2.1.1 rlang_0.4.10 readxl_1.3.1 rstudioapi_0.13 shape_1.4.5
[31] farver_2.1.0 generics_0.1.0 jsonlite_1.7.2 magrittr_2.0.1 Rcpp_1.0.6
[36] munsell_0.5.0 fansi_0.4.2 GPfit_1.0-8 lifecycle_1.0.0 furrr_0.2.2
[41] stringi_1.5.3 pROC_1.17.0.1 snakecase_0.11.0 MASS_7.3-53 plyr_1.8.6
[46] grid_4.0.3 parallel_4.0.3 listenv_0.8.0 crayon_1.4.1 lattice_0.20-41
[51] haven_2.4.0 splines_4.0.3 hms_1.0.0 pillar_1.6.0 codetools_0.2-16
[56] reprex_2.0.0 glue_1.4.2 modelr_0.1.8 vctrs_0.3.7 foreach_1.5.1
[61] schrute_0.2.2 cellranger_1.1.0 gtable_0.3.0 future_1.21.0 assertthat_0.2.1
[66] gower_0.2.2 janitor_2.1.0 prodlim_2019.11.13 class_7.3-17 survival_3.2-7
[71] timeDate_3043.102 iterators_1.0.13 hardhat_0.1.5 lava_1.6.9 globals_0.14.0
[76] ellipsis_0.3.1 ipred_0.9-11
@scottlyden Wow, that sounds super frustrating. 😩 I would take a look at collect_metrics(lasso_grid) to see if anything looks strange there, and also try autoplot(lasso_grid) to see if that crashes. It looks like your R is pretty up-to-date, so I assume your RStudio is as well? I just re-ran the code from this blog post and it did all run without crashing (although there are some differences now); I certainly would not expect anything to cause R to crash, especially when plotting. I might try running the chunk that crashes line by line (first line 1, then lines 1-2, then lines 1-3, etc.) to see what exactly is causing the crash?
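For reference, a few quick checks of that sort, assuming lasso_grid is the tuning result from the post:
library(tidymodels)

## look at the metrics as a plain tibble first
collect_metrics(lasso_grid)

## tune's built-in plot, as an alternative to a hand-rolled ggplot
autoplot(lasso_grid)

## a pared-down version of the manual plot, useful for narrowing down which layer crashes
collect_metrics(lasso_grid) %>%
  ggplot(aes(penalty, mean, color = .metric)) +
  geom_line() +
  facet_wrap(~.metric, scales = "free_y") +
  scale_x_log10()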
Hi Julia,
Thanks so much for all the tutorials/blogs/YouTube videos - I think they are such a great source of information and learning. I am doing a PhD in medicine, and with my dataset I am trying to predict an outcome (rare disease) using LASSO, following your code above (in addition to your blog on the Himalayas and class imbalance). As I understand it, feature selection is an intrinsic part of LASSO - is there any way of easily extracting which variables actually are selected? Presumably any variable that is included in your VIP graph above is selected by LASSO? It's a bit difficult to interpret though, as each row of the VIP graph seems to be a level within a variable (e.g. character Kim, Michael etc.)
Also, just wondering, is there any easy way to alter the metrics that are given with the collect_metrics() function? For some reason accuracy and roc_auc are generated when I use my data instead of the RMSE that you got above, perhaps due to the fact I am using logistic regression instead of linear?
Again thanks so much, enjoy the coffee!!!
R
@rpgately Yes, you can definitely extract the variables that the lasso algorithm selected. This chapter walks through this pretty thoroughly; look for the part where you tidy() the lasso model output.
You can set any of a multitude of metrics that are appropriate for your model using metric_set(). You can read here about using metrics to judge model effectiveness and here about setting metrics during tuning or fitting.
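Something like this rough sketch, where final_fitted, my_workflow, and my_folds are hypothetical names for your fitted workflow, workflow, and resamples, and the metrics shown are just examples for a classification model:
library(tidymodels)

## variables the lasso kept: non-zero coefficients at the chosen penalty
final_fitted %>%
  pull_workflow_fit() %>%
  tidy() %>%
  filter(estimate != 0, term != "(Intercept)")

## choose your own metrics instead of the defaults
my_metrics <- metric_set(roc_auc, sensitivity, specificity)

fit_resamples(
  my_workflow,
  resamples = my_folds,
  metrics = my_metrics
)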
Hi Julia,
Thank you for this detailed LASSO regression modelling. How would you recommend comparing model performance for an exercise where I have: 1) a linear (glmnet) model, 2) a LASSO linear model, 3) a random forest model, and 4) an XGBoost model, all predicting logRR (response ratios) from a large dataset with more than 40 variables and 10,000 rows/observations? I found something here: https://www.kirenz.com/post/2021-02-17-r-classification-tidymodels/#compare-models but he does not really explain what metrics are most appropriate to use (especially when I also have "simpler" linear models), or what visualisation options are possible when using tidy.. What metrics would you use?
Any good tips and suggestions are welcomed. Thank you!
Thanks so much Julia, that's a great help!
@kamaulindhardt I'm not totally sure what kind of model analysis you have here, since you say you are after a response ratio, which I don't think naturally falls out of a tree-based model (I think you would need a partial dependence plot or similar?). However, all that aside, the thing to do is to choose metrics appropriate to your problem, either for regression or classification. It doesn't matter that some of your models are linear and some are more complex; you can compare them using the same metrics.
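As a rough sketch of comparing on the same metrics (the workflow and resample names here are hypothetical, and the regression metrics are just examples):
library(tidymodels)

## the same regression metrics for every model
reg_metrics <- metric_set(rmse, rsq, mae)

lasso_res <- fit_resamples(lasso_wf, resamples = folds, metrics = reg_metrics)
rf_res <- fit_resamples(rf_wf, resamples = folds, metrics = reg_metrics)
xgb_res <- fit_resamples(xgb_wf, resamples = folds, metrics = reg_metrics)

## stack the resampled metrics to compare models side by side
bind_rows(
  collect_metrics(lasso_res) %>% mutate(model = "lasso"),
  collect_metrics(rf_res) %>% mutate(model = "random forest"),
  collect_metrics(xgb_res) %>% mutate(model = "xgboost")
)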
Hi again Julia,
Is there a way to plot the LASSO predictors vs. the actual outcome values? Like we have for glmnet linear models?
Cheers,
@kamaulindhardt If there are particular plots from glmnet you want to create, you can use pull_workflow_fit() to get the parsnip fit, then access the $fit object and call those glmnet functions on it.
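For example, a minimal sketch, assuming final_fitted is a hypothetical fitted workflow:
library(tidymodels)

fitted_parsnip <- pull_workflow_fit(final_fitted)  ## the parsnip fit
glmnet_obj <- fitted_parsnip$fit                   ## the underlying glmnet object

## now glmnet's own methods are available, e.g. the coefficient path plot
plot(glmnet_obj, xvar = "lambda")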
Thanks for this reply (see above). Just to clarify: I am simply using one column, "logRR", in the dataset as my outcome/dependent variable, while all other variables are predictors in my regression. So it's not that I expect RR or logRR to come out of my tree-based model 😉 A partial dependence plot could maybe be a good option for each model... but not to compare models, I guess?
RE: @mandapriya, I also get an error when creating rf_res.
rf_res <- fit_resamples(
weekly_attendance ~ .,
rf_spec,
nfl_folds,
control = control_resamples(save_pred = TRUE)
)
Error: The first argument to [fit_resamples()] should be either a model or workflow.
This blog post is not about random forest or NFL data, but the post that does cover that is fairly old, and there was a change to tune a while back so that you should now put either a workflow or a model first. Hence the error message:
The first argument to [fit_resamples()] should be either a model or workflow.
The fix is to put your model or workflow as the first argument, like this:
rf_res <- fit_resamples(
rf_spec,
weekly_attendance ~ .,
nfl_folds,
control = control_resamples(save_pred = TRUE)
)
Hi, Julia, thanks for your tutorial.
After getting rf_res from fit_resamples(), how do I get the results on nfl_test?
@ayue2019 Are you maybe thinking of a different blog post? That one does show how to predict on the test set.
I figure I messed up and not you :-) Thanks
Hi, Julia, your link is still this post... #TidyTuesday and tidymodels (https://juliasilge.com/blog/intro-tidymodels/)
Oh, weird! I was looking at the comments on the GitHub backend and they are labeled wrong somehow. My apologies!
If you have a fitted model called rf_fit, you can get predictions on the test set via predict(rf_fit, new_data = nfl_test). A resampled object like rf_res does not have a fitted model to use for prediction. You might want to look into using last_fit() and collect_metrics() as shown in some other posts, like this one.
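A short sketch of that pattern; the names rf_spec, weekly_attendance, and nfl_split are assumptions based on that other post, not necessarily the exact objects there:
library(tidymodels)

## last_fit() fits on the training portion of the split and evaluates on the test portion
final_res <- last_fit(rf_spec, weekly_attendance ~ ., split = nfl_split)

collect_metrics(final_res)      ## metrics computed on the test set
collect_predictions(final_res)  ## test-set predictions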
Yeah! Thank you very much.
Hi Julia, Thanks for the tutorial. I am trying to replicate your example on my own data and am getting an error.
Here is the code snippet where I am facing an issue (train_tbl is the training data; perc is the dependent variable with values between 0 and 1):
results_train <- lm_fit %>%
  predict(new_data = train_tbl) %>%
  mutate(truth = ts(train_tbl$perc), model = "lm") %>%
  bind_rows(
    rf_fit %>%
      predict(new_data = train_tbl) %>%
      mutate(truth = ts(train_tbl$perc), model = "rf")
  )
I am getting the following error and was not able to identify the solution based on the stackoverflow search:
Error in UseMethod("mutate") : no applicable method for 'mutate' applied to an object of class "c('double', 'numeric')"
Any chance you can help with this?
@lilugarold It's hard to say without your data but I am guessing it might be related to using the ts() function there. Can you create a reprex demonstrating your problem and post this on RStudio Community? That's a good way to share a coding problem like this and get help.
@lilugarold I think you might have a problem in predict(new_data = train_tbl) %>% mutate(...). The usual output of predict is a numeric vector, but mutate is a function to create a new column in a data.frame, so it takes a data.frame and not a numeric vector. Here you are telling mutate to create new columns 'truth' and 'model' on a numeric vector.
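For what it's worth, a minimal sketch of that pipeline without the ts() call, assuming lm_fit and rf_fit are parsnip model fits (whose predict() returns a tibble with a .pred column, so mutate() has a data frame to work with):
library(tidymodels)

results_train <- predict(lm_fit, new_data = train_tbl) %>%
  mutate(truth = train_tbl$perc, model = "lm") %>%
  bind_rows(
    predict(rf_fit, new_data = train_tbl) %>%
      mutate(truth = train_tbl$perc, model = "rf")
  )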