#TidyTuesday hotel bookings and recipes | Julia Silge
Last week I published my first screencast showing how to use the tidymodels framework for machine learning and modeling in R. Today, I'm using this week's #TidyTuesday dataset on hotel bookings to show how to use one of the tidymodels packages, recipes, with some simple models!
Thank you for sharing these amazing techniques! I loved the skim function in particular. I got stuck on the GGally part though; I wasn't able to install it by running library(devtools) and then install_github("ggobi/ggally").
I'm new to RStudio, but I hope to learn more from your amazing videos. Cheers,
@jstello Try installing it straight from CRAN via install.packages("GGally")
Hey Julia, how do you get your code to look so neat and formatted? Is there an RStudio functionality that helps format your code as you type?
Error: The first argument to [fit_resamples()] should be either a model or workflow.
I don't know how to shake this error, even when I copy your code exactly.
@ntihemuka I do make heavy use of one of the RStudio shortcuts to reindent lines, which helps a lot with how code looks. I select all (Command-A on a Mac) and then reindent (Command-I). You can see lots of shortcuts here. The other thing I do is try to follow tidyverse style most of the time, but I'm not perfect on that.
This blog post is older and predates a change in tune where now the first argument to functions like tune_grid() or fit_resamples() needs to be a model or a workflow; be sure to put that first now. If you want to see an updated version of this analysis, check out this Get Started article on tidymodels.org.
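For example, here is a minimal sketch of the newer calling convention, assuming the knn_spec, hotel_rec, and validation_splits objects as defined in this post:

library(tidymodels)

# bundle the recipe and model into a workflow, which now comes first
knn_wf <- workflow() %>%
  add_recipe(hotel_rec) %>%
  add_model(knn_spec)

knn_res <- fit_resamples(knn_wf, resamples = validation_splits)

Passing the model spec directly, as in fit_resamples(knn_spec, hotel_rec, validation_splits), also works; the key change is that the model or workflow is the first argument.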
thanks!
Hi Dr. Silge,
I tried this example from the website https://www.tidymodels.org/start/case-study/ and noticed an issue with the engine arguments. It appears you can't pass engine-specific arguments like num.threads or importance = "impurity" with the new workflow syntax. It does work with the old set_engine() syntax.
@gunnergalactico That is correct and as expected; you can only set engine-specific arguments within set_engine().
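For example, a minimal sketch assuming a ranger-based random forest (rf_spec is just an illustrative name):

library(parsnip)

# engine-specific arguments like num.threads and importance go inside
# set_engine(), not in the main model function or the workflow
rf_spec <- rand_forest(mode = "classification", trees = 1000) %>%
  set_engine("ranger", num.threads = 4, importance = "impurity")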
Hi, I just think that KNN is only for classification on training data, and it shouldn't be used to predict for a new dataset (testing data). What do you think about it? Thank you and best regards
@nguyenlovesrpy A nearest neighbor model can definitely be used to predict for a new dataset; check out examples here for both regression and classification.
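For example, a minimal sketch using mtcars as a stand-in dataset (this requires the kknn package to be installed):

library(tidymodels)

# the same model type supports both regression and classification modes
knn_reg <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("regression") %>%
  fit(mpg ~ ., data = mtcars)

# predicting on new data works as usual
predict(knn_reg, new_data = mtcars[1:3, ])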
Hello. First of all thank you for all these videos, there are really helpful!
I have a question about the outcome in the confusion matrix. What are we evaluating exactly? When I sum the observations in the confusion matrix there are 22,900 observations, whereas the test set has 18,792 and the training set has 56,374. Why is this?
Hello again. I think I figured it out. It is because of the Monte Carlo CV, which in this case uses 10% of the data as validation, 25 times, so we end up with 250% of the observations of the training set.
Yep, those predictions that are used in the confusion matrix are from the 25 resamples, where the predictions are on the held-out (or "assessment") observations in each resample. You may be interested in trying out the conf_mat_resampled() function.
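For example, a minimal sketch assuming knn_res was fit with control_resamples(save_pred = TRUE) as in this post:

library(tune)

# average the confusion matrix cells across the 25 resamples
knn_res %>%
  conf_mat_resampled(tidy = FALSE)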
Hi Julia, how does the KNN model estimate the correct number of neighbors k? Does the model use a default value?
@rcientificos You can check out details like that in the documentation for nearest_neighbor().
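For example, a minimal sketch setting the number of neighbors explicitly instead of relying on the engine default:

library(parsnip)

knn_spec <- nearest_neighbor(neighbors = 5) %>%
  set_engine("kknn") %>%
  set_mode("classification")

You could also set neighbors = tune() and choose a value via resampling.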
Thank you! What is the alternative for step_downsample() in recipes? Or do I have to use the themis package?
@rcientificos Yes, that's right. The step_downsample() function moved from recipes to themis.
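For example, a minimal sketch assuming the hotel_train data from this post:

library(recipes)
library(themis)  # step_downsample() now lives here

hotel_rec <- recipe(children ~ ., data = hotel_train) %>%
  step_downsample(children)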
Hello Julia,
I noticed that you use the juiced data when you make the resamples in this vlog:
mc_cv(juice(hotel_rec), prop = 0.9, strata = children)
Am I correct that, to avoid leakage caused by step_normalize() in the recipe, it would be best to feed mc_cv() the unprocessed hotel_train data and then use the recipe when you fit the resamples?
It is a small point, but I think this is the modern, simple example code:
# I changed the juiced, prepped data to be the full untrained data
library(tidymodels)
library(themis)  # for step_downsample()

validation_splits <- mc_cv(hotel_train, prop = 0.9, strata = children)

knn_spec <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")

hotel_rec <- recipe(children ~ ., data = hotel_train) %>%
  step_downsample(children) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_numeric()) %>%
  step_normalize(all_numeric())

# use the full recipe and unprocessed resampled data
knn_res <- fit_resamples(
  knn_spec,
  hotel_rec,          # use the full recipe here vs. just children ~ .
  validation_splits,  # not pre-baked splits
  control = control_resamples(save_pred = TRUE)
)
Do I have this right?
Yes @RaymondBalise, that's right. You can see that the article here, which uses the same hotel data, takes an approach more like what you describe than what I have here.