
#TidyTuesday hotel bookings and recipes | Julia Silge

Open utterances-bot opened this issue 3 years ago • 20 comments

Last week I published my first screencast showing how to use the tidymodels framework for machine learning and modeling in R. Today, I’m using this week’s #TidyTuesday dataset on hotel bookings to show how to use one of the tidymodels packages recipes with some simple models!

https://juliasilge.com/blog/hotels-recipes/

utterances-bot avatar May 04 '21 01:05 utterances-bot

Thank you for sharing these amazing techniques! I loved the skim function in particular. I got stuck on the GGally part though: I wasn't able to install it from GitHub by running library(devtools) and then install_github("ggobi/ggally").

I'm new to RStudio, but I hope to learn more from your amazing videos. Cheers,

jstello avatar May 04 '21 01:05 jstello

@jstello Try installing it straight from CRAN via install.packages("GGally")

juliasilge avatar May 04 '21 01:05 juliasilge

Hey Julia, how do you get your code to look so neat and formatted? Is there an RStudio feature that helps format your code as you type?

ntihemuka avatar May 24 '21 12:05 ntihemuka

Error: The first argument to [fit_resamples()] should be either a model or workflow.

I don't know how to shake this error, even when I copy your code exactly.

ntihemuka avatar May 24 '21 14:05 ntihemuka

@ntihemuka I do make heavy use of one of the RStudio shortcuts to reindent lines, which helps a lot with how code looks. I select all (command-A on a mac) and then reindent (command-I). You can see lots of shortcuts here: https://support.rstudio.com/hc/en-us/articles/200711853-Keyboard-Shortcuts. The other thing I do is try to follow tidyverse style (https://style.tidyverse.org/) most of the time, but I'm not perfect on that.

This blog post is older and predates a change in tune: the first argument to functions like tune_grid() or fit_resamples() now needs to be a model or a workflow, so be sure to put that first. If you want to see an updated version of this analysis, check out this Get Started article on tidymodels.org: https://www.tidymodels.org/start/case-study/.
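
As a minimal sketch of that argument order (not the code from the post): the example below assumes the tidymodels and kknn packages are installed, and it uses the built-in iris data as a stand-in for the hotel bookings so that it runs on its own. The names splits, knn_spec, and knn_res are illustrative.

```r
library(tidymodels)

set.seed(1234)

# Stand-in data (iris) so the sketch runs without the hotel bookings file
splits <- mc_cv(iris, prop = 0.9, strata = Species, times = 5)

knn_spec <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")

# Model spec (or workflow) goes first, then the preprocessor
# (a formula or recipe), then the resamples
knn_res <- fit_resamples(
  knn_spec,
  Species ~ .,
  splits,
  control = control_resamples(save_pred = TRUE)
)

collect_metrics(knn_res)
```

If the formula is passed first instead, fit_resamples() stops with the "should be either a model or workflow" error quoted above.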

juliasilge avatar May 24 '21 15:05 juliasilge

thanks!


ntihemuka avatar May 24 '21 15:05 ntihemuka

Hi Dr. Silge,

I tried this example from the website https://www.tidymodels.org/start/case-study/ and noticed an issue with the engine arguments. It appears you can't pass engine-specific arguments like num.threads or importance = "impurity" with the new workflow syntax. It does work with the old set_engine syntax.

gunnergalactico avatar Aug 18 '21 00:08 gunnergalactico

hotel_stays

gunnergalactico avatar Aug 18 '21 00:08 gunnergalactico

@gunnergalactico That is correct and as expected; you can only set engine-specific arguments within set_engine().
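
A minimal sketch of where those arguments go, assuming the tidymodels and ranger packages are installed. The iris data, the thread count of 2, and the names rf_spec and rf_wf are illustrative choices, not taken from the case study.

```r
library(tidymodels)

# Engine-specific arguments (here ranger's num.threads and importance)
# are passed through set_engine(), not through rand_forest() or the workflow
rf_spec <- rand_forest(trees = 500) %>%
  set_engine("ranger", num.threads = 2, importance = "impurity") %>%
  set_mode("classification")

# The workflow only bundles the model spec with a preprocessor
rf_wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_formula(Species ~ .)

set.seed(1234)
rf_fit <- fit(rf_wf, data = iris)
```

The main arguments that parsnip standardizes across engines (such as trees or mtry) stay in rand_forest(); anything specific to one engine belongs in set_engine().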

juliasilge avatar Aug 18 '21 17:08 juliasilge

Hi, I think that kNN is only for classification on the training data, and it shouldn't be used to predict for a new dataset (testing data). What do you think about it? Thank you and best regards

nguyenlovesrpy avatar Sep 07 '21 23:09 nguyenlovesrpy

@nguyenlovesrpy A nearest neighbor model can definitely be used to predict for a new dataset; check out examples here for both regression and classification.

juliasilge avatar Sep 08 '21 01:09 juliasilge

Hello. First of all, thank you for all these videos, they are really helpful!

I have a question about the outcome in the confusion matrix. What are we evaluating exactly? Because when I sum the observations in the CF there are 22,900 observations, whereas the test set has 18,792 and the training set has 56,374. Why is this?

Cidree avatar Sep 18 '22 16:09 Cidree

Hello again. I think I figured it out. It is because of the Monte Carlo CV, which in this case holds out 10% of the data for validation 25 times, so the predictions add up to 250% of the observations in the training set.

Cidree avatar Sep 18 '22 17:09 Cidree

Yep, those predictions that are used in the confusion matrix are from the 25-fold resampling, where the predictions are on the held out (or "assessment") observations in each resample. You may be interested in trying out the conf_mat_resampled() function.
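
A small sketch of that counting, assuming tidymodels is installed. It uses iris, a decision tree, and only 5 resamples as stand-ins to keep it quick; the names splits, tree_res, all_preds, and held_out are illustrative.

```r
library(tidymodels)

set.seed(1234)
splits <- mc_cv(iris, prop = 0.9, strata = Species, times = 5)

tree_spec <- decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")

tree_res <- fit_resamples(
  tree_spec,
  Species ~ .,
  splits,
  control = control_resamples(save_pred = TRUE)
)

# One prediction per held-out ("assessment") observation per resample,
# so the total grows with the number of resamples
all_preds <- collect_predictions(tree_res)
held_out <- sum(map_int(splits$splits, ~ nrow(assessment(.x))))
nrow(all_preds) == held_out

# Cell counts averaged across resamples rather than summed
conf_mat_resampled(tree_res)
```

A confusion matrix built directly from the collected predictions sums over every resample, which is why its total can exceed the size of the data that was resampled.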

juliasilge avatar Sep 18 '22 19:09 juliasilge

Hi Julia, how does the knn model choose the number of neighbors k? Does the model use a default value?

ghost avatar Dec 26 '22 15:12 ghost

@rcientificos You can check out details like that in the documentation for nearest_neighbor().

juliasilge avatar Dec 26 '22 16:12 juliasilge

Thank you! What is the alternative to step_downsample in recipes? Or do I have to use the themis package?

ghost avatar Dec 26 '22 22:12 ghost

@rcientificos Yes, that's right. The step_downsample() function moved from recipes to themis.
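
A minimal sketch of using the step after the move, assuming the tidymodels and themis packages are installed. The toy data frame and its column names are made up for illustration and only loosely mimic the hotel data.

```r
library(tidymodels)
library(themis)  # provides step_downsample()

set.seed(42)

# Hypothetical imbalanced data: 90 "none" rows vs 10 "children" rows
toy <- tibble(
  children = factor(c(rep("none", 90), rep("children", 10))),
  adr = rnorm(100, mean = 100, sd = 20)
)

toy_rec <- recipe(children ~ ., data = toy) %>%
  step_downsample(children)

# bake(new_data = NULL) returns the processed training data,
# with the majority class sampled down to match the minority class
balanced <- toy_rec %>%
  prep() %>%
  bake(new_data = NULL)

count(balanced, children)
```

Because the step is skipped by default when baking new data, it rebalances only the training set and leaves test data untouched.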

juliasilge avatar Dec 27 '22 08:12 juliasilge

Hello Julia,

I noticed that you use the juiced data when you make the resamples in this vlog:

mc_cv(juice(hotel_rec), prop = 0.9, strata = children)

Am I correct that, to avoid leakage caused by step_normalize() in the recipe, it would be best to feed mc_cv() the unprocessed hotel_train data and then use the recipe when you fit the resamples?

It is a small point but I am thinking this is the modern simple example code:

library(tidymodels)
library(themis)  # step_downsample() now lives in themis

# I changed the juiced, prepped data to the full unprocessed training data
validation_splits <- mc_cv(hotel_train, prop = 0.9, strata = children)  

knn_spec <- nearest_neighbor() %>%
  set_engine("kknn") %>%
  set_mode("classification")
  
hotel_rec <- recipe(children ~ ., data = hotel_train) %>%
  step_downsample(children) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_numeric()) %>%
  step_normalize(all_numeric()) 
  
# use full recipe and unprocessed resampled data  
knn_res <- fit_resamples(
  knn_spec,
  hotel_rec,  # use full recipe here vs just children ~ .,
  validation_splits,  # not pre-baked splits
  control = control_resamples(save_pred = TRUE)
)  

Do I have this right?

RaymondBalise avatar Jan 14 '24 14:01 RaymondBalise

Yes @RaymondBalise that's right. You can see that the article here using the same hotel data takes an approach more like what you describe than what I have here.

juliasilge avatar Jan 14 '24 21:01 juliasilge