juliasilge.com
Preprocessing and resampling using #TidyTuesday college data | Julia Silge
I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first getting started to how to tune machine learning models. Today, I’m using this week’s #TidyTuesday dataset on college tuition and diversity at US colleges to show some data preprocessing steps and how to use resampling!
Hello Julia, I notice that for Chapter 2 on the Stack Overflow Survey you use step_downsample, while for Chapter 3 on Get Out the Vote you use step_upsample. Could you help me understand when (or based on what criteria) we would use which one?
For Overflow, the class counts are:
1 Not remote 6273
2 Remote 718
For Vote, the class counts are:
1 Did not vote 264
2 Voted 6428
@tanthiamhuat There isn't an easy answer to this but I'll point you to some resources:
- https://www.tmwr.org/recipes.html#row-sampling-steps
- https://www.tidymodels.org/learn/models/sub-sampling/
- https://themis.tidymodels.org/reference/index.html
I think the real answer is that you have to try both and see what works for your data. A really rough guideline might be that if you have "a lot" of data, downsampling can be a good thing to try. The upsampling approaches can be more likely to lead to memorizing the minority class, and it can sometimes be tough to overcome that with the "fancy" approaches like SMOTE or ROSE.
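To make "try both" concrete, here is a minimal sketch of the two row-sampling steps from the themis package side by side. The data frame `toy` and its columns `remote` and `experience` are made up for illustration (they are not the survey data from the chapters), and both steps are used with their default ratios.

```r
library(tidymodels)
library(themis)

set.seed(123)

# Hypothetical imbalanced data: 900 "Not remote" rows vs. 100 "Remote" rows
toy <- tibble(
  remote = factor(rep(c("Not remote", "Remote"), times = c(900, 100))),
  experience = rnorm(1000)
)

# Downsampling: drop majority-class rows until the classes are balanced
down_rec <- recipe(remote ~ experience, data = toy) %>%
  step_downsample(remote)

# Upsampling: resample minority-class rows with replacement until balanced
up_rec <- recipe(remote ~ experience, data = toy) %>%
  step_upsample(remote)

# Both steps default to skip = TRUE, so they only touch the training data;
# bake(new_data = NULL) returns that processed training set
down_counts <- table(bake(prep(down_rec), new_data = NULL)$remote)
up_counts <- table(bake(prep(up_rec), new_data = NULL)$remote)

down_counts
up_counts
```

Either recipe can then be passed to the same model and resamples, so the two approaches can be compared on the same metrics.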
Hi Silge,
Regarding estimating model metrics, the arguments for fit_resamples() might have been updated, since it did not allow me to request a mix of metrics at the same time. Could you please tell me how to fix it?
Here is the error message:
tree_rs <- tree_spec %>%
  fit_resamples(
    uni_rec,
    folds,
    metrics = metric_set(roc_auc, sens, spec),
    control = control_resamples(save_pred = TRUE))
Error in validate_function_class():
! The combination of metric functions must be:
- only numeric metrics
- a mix of class metrics and class probability metrics
The following metric function types are being mixed:
- prob (roc_auc)
- class (sens)
- other (spec namespace:readr)
Run rlang::last_error() to see where the error occurred.
Second question: in ML, when using a linear model, do we usually care about violations of its assumptions (linear relationship, autocorrelation...)?
Thanks
@conlelevn
- I think you have some old versions of packages. I recommend updating to the latest CRAN versions of the tidymodels packages. At the worst, you could try metric_set(roc_auc, sens, yardstick::spec).
- It is definitely important to consider how the model you use can be impacted by characteristics of your data! Violations of the assumptions for a linear model will impact uncertainty and goodness of fit measures more than predictions, usually.
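As a small illustration of the namespace workaround, the sketch below builds the metric set with an explicit yardstick::spec, so that another attached package's spec (readr's, in the error above) cannot be picked up by mistake. It uses the two_class_example data that ships with yardstick rather than the college data from the post, so the numbers it produces say nothing about the original model.

```r
library(yardstick)

# Naming the namespace explicitly sidesteps masking by readr::spec
class_and_prob_metrics <- metric_set(roc_auc, sens, yardstick::spec)

# Example predictions bundled with yardstick:
# columns truth, Class1, Class2, predicted
data(two_class_example, package = "yardstick")

# Class metrics use `estimate`; the probability metric uses the Class1 column
res <- class_and_prob_metrics(
  two_class_example,
  truth = truth,
  Class1,
  estimate = predicted
)

res
```

The same metric set object can be passed to the metrics argument of fit_resamples().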
@juliasilge Thanks Julia, I fixed it by updating to newer versions of R and the tidymodels packages. BTW, please consider making a screencast on neural networks next time :) I'm really keen on learning how to build them in tidymodels. Thanks and regards