
Preprocessing and resampling using #TidyTuesday college data | Julia Silge

Open utterances-bot opened this issue 3 years ago • 5 comments

I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first getting started to how to tune machine learning models. Today, I’m using this week’s #TidyTuesday dataset on college tuition and diversity at US colleges to show some data preprocessing steps and how to use resampling!

https://juliasilge.com/blog/tuition-resampling/

utterances-bot, Sep 13 '21 01:09

Hello Julia, I notice that for Chapter 2 on the Stack Overflow survey you use step_downsample, while for Chapter 3 on Get Out the Vote you use step_upsample. Could you help me understand when (or based on what criteria) we would use which one?

For Stack Overflow, the class counts are:

  • Not remote: 6273
  • Remote: 718

For Get Out the Vote, they are:

  • Did not vote: 264
  • Voted: 6428

tanthiamhuat, Sep 13 '21 01:09

@tanthiamhuat There isn't an easy answer to this but I'll point you to some resources:

  • https://www.tmwr.org/recipes.html#row-sampling-steps
  • https://www.tidymodels.org/learn/models/sub-sampling/
  • https://themis.tidymodels.org/reference/index.html

I think the real answer is that you have to try both and see what works for your data. A really rough guideline might be that if you have "a lot" of data, downsampling can be a good thing to try. The upsampling approaches can be more likely to lead to memorizing the minority class, and it can sometimes be tough to overcome that with the "fancy" approaches like SMOTE or ROSE.
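To make the two options concrete, here is a minimal sketch of both sampling steps from the themis package. The `toy` data frame and its `x` predictor are made up for illustration; only the class counts are borrowed from the Stack Overflow example above.

```r
library(recipes)
library(themis)

# Hypothetical imbalanced data: same class counts as the Stack Overflow
# example in this thread, with one made-up numeric predictor
set.seed(123)
toy <- data.frame(
  remote = factor(c(rep("Not remote", 6273), rep("Remote", 718))),
  x = rnorm(6273 + 718)
)

# Downsampling: drop majority-class rows until the classes are balanced
down_rec <- recipe(remote ~ x, data = toy) %>%
  step_downsample(remote)

# Upsampling: replicate minority-class rows until the classes are balanced
up_rec <- recipe(remote ~ x, data = toy) %>%
  step_upsample(remote)

# bake(new_data = NULL) returns the processed training set
down_counts <- table(bake(prep(down_rec), new_data = NULL)$remote)
up_counts <- table(bake(prep(up_rec), new_data = NULL)$remote)

print(down_counts) # both classes at the minority size (718)
print(up_counts)   # both classes at the majority size (6273)
```

Both steps default to `skip = TRUE`, so they only change the training data and leave assessment or test sets at their original class balance. Swapping one recipe for the other inside the same resampling workflow is one way to "try both" and compare metrics.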

juliasilge, Sep 13 '21 17:09

Hi Silge,

Regarding estimating model metrics: the metrics argument for fit_resamples() might have been updated, since it did not allow me to request a mix of metrics at the same time. Could you please tell me how to fix it?

Here are the code and the error message:

    tree_rs <- tree_spec %>%
      fit_resamples(
        uni_rec,
        folds,
        metrics = metric_set(roc_auc, sens, spec),
        control = control_resamples(save_pred = TRUE)
      )

Error in validate_function_class():
! The combination of metric functions must be:

  • only numeric metrics
  • a mix of class metrics and class probability metrics

The following metric function types are being mixed:

  • prob (roc_auc)
  • class (sens)
  • other (spec <namespace:readr>)

Run rlang::last_error() to see where the error occurred.

Second question: in ML, when using a linear model, do we usually care about violations of its assumptions (linear relationship, autocorrelation...)?

Thanks

conlelevn, Apr 28 '22 08:04

@conlelevn

  • I think you have some old versions of packages. I recommend updating to the latest CRAN versions of the tidymodels packages. At the worst, you could try metric_set(roc_auc, sens, yardstick::spec).
  • It is definitely important to consider how the model you use can be impacted by characteristics of your data! Violations of the assumptions for a linear model will impact uncertainty and goodness of fit measures more than predictions, usually.
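The error message reports `spec` as coming from `namespace:readr`, which suggests the bare name resolved to readr's `spec()` rather than the yardstick metric. A minimal sketch of the namespaced workaround, using yardstick's built-in `two_class_example` data as a stand-in for the resampling results in the question:

```r
library(yardstick)

# Namespacing the metric makes sure metric_set() gets yardstick's
# specificity metric, whatever else is attached and masking `spec`
uni_metrics <- metric_set(roc_auc, sens, yardstick::spec)

# Stand-in data shipped with yardstick: `truth` and `predicted` are the
# class columns, `Class1` is the predicted probability of the first level
data("two_class_example", package = "yardstick")

res <- uni_metrics(
  two_class_example,
  truth = truth,
  Class1,               # probability column, used by roc_auc
  estimate = predicted  # hard class predictions, used by sens and spec
)
print(res)
```

The same `uni_metrics` object can be passed as `metrics = uni_metrics` to `fit_resamples()`. Another option that may help, assuming a reasonably recent tidymodels release, is calling `tidymodels::tidymodels_prefer()` after loading packages so that tidymodels functions win common name conflicts.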

juliasilge, Apr 28 '22 13:04

@juliasilge Thanks Julia, I fixed it by updating to newer versions of R and the tidymodels packages. BTW, please consider making a screencast on neural networks next time :) I'm really keen on learning how to build one in tidymodels. Thanks and regards

conlelevn, Apr 29 '22 03:04