
Handle class imbalance in #TidyTuesday climbing expedition data with tidymodels | Julia Silge

utterances-bot opened this issue 2 years ago • 25 comments


A data science blog

https://juliasilge.com/blog/himalayan-climbing/

utterances-bot avatar May 26 '22 03:05 utterances-bot

Hi Julia,

I'm a little bit confused about how to understand this result:

    glm_rs %>% conf_mat_resampled()

    # A tibble: 4 x 3
      Prediction Truth     Freq
    1 died       died       55.5
    2 died       survived 2157.
    3 survived   died       26.5
    4 survived   survived 3499.

Does the Freq column represent relative or absolute values? How can we interpret this table?

conlelevn avatar May 26 '22 03:05 conlelevn

You can read more about a confusion matrix to learn about this; Freq is a count of observations. You have non-integer values because this is a resampled confusion matrix.

juliasilge avatar May 26 '22 15:05 juliasilge
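
A rough sketch of where those non-integer counts come from, assuming glm_rs saved its predictions with control_resamples(save_pred = TRUE) as in the post: conf_mat_resampled() tallies one confusion matrix per resample and then averages each cell across the folds.

    library(tidymodels)

    # Count the confusion matrix cells within each resample, then
    # average each cell across folds (hence the non-integer Freq values)
    glm_rs %>%
      collect_predictions() %>%
      group_by(id) %>%
      count(Prediction = .pred_class, Truth = died, name = "Freq") %>%
      group_by(Prediction, Truth) %>%
      summarise(Freq = mean(Freq), .groups = "drop")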

Using themis::step_smote() in a recipe produces this error during cross-validation:

    x Fold6: preprocessor 1/1: Error in smote_impl(): ! Not enough observations of '4' to perform SMOTE.

How can I avoid it? The data is already stratified with:

    theSplit <- df %>% initial_split(prop = 0.8, strata = value)
    myFolds  <- vfold_cv(df_train, strata = value)

Steviey avatar Aug 04 '22 07:08 Steviey

@Steviey Do you get this error with the example data here from the blog post? Or your own data? It sounds like the dataset is too small for SMOTE perhaps?

juliasilge avatar Aug 04 '22 16:08 juliasilge

Hi Julia, nice to hear from you. I'm new to multiclass predictions with tidymodels; I get this error with my own data. I thought the whole point of SMOTE was to handle underrepresented minority classes. If I filter out the underrepresented minority class (4), SMOTE works. Is it good practice to do so?

    balanceInfo <- df_train %>% count(value)   # class counts before SMOTE
    print('balanceInfo:')
    print(balanceInfo)
    stop()                                     # halt here to inspect the counts

Unfiltered, SMOTE fails:

      value   n
    1     1 318
    2     2 113
    3     3  45
    4     4   4

Filtered, SMOTE works:

      value   n
    1     1 323
    2     2 111
    3     3  38

Steviey avatar Aug 05 '22 20:08 Steviey

@Steviey It looks like there are only 4 examples of that class? Trying to use SMOTE with that kind of data doesn't sound like a good idea to me; that's probably why there are protections against it. You will need to think through how realistic it is to build a multiclass model where one class has only 4 observations.

juliasilge avatar Aug 08 '22 02:08 juliasilge
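
A hedged sketch of how to diagnose this, assuming a training set df_train with outcome value as in the comments above: themis::step_smote() needs more same-class observations than its neighbors argument (5 by default), so a class with only 4 rows triggers the error.

    library(tidymodels)
    library(themis)

    # Check class counts before resampling; step_smote() needs more
    # same-class rows than its `neighbors` argument (default 5)
    df_train %>% count(value)

    # With a tiny minority class you could try fewer neighbors, though
    # synthesizing from only 4 real observations is rarely a good idea
    rec <- recipe(value ~ ., data = df_train) %>%
      step_smote(value, neighbors = 3)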

Ah, OK, thank you. That was my idea too. I'm restructuring the data to get a slight imbalance with enough observations in each class.

Steviey avatar Aug 08 '22 13:08 Steviey

I'm always confused about how to predict in the future (in production). Should we use the test set as 'leave one out', or should we instead produce a future frame, like with timetk::tk_make_future_timeseries()?

Steviey avatar Aug 18 '22 01:08 Steviey

@Steviey Most predictive models are not time series, even though you take the trained model and then predict in the future after the model was fitted. You might find the 3rd paragraph there especially useful for understanding.

If you are not dealing with a forecasting model, then you will use predict() for production with your trained (non-time-series) model. The features might not involve any time components at all.

juliasilge avatar Aug 18 '22 14:08 juliasilge
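
A minimal sketch of that production workflow, reusing members_wf and members from the post; new_members here is a hypothetical batch of incoming observations that simply lacks the outcome column.

    library(tidymodels)

    # Fit the finalized (non-time-series) workflow on all the labeled
    # data available, then score new observations as they arrive
    final_fit <- fit(members_wf, data = members)
    predict(final_fit, new_data = new_members)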

Thank you Julia. As I understand it, for non-time-series models in production, I fit a model with all the data I have and then predict() with the fitted model. But what should we pass to the newdata or new_data parameter of the predict() function? Does this mean I need at least one out-of-sample dataset to make a production prediction for classification and regression? And is the outcome of that prediction then a projection into the future or of the present?

Steviey avatar Aug 19 '22 17:08 Steviey

You generally fit a model to use in a predictive way when you will get new data in the future; you use the existing pool of historical data for building and evaluating your model (training and testing), and then, once that is done, you predict on new examples/observations. You might find these resources helpful:

  • https://vas3k.com/blog/machine_learning/
  • https://en.wikipedia.org/wiki/Predictive_modelling

juliasilge avatar Aug 19 '22 17:08 juliasilge

So if the partitions training + testing make up all the data I have (historical + present), what would I feed to the newdata parameter of stats::predict() when doing classification or regression?

Steviey avatar Aug 19 '22 19:08 Steviey

You would use the new data you get moving forward; people typically categorize these kinds of predictions as batch or real time/online. For discussion like this on ML in general, you may have a better experience posting on RStudio Community, which is a great forum for getting perspective on these kinds of modeling questions.

juliasilge avatar Aug 19 '22 22:08 juliasilge

I do my best... https://community.rstudio.com/t/stats-predict/145815

Steviey avatar Aug 30 '22 17:08 Steviey

Hi Julia, thanks for the post. This is really helpful. I am also a little confused interpreting the confusion matrix. Shouldn't the confusion matrix below be balanced, since you put an upsampling procedure in your recipe? I was expecting the sum of the row 1 and row 3 frequencies to be similar to the sum of the row 2 and row 4 frequencies.

    glm_rs %>% conf_mat_resampled()

    # A tibble: 4 x 3
      Prediction Truth     Freq
    1 died       died       55.5
    2 died       survived 2157.
    3 survived   died       26.5
    4 survived   survived 3499.

Thank you!

DanielYooCDC avatar Aug 19 '23 03:08 DanielYooCDC

@DanielYooCDC When you subsample (upsample or downsample), it's very important to only do that for the training data, not the testing data. This also applies within resampling, for what we call the analysis and assessment sets: only subsample the analysis set. The tidymodels functions take care of this for you, and you can read more about this here and here.

juliasilge avatar Aug 19 '23 18:08 juliasilge
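
A sketch of how this plays out during resampling, assuming the workflow and folds from the post: fit_resamples() preps the recipe (including step_smote()) fresh on each analysis set, while the assessment sets keep their original class proportions, which is why the resampled confusion matrix above is still imbalanced.

    library(tidymodels)

    # The SMOTE step runs on each analysis set only; assessment sets
    # are left at the original class proportions for honest evaluation
    glm_rs <- fit_resamples(
      members_wf,                  # workflow containing the SMOTE recipe
      resamples = members_folds,
      control = control_resamples(save_pred = TRUE)
    )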

Hi Julia, thank you for the response! That makes a whole lot of sense. I have an additional question that has been hovering around my head. I understand that step_smote only takes numeric variables, which is why you converted the categorical variables to dummy variables. After upsampling, you'd get values between 0 and 1 for the dummy variables (for example, an upsampled observation of the season_autumn variable might be 0.67). In reality, a dummy variable should be either 0 or 1. How should we justify a model that has been trained on upsampled training data where the values are far from reality? I noticed there are other upsampling methods like step_smotenc, which takes both categorical and continuous variables as input. When I tested step_smotenc without creating dummy variables against creating dummy variables before running step_smote, the results were comparable. Thank you so much for your time!

DanielYooCDC avatar Aug 20 '23 01:08 DanielYooCDC

@DanielYooCDC The new synthetic observations being created via the SMOTE algorithm aren't real anyway, so it's not a problem that they can end up with a value that is not 0 or 1. I would point out that 0.67 is not "far from reality" at all, but nicely between 0 and 1. I would expect (or at least hope!) that the various implementations of upsampling with SMOTE give you about the same results.

juliasilge avatar Aug 20 '23 22:08 juliasilge
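
A simplified sketch of the two approaches (the post's real recipe has additional preprocessing steps), assuming the members_train data from the post:

    library(tidymodels)
    library(themis)

    # SMOTE on dummy variables, as in the post; synthetic rows can
    # end up with fractional dummy values like 0.67
    rec_dummy <- recipe(died ~ ., data = members_train) %>%
      step_dummy(all_nominal_predictors()) %>%
      step_smote(died)

    # SMOTE-NC handles nominal predictors directly, so synthetic rows
    # keep real category labels and no dummy step is needed beforehand
    rec_nc <- recipe(died ~ ., data = members_train) %>%
      step_smotenc(died)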

Hi Julia, thanks again for this helpful material. I understand that subsampling should not be applied to the testing set, but I am confused about how we can use the same workflow that we applied to our training set (members_wf) in combination with last_fit() without applying the step_smote() contained within the workflow. Is there something inherent within last_fit() that prevents this from happening?

bnagelson avatar Jan 10 '24 20:01 bnagelson

@bnagelson Yep, you can read more about this here (https://www.tmwr.org/recipes#skip-equals-true), but what controls that behavior is the skip argument of each recipe step.

juliasilge avatar Jan 10 '24 21:01 juliasilge
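
A small sketch of that skip behavior, again with simplified versions of the post's recipe and data splits: step_smote() defaults to skip = TRUE, so it runs when the recipe is trained but is skipped whenever the trained recipe is applied to new data, including the test set inside last_fit().

    library(tidymodels)
    library(themis)

    rec <- recipe(died ~ ., data = members_train) %>%
      step_dummy(all_nominal_predictors()) %>%
      step_smote(died)    # skip = TRUE is the default

    prepped <- prep(rec)
    bake(prepped, new_data = NULL) %>% count(died)          # balanced: SMOTE applied
    bake(prepped, new_data = members_test) %>% count(died)  # original proportions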

Excellent, thank you very much!


bnagelson avatar Jan 10 '24 21:01 bnagelson

Hi Julia, thank you so much for your #TidyTuesday contributions. They are incredibly useful for learning from real datasets. I have a question regarding your latest visualization: I noticed the variables with the largest estimates in your model. Could you clarify whether these variables predict the status "died" or "survived"? Additionally, how can we set the binary outcome to specify which status we want to predict in the model? Thank you!

NizePetcharat avatar May 15 '24 16:05 NizePetcharat

@NizePetcharat In this case, the model coefficients are for predicting "survived" compared to "died". You can specify that by setting your factor levels by hand and/or by using the event_level argument in yardstick metrics, as shown here for sensitivity.

juliasilge avatar May 15 '24 17:05 juliasilge
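
A short sketch of the event_level argument, where members_preds is a hypothetical tibble of truth and predicted classes: yardstick treats the first factor level as the event by default, so with levels c("died", "survived") the event is "died" unless you say otherwise.

    library(tidymodels)

    # Compute sensitivity for "survived" (the second factor level)
    # instead of the default first level "died"
    members_preds %>%
      sensitivity(truth = died, estimate = .pred_class,
                  event_level = "second")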

Hi Julia, thanks again for the helpful material. I am curious: when you checked after baking members_rec, the numbers of died and survived were equal (56K and 56K). However, the resampling results in each fold show a total of 51.6K/5.4K (about 57K), which matches the original outcome proportions rather than the upsampled ones. If I am wrong, please correct me. My question is: even though we confirmed in the workflow that step_smote was applied, how can we ensure that all the upsampled data is included in the training process? Thank you for your time and assistance.

NizePetcharat avatar Jun 10 '24 14:06 NizePetcharat

@NizePetcharat You can read more about how subsampling is handled in these links:

  • https://www.tmwr.org/recipes#skip-equals-true
  • https://recipes.tidymodels.org/articles/Skipping.html
  • https://www.tidymodels.org/learn/models/sub-sampling/

When you upsample data, it is included during training but never when evaluating, testing, or estimating performance; you don't want to evaluate performance on upsampled data, but on data with the original class proportions.

juliasilge avatar Jun 10 '24 20:06 juliasilge
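
A last sketch tying this back to the fold totals above, assuming glm_rs saved its predictions during resampling: collected predictions come from the assessment sets, which are never upsampled, so each fold's class totals match the original data rather than the SMOTEd recipe output.

    library(tidymodels)

    # Per-fold class counts of the assessment-set predictions keep
    # the original imbalance, even though the training data was SMOTEd
    glm_rs %>%
      collect_predictions() %>%
      group_by(id) %>%
      count(died)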