Class imbalance and classification metrics with aircraft wildlife strikes | Julia Silge

Open utterances-bot opened this issue 3 years ago • 12 comments

Handling class imbalance in modeling affects classification metrics in different ways. Learn how to use tidymodels to subsample for class imbalance, and how to estimate model performance using resampling.

https://juliasilge.com/blog/sliced-aircraft/

utterances-bot avatar Jun 22 '21 01:06 utterances-bot

Hi Dr. Silge, thanks for the analysis. I have a question about the bag tree engine argument "times". How did you settle on 25 as the number of times to run the bag tree model? Is there more documentation you can link to so I can better understand this? In some of your other analyses you've used different numbers.

Can you please explain that a little further? Is the times argument also used with other tree models? Thanks.

gunnergalactico avatar Jun 22 '21 01:06 gunnergalactico

@gunnergalactico Using times = 25 is probably a bit low for really good performance with a bagged tree model. You can read this section of the excellent HOML for more background on it.
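For readers following along, here is a minimal sketch of where `times` is set, assuming the baguette package and the rpart engine. The `min_n = 10` value and the larger ensemble size of 100 are illustrative assumptions, not recommendations from this thread.

```r
library(parsnip)
library(baguette)  # provides bag_tree()

# `times` is an engine argument set in set_engine(): it is the number of
# bootstrapped trees in the bagged ensemble.
bag_spec <-
  bag_tree(min_n = 10) %>%              # min_n = 10 is an illustrative choice
  set_engine("rpart", times = 25) %>%
  set_mode("classification")

# A larger ensemble, at the cost of longer fitting time.
# The value 100 is an assumption chosen only for illustration.
bag_spec_larger <-
  bag_tree(min_n = 10) %>%
  set_engine("rpart", times = 100) %>%
  set_mode("classification")
```

Comparing the two specifications with resampling is one way to check whether the extra trees are worth the compute for a given data set.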

juliasilge avatar Jun 22 '21 03:06 juliasilge

How did the mac mini perform? I am thinking of getting one but was hesitant because I thought the new mac chips were not compatible with a lot of data science tools.

daver787 avatar Jun 27 '21 13:06 daver787

I am having a really nice time with my Mac mini @daver787, and things are FAST. I have even gotten TensorFlow working. Some pain points for me right now are a few reticulate packages where data gets passed back and forth between Python and R, and between native ARM and the Rosetta emulation mode, which can be painfully slow when you have a lot of resampling folds. I also can't get catboost natively installed on it. If I am working all in R, I am quite happy. My take is that native support in R is better than in Python as of right now.

juliasilge avatar Jun 27 '21 16:06 juliasilge

done

Ji-square avatar Jun 30 '21 02:06 Ji-square

Hello, may I ask whether step_zv should be the last preprocessing step? Should it go after step_dummy? Or step_smote? Or is the current order already okay? I ask because if I try another model, such as logistic regression, warnings about rank deficiency are thrown.

harris-yh-wong avatar Jul 06 '21 10:07 harris-yh-wong

@harris-yh-wong We outline some advice on ordering of recipe steps here that may be helpful, but it doesn't cover subsampling to address class imbalance. In general, a subsampling step should be last in your feature engineering; I think I'd do it after step_zv() (which should also be pretty late).
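A minimal sketch of that ordering on made-up data, assuming the recipes and themis packages. The `toy` data frame and its column names are hypothetical, and placing step_zv() after step_dummy() is an assumption of this sketch (dummy encoding can itself produce zero-variance columns), not something stated in the reply above.

```r
library(recipes)
library(themis)  # provides step_smote()

set.seed(123)

# Hypothetical toy data: an imbalanced outcome, two numeric predictors,
# one constant column, and one factor predictor.
toy <- data.frame(
  damaged  = factor(c(rep("no", 40), rep("damage", 10))),
  height   = runif(50, 0, 5000),
  speed    = runif(50, 50, 300),
  constant = 1,
  operator = factor(sample(c("A", "B", "C"), 50, replace = TRUE))
)

rec <- recipe(damaged ~ ., data = toy) %>%
  step_dummy(all_nominal_predictors()) %>%  # SMOTE needs numeric predictors
  step_zv(all_predictors()) %>%             # drop zero-variance columns late
  step_smote(damaged)                       # subsampling step goes last

# Subsampling steps are skipped on new data by default, so inspect the
# processed training set with new_data = NULL.
baked <- bake(prep(rec), new_data = NULL)
table(baked$damaged)
```

Keeping the subsampling step last means the synthetic rows are generated from the final set of predictors rather than from columns that a later step would remove.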

juliasilge avatar Jul 06 '21 15:07 juliasilge

This was a very interesting read. My basic knowledge of Difference between Analysis and Analytics helped me understand this in a much better way.

Chaarvi269 avatar Jul 18 '21 11:07 Chaarvi269

@juliasilge Hi Julia, in the preprocessing, you used a few steps to handle missing values in the factor variables of the training set. As far as I understand, you used step_novel to assign missing values in the training set to a new level in the testing set (if it's available), and step_unknown to assign missing values in the training set to an unknown class (also a new level). Are these two steps similar to each other, and can we use only one of them at a time to preprocess the data?

conlelevn avatar Jun 29 '22 03:06 conlelevn

You can read more about these two steps, which handle new levels (levels that are new at prediction time or in the test data, not in the training data) and missing levels:

  • https://recipes.tidymodels.org/reference/step_novel.html
  • https://recipes.tidymodels.org/reference/step_unknown.html
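As a small illustration of the difference, here is a sketch on made-up data, assuming the recipes package. The `train` and `test` data frames are hypothetical.

```r
library(recipes)

# Hypothetical data: the training set has a missing value, and the test set
# has both a level never seen in training ("C") and a missing value.
train <- data.frame(
  y        = c(1, 2, 3, 4),
  operator = factor(c("A", "B", NA, "A"))
)
test <- data.frame(
  y        = c(5, 6),
  operator = factor(c("C", NA))
)

rec <- recipe(y ~ operator, data = train) %>%
  step_novel(operator) %>%    # levels not seen in training become "new"
  step_unknown(operator) %>%  # missing values become "unknown"
  prep()

# The unseen level "C" should map to "new" and the NA to "unknown".
test_baked <- bake(rec, new_data = test)
test_baked$operator
```

The two steps address different problems, so they can be used together in one recipe, as in this sketch.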

juliasilge avatar Jun 29 '22 13:06 juliasilge

@juliasilge I guess that instead of: bird_folds <- vfold_cv(train_raw, v = 5, strata = damaged)

It should be: bird_folds <- vfold_cv(bird_df, v = 5, strata = damaged)

jrosell avatar Jan 25 '24 16:01 jrosell

@jrosell Ah yep, looks like I intended to not carry some of those other variables around through the rest of the modeling. 👍

juliasilge avatar Jan 25 '24 19:01 juliasilge