juliasilge.com
Class imbalance and classification metrics with aircraft wildlife strikes | Julia Silge
Handling class imbalance in modeling affects classification metrics in different ways. Learn how to use tidymodels to subsample for class imbalance, and how to estimate model performance using resampling.
Hi Dr. Silge, thanks for the analysis. I do have a question about the bag tree engine argument "times". How did you settle on 25 as the number of times to run the bag tree model? Is there more documentation you can link to so I can better understand this? In some of your other analyses you've used different numbers.
Can you please explain that a little further? Is the times argument also used with other tree models? Thanks.
@gunnergalactico Using times = 25 is probably a bit low for really good performance with a bagged tree model. You can read this section of the excellent HOML for more background on it.
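For context, `times` is the number of bootstrap resamples (and so the number of trees) that baguette's bag_tree() aggregates. A minimal sketch of that kind of specification, assuming the baguette package and the rpart engine; the min_n value here is purely illustrative:

```r
library(baguette)

# A bagged decision tree specification. `times` controls how many
# bootstrap samples are drawn and how many trees are averaged;
# 25 is a quick default, and larger values (say 50-100) tend to
# give a more stable ensemble at the cost of fitting time.
bag_spec <-
  bag_tree(min_n = 10) %>%
  set_engine("rpart", times = 25) %>%
  set_mode("classification")
```

Because each tree is fit to an independent bootstrap sample, increasing `times` mainly reduces the variance of the ensemble's predictions; past some point the extra trees stop helping.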
How did the Mac mini perform? I am thinking of getting one but was hesitant because I thought the new Mac chips were not compatible with a lot of data science tools.
I am having a really nice time with my Mac mini @daver787, and things are FAST. I even have gotten TensorFlow working. Some pain points for me right now are a few reticulate packages where data gets passed back and forth between Python and R between native ARM and the Rosetta emulation mode, which can be painfully slow when you have a lot of resampling folds, and I can't get catboost natively installed on it. If I am working all in R, I am quite happy. My take is that native support in R is better than in Python as of right now.
Hello, may I ask whether step_zv should be the last preprocessing step? Should it go after step_dummy? Or after step_smote? Or is the current order already okay? I ask because when I try another model, like logistic regression, warnings about rank deficiency are thrown.
@harris-yh-wong We outline some advice on ordering of recipe steps here that may be helpful, but it doesn't talk about subsampling to address class imbalance. In general, a subsampling step should be last in your feature engineering; I think I'd do it after step_zv() (which should also be pretty late).
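Putting that ordering advice together, here is one possible sketch of a recipe, assuming the themis package for step_smote(); the outcome and data names (damaged, bird_df) follow the post, while the selector calls are illustrative:

```r
library(recipes)
library(themis)

# Feature engineering with the subsampling step last:
# handle novel/missing factor levels first, then create dummies,
# then drop zero-variance columns, and only then oversample.
bird_rec <-
  recipe(damaged ~ ., data = bird_df) %>%
  step_novel(all_nominal_predictors()) %>%
  step_unknown(all_nominal_predictors()) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_smote(damaged)
```

Running step_zv() before step_smote() matters because SMOTE interpolates between existing rows, so it should see the final, cleaned predictor set; and since subsampling steps are skipped at prediction time, keeping them last keeps the rest of the pipeline identical for training and new data.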
This was a very interesting read. My basic knowledge of Difference between Analysis and Analytics helped me understand this in a much better way.
@juliasilge Hi Julia, in the preprocessing step you used a few steps to handle missing values in the factor variables of the training set. As far as I understand, you used step_novel to assign levels that appear only in the testing set to a new level, and step_unknown to assign missing values to an "unknown" class (also a new level). Are these two steps similar to each other, and can we use just one of them at a time to preprocess the data?
You can read more about these two steps, which handle new levels (levels that appear at prediction time or in the test data but not in the training data) and missing values, respectively:
- https://recipes.tidymodels.org/reference/step_novel.html
- https://recipes.tidymodels.org/reference/step_unknown.html
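A small sketch contrasting the two steps on toy data (the column name `color` and its values are made up for illustration):

```r
library(recipes)

# Training data has a missing value; test data has a
# brand-new level ("green") never seen in training.
train <- data.frame(color = factor(c("red", "blue", NA)))
test  <- data.frame(color = factor(c("red", "green", NA)))

rec <-
  recipe(~ color, data = train) %>%
  step_novel(color) %>%    # unseen levels at bake time -> "new"
  step_unknown(color) %>%  # missing values -> "unknown"
  prep()

bake(rec, new_data = test)
# "green" should come out as the "new" level,
# and the NA as the "unknown" level.
```

So the two steps are complementary rather than interchangeable: step_novel guards against levels you have never seen, while step_unknown gives missing values an explicit level, and it is common to use both.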
@juliasilge I guess that instead of: bird_folds <- vfold_cv(train_raw, v = 5, strata = damaged)
It should be: bird_folds <- vfold_cv(bird_df, v = 5, strata = damaged)
@jrosell Ah yep, looks like I intended to not carry some of those other variables around through the rest of the modeling. 👍