Multinomial classification with tidymodels and #TidyTuesday volcano eruptions | Julia Silge
Lately I’ve been publishing screencasts demonstrating how to use the tidymodels framework, from first steps in modeling to how to evaluate complex models. Today’s screencast demonstrates how to implement multiclass or multinomial classification using this week’s #TidyTuesday dataset on volcanoes.
Thank you so much Dr Silge, this is exactly what I've been hunting for!
Thanks Julia,
In this model I didn't see you tune hyperparameters for the RF model; is there a specific reason for that? In practice, do you usually see a significant difference in model performance before and after tuning?
@conlelevn Random forest models tend to perform pretty well without tuning, as long as you use "enough" trees (like 1000 or so). You can tune a random forest if you want to eke out a little more performance; I demonstrate how to do that here, but typically you don't see dramatic improvement (unlike when you tune an xgboost model).
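If you did want to try it, the general pattern looks something like this sketch; the object names (volcano_train, volcano_folds, and so on) are placeholders rather than code from this post:
library(tidymodels)

rf_spec <- rand_forest(trees = 1000, mtry = tune(), min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_wf <- workflow() %>%
  add_formula(volcano_type ~ .) %>%
  add_model(rf_spec)

set.seed(123)
rf_res <- tune_grid(
  rf_wf,
  resamples = volcano_folds,   # e.g. bootstraps() or vfold_cv() on the training data
  grid = 11                    # a small space-filling grid over mtry and min_n
)

show_best(rf_res, metric = "roc_auc")
Even then, the gains over a default forest with plenty of trees are usually modest.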
Hi Julia,
Many thanks for the wonderful blog post! In your example you showed how the recipe works on the training data as a whole (since you don't tune hyperparameters). I am wondering if you can shed some light on how the recipe preprocessing is applied within a resampling object used for hyperparameter tuning?
For example, given a nested_cv() process where each training set from the outer loop is used to generate a resampling object for hyperparameter tuning, how can I confirm that the upsampling is working properly, i.e. applied only to the analysis sets and not the assessment sets?
@Wenyu1024 You can read about how preprocessing works over resamples (in the context of parallel processing) in this section of Tidy Modeling with R; note the difference between parallel_over = "resamples" and parallel_over = "everything". If you are tuning in serial, it will, as expected, preprocess and then fit for the resamples sequentially.
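If it helps, here is a minimal sketch of where that option goes (rf_wf and volcano_folds are placeholder objects, and this assumes a doParallel backend):
library(tidymodels)
library(doParallel)

registerDoParallel(cores = 4)   # register a parallel backend for tuning

rf_res <- tune_grid(
  rf_wf,                        # a workflow containing a recipe and a tunable model
  resamples = volcano_folds,
  grid = 11,
  control = control_grid(parallel_over = "resamples")   # or "everything"
)
With "resamples", each worker handles one resample (preprocessing once, then fitting all candidate models); with "everything", tasks are split by resample and parameter combination, so preprocessing may be repeated in exchange for more tasks running at once.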
If you are using a nested resampling scheme, then you will need to set some of this up yourself, as outlined here.
Very useful post as always, Dr. Silge. I have learned a lot about tidymodels from your posts. Thank you very much!
Hello Julia.
I was wondering how to use the vip "permute" method discussed here (https://github.com/koalaverse/vip/issues/131) with multiple classes, like the volcano data? Is it possible with metric = "mauc" and then somehow specifying the pred_fun to average over the classes; or would I need to use prediction = FALSE and metric = "accuracy"; or something else entirely?
Many thanks :-)
@smithhelen Hmmm, I'm not sure. Can you create a small reprex (a minimal reproducible example) for this? The goal of a reprex is to make it easier for folks to recreate your problem so that we can understand it and/or fix it. If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. Once you have a reprex, I recommend posting on RStudio Community, which is a great forum for getting help with these kinds of modeling questions. Thanks! 🙌
Thank you Julia
I'll make my question a bit clearer :-)
In this volcano example you generate vi scores using the built-in importance = "permutation" option via set_engine(). Even though a probability forest (rather than a classification forest) is grown, these vi scores are measured from the change in classification accuracy (as per the ranger documentation).
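For reference, the pattern I mean is roughly this (a sketch with placeholder names like volcano_train, not the exact code from the post):
library(tidymodels)
library(vip)

rf_spec <- rand_forest(trees = 1000) %>%
  set_engine("ranger", importance = "permutation") %>%   # ranger computes vi scores during fitting
  set_mode("classification")

rf_fit <- workflow() %>%
  add_formula(volcano_type ~ .) %>%
  add_model(rf_spec) %>%
  fit(data = volcano_train)

rf_fit %>%
  extract_fit_parsnip() %>%
  vip(geom = "point")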
In a different example, using bivariate data, you generate vi scores using the method = "permute" option in the vip package and do not specify importance = "permutation" within set_engine(). Now, for a probability forest, the vi scores will be calculated using the AUC method (metric = "auc"), versus metric = "accuracy" for a classification forest (i.e. when set_engine(..., probability = FALSE)). For the AUC method, a reference class needs to be specified for both the pred_wrapper and vi().
Here is your code for the bivariate data, where you choose the reference class to be "One" (i.e. $.pred_One and reference_class = "One"):
pred_fun <- function(object, newdata) {
  predict(object, new_data = newdata, type = "prob")$.pred_One
}

ranger_fit %>%
  vi(method = "permute", target = "Class", metric = "auc", nsim = 10,
     pred_wrapper = pred_fun, train = bivariate_train, reference_class = "One")
An advantage of using the vip approach is that multiple simulations can be run and a boxplot produced.
My questions:
1. Is it possible to use the vip(method = "permute", ...) approach to calculate vi scores when there are more than two classes (as for the volcano example)? If so, what would the reference_class be?
2. If 1. is not possible, is it sensible to grow a classification forest and use vip with metric = "accuracy" instead?
Thank you!
Ah OK @smithhelen, I don't know how/if the vip package "permute" method works for multinomial classification (although you can ask over at the vip GH repo, so maybe they can clarify). You probably want to use something like DALEX instead; you can read more about using DALEX with tidymodels here.
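The general shape of that DALEX approach is something like this rough, untested sketch (rf_fit, volcano_train, and volcano_type are placeholder names):
library(DALEXtra)

explainer <- explain_tidymodels(
  rf_fit,                                             # a fitted workflow
  data = dplyr::select(volcano_train, -volcano_type),
  y = volcano_train$volcano_type,
  label = "random forest"
)

set.seed(345)
vip_parts <- DALEX::model_parts(
  explainer,
  loss_function = DALEX::loss_cross_entropy,  # a multiclass-appropriate loss
  B = 10                                      # number of permutations
)
plot(vip_parts)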
Ahh, awesome, I'll have a read - thank you 🙂
@juliasilge and @smithhelen sorry I'm late to the party. Starting work on the next version of vip now. In short, permutation importance works the same way for multiclass problems as it does for the binary and regression cases. In fact, the vip, iml, and ingredients (the DALEX package for variable importance) packages are all flexible enough to support ANY type of model; even ones built in Python. You just need to supply a suitable metric function and a corresponding prediction wrapper. Here's a somewhat minimal example using a multiclass random forest and the Brier score metric via yardstick:
library(ranger)
set.seed(1028)
rfo <- ranger(Species ~ ., data = iris, probability = TRUE)
p <- predict(rfo, data = iris)$predictions
head(p)
# setosa versicolor virginica
# [1,] 1.0000000 0.0000000000 0.0000000000
# [2,] 0.9963333 0.0030000000 0.0006666667
# [3,] 1.0000000 0.0000000000 0.0000000000
# [4,] 1.0000000 0.0000000000 0.0000000000
# [5,] 1.0000000 0.0000000000 0.0000000000
# [6,] 0.9994286 0.0005714286 0.0000000000
# Multiclass Brier score
yardstick::brier_class_vec(iris$Species, estimate = p)
# Prediction wrapper; to use multiclass Brier score, needs to return matrix of
# predicted probabilities
pfun <- function(object, newdata) {
  predict(object, data = newdata)$predictions
}
# Metric function; just a thin wrapper around yardstick's Brier score function
mfun <- function(actual, predicted) {
  yardstick::brier_class_vec(actual, estimate = predicted)
}
# Compute permutation importance
vi_permute(
  rfo,
  train = iris,
  target = "Species",
  metric = mfun,
  pred_wrapper = pfun,        # tells vip how to get predictions from this model
  smaller_is_better = TRUE,   # vip has no idea if smaller or larger is better
  nsim = 10
)
# # A tibble: 4 × 3
# Variable Importance StDev
# <chr> <dbl> <dbl>
# 1 Sepal.Length 0.00867 0.00103
# 2 Sepal.Width 0.00223 0.000552
# 3 Petal.Length 0.149 0.00885
# 4 Petal.Width 0.171 0.00798
# Same, but with sorted output
vi(
  rfo,
  method = "permute",
  train = iris,
  target = "Species",
  metric = mfun,
  pred_wrapper = pfun,        # tells vip how to get predictions from this model
  smaller_is_better = TRUE,   # vip has no idea if smaller or larger is better
  nsim = 10
)
# # A tibble: 4 × 3
# Variable Importance StDev
# <chr> <dbl> <dbl>
# 1 Petal.Width 0.178 0.0116
# 2 Petal.Length 0.151 0.0120
# 3 Sepal.Length 0.00921 0.000942
# 4 Sepal.Width 0.00232 0.000662
Note that I am working to incorporate yardstick into the package to make it a bit easier by not having to write your own metric function each time (but that's where the flexibility comes in). Also, I wrote vip with scale in mind, and it's seemingly faster than the alternatives, so keep that in mind. A simple benchmark can be found in our R Journal article (Figure 16). It's also parallelizable via the foreach package for larger problems.
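For example, a hedged sketch of running the permutation importance above in parallel, assuming the foreach-backed parallel argument is available in your installed version of vi_permute():
library(doParallel)

registerDoParallel(cores = 4)   # register a foreach backend

vi_permute(
  rfo,
  train = iris,
  target = "Species",
  metric = mfun,
  pred_wrapper = pfun,
  smaller_is_better = TRUE,
  nsim = 10,
  parallel = TRUE               # distribute the permutations across workers
)

stopImplicitCluster()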