In classification problems, merging `probably` package when determining best threshold.

Open SHo-JANG opened this issue 2 years ago • 3 comments

As far as I can understand, we're using prob_to_class_2 as the default option when predicting class.

prob_to_class_2 <- function(x, object) {
  x <- ifelse(x >= 0.5, object$lvl[2], object$lvl[1])
  unname(x)
}

However, in many cases the best threshold is not 0.5, especially with imbalanced datasets.

In this case, I wonder if we could use the threshold_perf() function from the probably package during the tuning process to check how well the model classifies at thresholds other than 0.5.

I think it's a really necessary feature. What do you think?
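
To make the idea concrete, here is a minimal sketch of what such a check could look like outside of tuning. It assumes the probably package is installed and uses its bundled segment_logistic data (columns .pred_good, .pred_poor and Class); the threshold grid and the choice of Youden's J as the selection criterion are only illustrative assumptions, not a recommendation.

```r
library(probably)

# Example data shipped with probably: class probabilities plus the true class
data("segment_logistic", package = "probably")

# Candidate thresholds to compare (an arbitrary illustrative grid)
thresholds <- seq(0.3, 0.9, by = 0.1)

# Compute the default metrics (including j_index) at each threshold,
# treating the first factor level of Class as the event
perf <- threshold_perf(segment_logistic, Class, .pred_good,
                       thresholds = thresholds)

# Keep the J-index rows and show the threshold where it is largest
j <- perf[perf$.metric == "j_index", ]
j[which.max(j$.estimate), c(".threshold", ".estimate")]
```

This only evaluates thresholds on predictions that already exist; it does not change what predict() returns for the model.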

SHo-JANG avatar Jul 08 '23 07:07 SHo-JANG

It is an important feature. After the posit conference, we will be working on post-processing tools and this is one of them.

We'll try to make it natural so that you can treat the threshold parameter like any other tuning parameter. If you use a workflow, it will also adjust the hard class predictions automatically (once you've picked a threshold).

topepo avatar Jul 08 '23 14:07 topepo

Thank you so much for all the hard work you do to make the system more complete.

SHo-JANG avatar Jul 09 '23 07:07 SHo-JANG

I think that treating the threshold as a hyperparameter and tuning it to find the optimal value would be time-consuming and could lead to overfitting.

Instead, I searched for a way to determine the optimal threshold directly (related paper).

In Section 2.3, "Threshold criteria", criterion (6) is PredPrev = Obs. This means that we want the class ratio of the predicted result to be equal to the ratio of the observed classes in the training data, i.e., we use quantile(probs = 1 - "Obs class ratio") of the predicted probability vector as the threshold.

The code to implement this in the training process is as follows.


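prob_to_class_2_custom <- function(x, object) {
  # Observed event rate in the training data (assumes object$fit$y is coded 0/1)
  obs_ratio <- object$fit$y |> mean()
  # Cutoff chosen so that the predicted event rate matches the observed one
  pred_equal_obs_threshold <- quantile(x, probs = 1 - obs_ratio)
  x <- ifelse(x >= pred_equal_obs_threshold, object$lvl[2], object$lvl[1])
  unname(x)
}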
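
As a quick sanity check, the function can be run on simulated data. This is only a sketch: toy_object is a hypothetical stand-in that fills in just the two fields the function reads (it is not a real parsnip model object), and the Beta-distributed probabilities are invented for illustration.

```r
# The function from above, repeated so this snippet runs on its own
prob_to_class_2_custom <- function(x, object) {
  obs_ratio <- mean(object$fit$y)
  pred_equal_obs_threshold <- quantile(x, probs = 1 - obs_ratio)
  x <- ifelse(x >= pred_equal_obs_threshold, object$lvl[2], object$lvl[1])
  unname(x)
}

# Hypothetical stand-in for a fitted model: 10% events, two class labels
toy_object <- list(
  fit = list(y = rep(c(0, 1), times = c(90, 10))),
  lvl = c("no", "yes")
)

# Simulated predicted probabilities, mostly small as in an imbalanced problem
set.seed(1)
probs <- rbeta(1000, shape1 = 1, shape2 = 9)

default_class <- ifelse(probs >= 0.5, "yes", "no")        # fixed 0.5 cutoff
custom_class  <- prob_to_class_2_custom(probs, toy_object) # quantile cutoff

mean(default_class == "yes")  # share predicted "yes" with the 0.5 cutoff
mean(custom_class == "yes")   # share predicted "yes" with the quantile cutoff
```

One thing to keep in mind is that the cutoff is computed from whatever probability vector is passed in, so it changes with the batch being predicted. For a single row, the quantile equals that row's probability and the result is always the second level.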

I would like to use this function as the default option. However, it seems that I need to redefine the engine to apply this function. Is there any way to use this function in an existing engine?

SHo-JANG avatar Oct 26 '23 05:10 SHo-JANG

Long time no see😝 We've got some good news here, though: custom probability thresholds and other postprocessing functionality are now available via tailors, which can be added to workflows in the dev version of the workflows package. You can read more about that work in this blog post.

Since these changes will otherwise live on the tailor repo, I'm going to go ahead and close!
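
For readers landing here, a rough sketch of what that could look like, assuming the tailor package and a workflows version with tailor support are installed. The outcome column Class, the logistic regression model, and the 0.1 threshold are placeholders chosen for illustration, not values from this thread.

```r
library(tailor)
library(workflows)
library(parsnip)

# Postprocessor: predict the event class whenever its probability is >= 0.1
# (0.1 is an arbitrary illustrative value)
post <- tailor() |>
  adjust_probability_threshold(threshold = 0.1)

# Attach the postprocessor to a workflow; `Class` is a hypothetical outcome
wflow <- workflow() |>
  add_formula(Class ~ .) |>
  add_model(logistic_reg()) |>
  add_tailor(post)
```

Once such a workflow is fitted, its hard class predictions should reflect the adjusted threshold rather than the 0.5 default.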

simonpcouch avatar Oct 08 '24 16:10 simonpcouch

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex: https://reprex.tidyverse.org) and link to this issue.

github-actions[bot] avatar Oct 24 '24 01:10 github-actions[bot]