modAL
Adding uncertainty calculation with a score threshold
Hi modAL team, I have been using the classifier_uncertainty function in the uncertainty module:
def classifier_uncertainty(classifier: BaseEstimator, X: modALinput, **predict_proba_kwargs) -> np.ndarray:
    """
    Classification uncertainty of the classifier for the provided samples.

    Args:
        classifier: The classifier for which the uncertainty is to be measured.
        X: The samples for which the uncertainty of classification is to be measured.
        **predict_proba_kwargs: Keyword arguments to be passed for the :meth:`predict_proba` of the classifier.

    Returns:
        Classifier uncertainty, which is 1 - P(prediction is correct).
    """
    # calculate uncertainty for each point provided
    try:
        classwise_uncertainty = classifier.predict_proba(X, **predict_proba_kwargs)
    except NotFittedError:
        return np.ones(shape=(X.shape[0], ))

    # for each point, select the maximum uncertainty
    uncertainty = 1 - np.max(classwise_uncertainty, axis=1)
    return uncertainty
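For reference, here is a minimal sketch of how classifier_uncertainty behaves on a fitted scikit-learn model; the toy data and the LogisticRegression model are my own illustration, not part of the thread:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from modAL.uncertainty import classifier_uncertainty

    # toy binary problem, purely illustrative
    X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
    y_train = np.array([0, 0, 1, 1])
    clf = LogisticRegression().fit(X_train, y_train)

    # uncertainty is 1 - max predicted class probability, so samples near
    # the decision boundary (around x = 1.5 for this symmetric toy data) score highest
    X_pool = np.array([[0.5], [1.5], [2.5]])
    print(classifier_uncertainty(clf, X_pool))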
While this
uncertainty = 1 - np.max(classwise_uncertainty, axis=1)
works perfectly with classification models, I have come across a situation where I had a regression model/probabilistic classifier with a user-defined threshold (applied to the predicted probabilities) to determine the class. For example, with threshold=0.75, a sample with a predicted probability >0.75 is labelled TRUE, otherwise FALSE. The uncertainty for such a sample is then the difference between the probability and the threshold.
If you think this would be a good feature to implement, I have written a function similar to classifier_uncertainty with an added threshold argument to deal with this scenario, and I am happy to contribute it. If this has already been done in the existing codebase, please let me know. Many thanks!
Hi!
Sure, this sounds like a useful uncertainty function! I would be happy if you could add it. Do you have any literature references using this method?
I have a question about it. Suppose that we set a 0.6 threshold and we have a 0.9 predicted probability. Would the uncertainty be 0.3 in this case? It seems strange that this instance would have the same uncertainty as a possibly very different instance with a predicted probability of 0.3.
Hey Tivadar, thanks for your comment!
The way I calculate the uncertainty is 1 / (1 + abs(threshold - score)), where the threshold is the boundary value set by the user and the score is the predicted probability for a particular data point. In your example, the uncertainty would be 1/(1 + |0.6 - 0.9|) = 0.77, and for the other instance 1/(1 + |0.6 - 0.3|) = 0.77, so yes, the two instances would be considered equally uncertain.
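A quick numerical check of those two values, purely for illustration:

    # both cases print 0.77: the distance to the 0.6 threshold is 0.3 either way
    threshold = 0.6
    for score in (0.9, 0.3):
        print(round(1 / (1 + abs(threshold - score)), 2))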
Hence, my proposed function will be:
def boundary_uncertainty(classifier: BaseEstimator, X: modALinput, score_threshold, normalize=True, **predict_proba_kwargs) -> np.ndarray:
    """
    Measure the uncertainty of the classifier in making decisions around the score/probability threshold.

    Args:
        classifier: The probabilistic classifier or binary model for which the uncertainty is to be measured.
        X: The samples for which the uncertainty of classification is to be measured.
        score_threshold: The threshold that sets the boundary for the binary classification.
        normalize: Whether to normalize the uncertainty within the given sample set.
        **predict_proba_kwargs: Keyword arguments to be passed for the :meth:`predict_proba` of the classifier.

    Returns:
        Classifier uncertainty, based on how close the predicted scores are to the decision boundary.
    """
    # calculate uncertainty for each point provided
    try:
        scores = classifier.predict_proba(X, **predict_proba_kwargs)[:, 1]
    except NotFittedError:
        return np.ones(shape=(X.shape[0], ))

    # for each point, take the reciprocal of the distance between the predicted score and the threshold
    uncertainty = 1 / (1 + np.abs(score_threshold - scores))

    if normalize:
        # rescale the uncertainties to [0, 1] within the given sample set
        return (uncertainty - uncertainty.min()) / (uncertainty.max() - uncertainty.min())
    return uncertainty
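As a quick illustration of how this could be called (my own sketch, assuming the boundary_uncertainty function above is in scope together with the numpy and NotFittedError imports it relies on; the toy data is not from the thread):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # toy binary problem, purely illustrative
    X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
    y_train = np.array([0, 0, 1, 1])
    clf = LogisticRegression().fit(X_train, y_train)

    # samples whose predicted probability of the positive class lies close to
    # the 0.75 threshold come out as the most uncertain
    X_pool = np.array([[0.5], [1.8], [2.5]])
    print(boundary_uncertainty(clf, X_pool, score_threshold=0.75, normalize=False))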
In terms of literature references, I do not have any yet. Please bear in mind that the uncertainty measured here is not the model's error rate or anything of that sort; it merely measures how decisively the model assigns a data point to class A or class B. In my mind this fits the purpose of active learning, as we would like the model to see more points around the boundary value.
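To make that connection to active learning concrete, here is a rough sketch of how the proposed function could be wrapped as a custom modAL query strategy. This is my own illustration, not part of the proposal; it assumes boundary_uncertainty above is defined, and that, as in recent modAL versions (to the best of my knowledge), a custom query strategy returns the indices of the selected samples (older versions expected an (indices, samples) tuple instead).

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from modAL.models import ActiveLearner

    def boundary_sampling(classifier, X, n_instances=1, score_threshold=0.75, **kwargs):
        # rank the unlabelled pool by closeness of the predicted score to the threshold,
        # using the boundary_uncertainty function proposed above
        utility = boundary_uncertainty(classifier, X, score_threshold, normalize=False, **kwargs)
        # indices of the n_instances most uncertain samples
        return np.argsort(utility)[-n_instances:]

    # same style of toy data as in the previous sketch
    X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
    y_train = np.array([0, 0, 1, 1])
    X_pool = np.array([[0.2], [1.4], [2.2], [2.8]])

    learner = ActiveLearner(
        estimator=LogisticRegression(),
        query_strategy=boundary_sampling,
        X_training=X_train, y_training=y_train,
    )
    # the learner asks for the pool samples closest to the score threshold
    query_idx, query_samples = learner.query(X_pool, n_instances=2)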
Open to discussion & any modification :) I forked the repo and committed the initial change; will look into writing a unit test.