
Possible bug in binary classification `calibration_error`

Open cwognum opened this issue 3 years ago • 1 comments

🐛 Bug

In calibration_error(), I think the accuracies in the binary classification setting are not computed correctly: the function just returns the targets. Shouldn't this rather be target == preds.round().int() or something similar? Am I missing something?

Code example

import torch
from torchmetrics.functional.classification import calibration_error

preds = torch.tensor([0.01, 0.001, 0.005])  # The raw sigmoid output
targets = torch.tensor([1, 1, 1])
calibration_error(preds, targets)
# This returns: tensor(0.9947)

The model confidently predicts the wrong class, but is rewarded with a near-perfect calibration score.
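For what it's worth, the reported value can be reproduced with a plain-Python sketch of an l1 expected calibration error. The uniform 15-bin scheme and the reading of per-bin "accuracy" as the observed positive rate are assumptions about how the metric works internally, not torchmetrics' actual code:

```python
# Hypothetical sketch of l1 expected calibration error (ECE) with uniform
# bins.  Confidence is the predicted probability of the positive class and
# per-bin "accuracy" is the observed frequency of positive targets.

def ece(confidences, targets, n_bins=15):
    # Assign every prediction to one of n_bins equal-width bins over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, tgt in zip(confidences, targets):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, tgt))
    total = len(confidences)
    error = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        avg_acc = sum(t for _, t in members) / len(members)
        # Weight each bin's |accuracy - confidence| gap by its population.
        error += (len(members) / total) * abs(avg_acc - avg_conf)
    return error

print(round(ece([0.01, 0.001, 0.005], [1, 1, 1]), 4))  # 0.9947
```

All three predictions land in the lowest bin with mean confidence ~0.0053 while the observed positive rate there is 1.0, so the gap is ~0.9947, matching the reported score.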

Environment

  • TorchMetrics version (and how you installed TM, e.g. conda, pip, build from source):

    • Version 0.9.1
    • Installed with mamba
  • Python & PyTorch Version (e.g., 1.0):

    • Python: 3.9.13
    • PyTorch: 1.11.0.post202
  • Any other relevant information such as OS (e.g., Linux):

    • I am on Ubuntu, Linux.

cwognum avatar Jun 21 '22 16:06 cwognum

Added a little example to better illustrate my point.

By the way, a 0-vector would have been a simpler example, but it turns out the preds can't be exactly 0 due to how the binning is done. It could make sense to clamp the predictions in the binning process to prevent this, e.g.:

torch.clip(confidences, 1e-6, 1.0)

cwognum avatar Jun 22 '22 14:06 cwognum

Hi, I checked this issue as part of a bigger refactor (see issue https://github.com/Lightning-AI/metrics/issues/1001 and PR https://github.com/Lightning-AI/metrics/pull/1195), and it seems that our calibration error is computing the right value.

First, in the example provided the metric gives a score of 0.9947. Since the metric is a calibration error, the optimum is 0, not 1, so it seems correct that the metric gives a high score: the example is clearly not well calibrated.
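To illustrate the "optimum is 0" point, here is a hedged plain-Python sketch of a uniform-bin l1 expected calibration error (the 15-bin scheme and the per-bin "accuracy = observed positive rate" reading are assumptions, not torchmetrics' actual code). Confidently correct predictions on the same targets score near 0, while the confidently wrong ones from the issue score near 1:

```python
# Hypothetical sketch of l1 ECE with uniform bins, showing that the
# metric's optimum is 0: a lower value means better calibration.

def ece(confidences, targets, n_bins=15):
    bins = [[] for _ in range(n_bins)]
    for conf, tgt in zip(confidences, targets):
        idx = min(int(conf * n_bins), n_bins - 1)  # equal-width bins on [0, 1]
        bins[idx].append((conf, tgt))
    total = len(confidences)
    error = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        avg_acc = sum(t for _, t in members) / len(members)
        error += (len(members) / total) * abs(avg_acc - avg_conf)
    return error

print(round(ece([0.01, 0.001, 0.005], [1, 1, 1]), 4))   # confidently wrong -> 0.9947
print(round(ece([0.99, 0.999, 0.995], [1, 1, 1]), 4))   # confidently right -> 0.0053
```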

Secondly, I ran the example through a third-party package, https://github.com/fabiankueppers/calibration-framework, which gives the same result as our implementation (we are in fact using it for testing now).

Therefore, there does not seem to be an error in the implementation. Closing the issue.

SkafteNicki avatar Aug 30 '22 13:08 SkafteNicki