torchmetrics
Recall and Specificity values are the same
🐛 Bug
I'm using the Recall and Specificity module metrics when training a binary classification model, and the graphs in MLFlow are exactly the same. Please see the code below for how I'm using the module metrics. I'm pretty sure I'm doing it correctly because I'm following the torchmetrics docs as closely as I can.
Code sample
from torchmetrics import AUROC, MetricCollection, Recall, Specificity

class ModelMetrics:
    def __init__(self, out_features: int):  # out_features comes from the model head
        train_metrics = [
            AUROC(
                num_classes=out_features,
                average='weighted' if out_features > 2 else 'macro',
            ),
        ]
        valid_metrics = [
            AUROC(
                num_classes=out_features,
                average='weighted' if out_features > 2 else 'macro',
            ),
            Specificity(
                num_classes=out_features,
                average='weighted' if out_features > 2 else 'macro',
            ),
            Recall(
                num_classes=out_features,
                average='weighted' if out_features > 2 else 'macro',
            ),
        ]
        self.train_metrics: MetricCollection = MetricCollection(train_metrics, prefix='train_')
        self.valid_metrics: MetricCollection = MetricCollection(valid_metrics, prefix='valid_')

# <phase> is either 'train' or 'valid' (the LightningModule step / epoch-end hooks)
def <phase>_step(...):
    if phase == 'train':
        head.train_metrics.update(preds, targets)
    elif phase == 'valid':
        head.valid_metrics.update(preds, targets)

def <phase>_epoch_end(...):
    for head in self.model.model_heads:
        # Use Metrics API
        if phase == 'train':
            for metric_name, metric in head.train_metrics.items():
                self.log(f'{head.name}_{metric_name}', metric)
        elif phase == 'valid':
            for metric_name, metric in head.valid_metrics.items():
                self.log(f'{head.name}_{metric_name}', metric)
    return
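For reference, here is a stripped-down sketch of the binary case on raw tensors (the tensors, the seed, and the out_features == 2 assumption are made up for illustration, not taken from the original run); on torchmetrics 0.9.x both entries of the collection report the same value:

import torch
from torchmetrics import MetricCollection, Recall, Specificity

torch.manual_seed(42)                       # fixed seed, purely illustrative
preds = torch.randn(32, 2).softmax(dim=-1)  # head with out_features == 2
target = torch.randint(0, 2, (32,))         # binary ground truth

metrics = MetricCollection(
    [
        Specificity(num_classes=2, average='macro'),
        Recall(num_classes=2, average='macro'),
    ],
    prefix='valid_',
)
metrics.update(preds, target)
print(metrics.compute())  # 'valid_Recall' and 'valid_Specificity' come out identical

This also doubles as the kind of fixed-seed, raw-tensor example requested further down in the thread.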
Expected behavior
Expecting Recall and Specificity to be different values.
Environment
- TorchMetrics version (and how you installed TM, e.g. conda, pip, build from source): pip, 0.9.2
- Python & PyTorch Version (e.g., 1.0): Python 3.8.12, torch 1.12.0+cu116, pytorch-lightning 1.6.4
- Any other relevant information such as OS (e.g., Linux): AWS EC2 p3.8xlarge on Deep Learning AMI
Additional context
If you guys want any other info please let me know. See below for MLFlow graphs.

Hi! Thanks for your contribution, great first issue!
Can you actually give a few sample data points (just raw tensor numbers) that you think should produce different outputs? I am asking because I checked both the formula and the implementation again and they seem to be correct.
Ideally you can just generate random numbers with a fixed seed. But as I said, for me this is not reproducible.
This issue will be fixed by the classification refactor: see this issue https://github.com/Lightning-AI/metrics/issues/1001 and this PR https://github.com/Lightning-AI/metrics/pull/1195 for all changes.
Small recap: this issue describes that the Recall and Specificity metrics come out the same in the binary setting, which is wrong. The problem with the current implementation is that the metrics are calculated as an average over the 0 and 1 classes, which essentially makes all the scores collapse into the same value.
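To make the recap concrete, here is a small hand-worked check with made-up confusion-matrix counts: averaged over class 0 and class 1, recall and specificity are both (sensitivity + specificity) / 2, so they can never differ.

tp, fp, tn, fn = 4, 1, 3, 2   # hypothetical binary confusion-matrix counts

recall_pos = tp / (tp + fn)   # recall of class 1 (sensitivity)
recall_neg = tn / (tn + fp)   # recall of class 0 (equals specificity of class 1)
macro_recall = (recall_pos + recall_neg) / 2

spec_pos = tn / (tn + fp)     # specificity of class 1
spec_neg = tp / (tp + fn)     # specificity of class 0 (equals recall of class 1)
macro_specificity = (spec_pos + spec_neg) / 2

assert macro_recall == macro_specificity   # both equal (sensitivity + specificity) / 2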
Using the new binary_* versions of all the metrics:
import torch
from torchmetrics.functional import binary_recall, binary_specificity

preds = torch.rand(10)
target = torch.randint(0, 2, (10,))
binary_recall(preds, target)       # tensor(0.5000)
binary_specificity(preds, target)  # tensor(0.6250)
which also corresponds to what sklearn gives. Sorry for the confusion this has given rise to. The issue will be closed when https://github.com/Lightning-AI/metrics/pull/1195 is merged.
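For completeness, a small cross-check sketch against scikit-learn (the example tensors below are made up for illustration, not taken from the snippet above):

import torch
from sklearn.metrics import confusion_matrix
from torchmetrics.functional import binary_recall, binary_specificity

preds = torch.tensor([0.9, 0.8, 0.3, 0.2, 0.7, 0.1, 0.6, 0.4])
target = torch.tensor([1, 1, 1, 0, 0, 0, 0, 0])

# binary_* metrics threshold probabilistic preds at 0.5 by default
hard_preds = (preds > 0.5).int()
tn, fp, fn, tp = confusion_matrix(target.numpy(), hard_preds.numpy(), labels=[0, 1]).ravel()

print(binary_recall(preds, target), tp / (tp + fn))       # recall      = TP / (TP + FN)
print(binary_specificity(preds, target), tn / (tn + fp))  # specificity = TN / (TN + FP)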