
F1, Accuracy, Precision and Recall all output the same value consistently in a binary classification setting.

Open FeryET opened this issue 2 years ago • 11 comments

🐛 Bug

I am trying to report F1, Accuracy, Precision and Recall for a binary classification task. I have collected these metrics in a MetricCollection module, and run them for my train, val and test stages. Upon inspecting the results, I can see that all of these metrics are showing the exact same value.

To Reproduce

Create a random binary classification task and add these metrics together in a metric collection.

Code sample

I have uploaded a very minimal example in this notebook. As you can see, the values reported by torchmetrics don't align with classification_report.
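
Not the notebook itself, but a minimal sketch of the kind of setup that reproduces this, assuming the torchmetrics API from around the time of the report (roughly 0.7; older releases expose F1 instead of F1Score):

import torch
from torchmetrics import MetricCollection, Accuracy, F1Score, Precision, Recall

# all metrics are left at their defaults, i.e. micro averaging over both classes
metrics = MetricCollection({
    "accuracy": Accuracy(),
    "f1": F1Score(),
    "precision": Precision(),
    "recall": Recall(),
})

preds = torch.rand(100)               # predicted probabilities for the positive class
target = torch.randint(0, 2, (100,))  # random binary labels

print(metrics(preds, target))         # all four entries report the exact same value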

Expected behavior

F1, Precision, Recall and Accuracy should usually differ. It should be very unlikely to see all of them match exactly.

Environment

  • PyTorch Version (e.g., 1.0): 1.10.0
  • OS (e.g., Linux): Ubuntu 20.04.
  • How you installed PyTorch: conda.
  • Build command you used (if compiling from source):
  • Python version: 3.9.
  • CUDA/cuDNN version: None.
  • GPU models and configuration: None.
  • Any other relevant information: None.

Additional context

I also asked this question in a discussion yesterday, thinking it was a problem on my part, but after looking into the situation, I think this might be a bug.

https://github.com/PyTorchLightning/metrics/discussions/743

FeryET avatar Jan 12 '22 06:01 FeryET

Hi! Thanks for your contribution, great first issue!

github-actions[bot] avatar Jan 12 '22 06:01 github-actions[bot]

Seems to be a duplicate of #543. If you still find this in need, feel free to reopen :rabbit:

Borda avatar Jan 12 '22 11:01 Borda

> Seems to be a duplicate of #543. If you still find this in need, feel free to reopen :rabbit:

Sorry to reopen the issue, but I think this is at the very least a documentation problem. The way the micro option is computed is confusing. Can you explain how exactly these averaging options work in torchmetrics? I read the documentation before using these metrics and was sure micro was the flag I should have used. :-/ Below I write down what I initially understood from the documentation, so you can see why I was confused.

From what I know, regardless of micro or macro averaging, F1 and Accuracy should yield different results. With micro averaging, Accuracy should look at all samples and compute the fraction of matching predictions, while with macro averaging it should compute class-wise accuracy and then average the per-class values (weighted or not). F1, on the other hand, is predominantly a binary classification metric computed from precision and recall (their harmonic mean); in a multiclass setting this should be done one-vs-rest and then averaged (or weighted-averaged). So I don't understand micro F1 vs macro F1 at all. Same with Precision and Recall.
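
For concreteness, here is a small scikit-learn sketch with made-up numbers showing the kind of behaviour I expected: the plain binary scores differ from each other and from accuracy, whereas micro-averaging over both the 0 and 1 classes makes them all collapse onto accuracy (which seems to be what torchmetrics is doing):

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

y_true = np.array([1, 1, 0, 0, 0, 1])
y_pred = np.array([0, 1, 1, 1, 0, 1])

# micro averaging over both classes collapses everything onto accuracy
print(accuracy_score(y_true, y_pred))                     # 0.5
print(precision_score(y_true, y_pred, average='micro'))   # 0.5
print(recall_score(y_true, y_pred, average='micro'))      # 0.5
print(f1_score(y_true, y_pred, average='micro'))          # 0.5

# the plain binary metrics look only at the positive class and differ
print(precision_score(y_true, y_pred, average='binary'))  # 0.5
print(recall_score(y_true, y_pred, average='binary'))     # 0.666...
print(f1_score(y_true, y_pred, average='binary'))         # 0.571...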

FeryET avatar Jan 12 '22 11:01 FeryET

cc: @SkafteNicki @aribornstein

Borda avatar Jan 12 '22 11:01 Borda

Hi, I'm having the same issue on:

  • torch 1.9.1
  • torchmetrics 0.7.2

dangne avatar Mar 22 '22 03:03 dangne

Hi, I'm having the same issue with Accuracy.

Waterkin avatar Apr 06 '22 13:04 Waterkin

I have encountered this issue as well; here is a Colab notebook that reproduces the issue and the solution.

I agree with @FeryET, the setup is confusing and it would be great if there were a warning or a better example to showcase the difference.

ma7dev avatar Apr 21 '22 08:04 ma7dev

Also, can we add a "binary" option for average so that we can compute the original recall score for binary classes?

Like

import sklearn.metrics as metrics
import numpy as np

a = np.array([1, 1, 0, 0, 0, 1])
b = np.array([0, 1, 1, 1, 0, 1])

metrics.recall_score(a, b, average='binary') # 0.6666666666666666

rasbt avatar Jun 03 '22 16:06 rasbt

I am also having this issue as well - is there a simple way to fix this?

griff4692 avatar Jun 09 '22 02:06 griff4692

Hi, I encountered a similar issue when using the Precision metric in a MetricCollection. However, the output was always zero rather than consistent with the other metrics. Setting compute_groups to False fixed my problem. Hope this helps.

lucienwang1009 avatar Jun 17 '22 08:06 lucienwang1009

I encountered the same issue. As @lucienwang1009 said, initializing MetricCollection with compute_groups=False works. For example,

from torchmetrics import MetricCollection, Precision
MetricCollection(
    {'P@8': Precision(num_classes=8), 'P@15': Precision(num_classes=15)},
    compute_groups=False
)

Some detailed observations:

  • My results for P@8 and P@15 are correct on the validation data, but the values of P@8 and P@15 are exactly the same when evaluating on the test set. I think the bug might be related to the data.
  • When I include only one of [P@8, P@15] while running inference on the test data, the values are correct.
  • compute_groups might group [P@8, P@15] together and perform some incorrect operation that causes the problem.

The bug is not easy to observe, and I spent hours checking other places like data pre-processing, training and testing scripts, package versions, and so on. I think this is a critical bug that needs to be solved as soon as possible, or, more simply, the default value of compute_groups should be False.

JamesLYC88 avatar Jul 20 '22 17:07 JamesLYC88

This issue will be fixed by the classification refactor: see this issue https://github.com/Lightning-AI/metrics/issues/1001 and this PR https://github.com/Lightning-AI/metrics/pull/1195 for all the changes.

Small recap: this issue describes that the F1, Accuracy, Precision and Recall metrics are all the same in the binary setting, which is wrong. The problem with the current implementation is that the metrics are calculated as an average over the 0 and 1 classes, which essentially makes all the scores collapse into the same value.

Using the new binary_* versions of all the metrics:

from torch import tensor
from torchmetrics.functional import binary_accuracy, binary_precision, binary_recall, binary_f1_score
preds = tensor([0.4225, 0.5042, 0.1142, 0.4134, 0.0978, 0.1402, 0.9422, 0.4846, 0.1639, 0.6613])
target = tensor([1, 1, 1, 1, 1, 1, 1, 0, 1, 1])
binary_accuracy(preds, target) # tensor(0.4000)
binary_recall(preds, target) # tensor(0.3333)
binary_precision(preds, target) # tensor(1.)
binary_f1_score(preds, target) # tensor(0.5000)

which also corresponds to what sklearn gives. Sorry for the confusion this has given rise to. The issue will be closed when https://github.com/Lightning-AI/metrics/pull/1195 is merged.
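
For reference, a small cross-check sketch (not part of the refactor itself), thresholding the same probabilities at 0.5 and feeding them to scikit-learn:

import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

preds = np.array([0.4225, 0.5042, 0.1142, 0.4134, 0.0978,
                  0.1402, 0.9422, 0.4846, 0.1639, 0.6613]) >= 0.5
target = np.array([1, 1, 1, 1, 1, 1, 1, 0, 1, 1])

print(accuracy_score(target, preds))   # 0.4
print(recall_score(target, preds))     # 0.3333...
print(precision_score(target, preds))  # 1.0
print(f1_score(target, preds))         # 0.5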

SkafteNicki avatar Aug 28 '22 11:08 SkafteNicki