Cannot use `average="none"` in v0.11?
🐛 Bug
Using version 0.11 and trying to use Dice with average="none", I get the following error:
ValueError: The `reduce` none is not valid.
Additional context
There is a check for the allowed average values in the Dice class. This permits "none".
However, comparing v0.11 and v0.10.3, it looks like a second check was added further down in the class in PR #1252:
if average not in ["micro", "macro", "samples"]:
    raise ValueError(f"The `reduce` {average} is not valid.")
Since that check doesn't allow an average of "none", and since average isn't redefined above it, I think this secondary check added in #1252 causes the bug.
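For illustration, a minimal sketch of the kind of change that would avoid this (my guess, not the actual torchmetrics source): validate the full set of average options and derive the internal reduce value from average instead of rejecting "weighted" and "none" outright.

# Hypothetical fix sketch; the identifiers mirror the snippet above, but this is
# not the real torchmetrics code.
allowed_average = ("micro", "macro", "weighted", "samples", "none")
if average not in allowed_average:
    raise ValueError(f"The `average` has to be one of {allowed_average}, got {average}.")
# "weighted" and "none" need per-class statistics, which in the pre-0.11
# StatScores convention corresponds to reduce="macro" (assumption on my part).
kwargs["reduce"] = average if average in ("micro", "macro", "samples") else "macro"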
Hi! Thanks for your contribution, great first issue!
Hey,
Is there any progress on this? :)
I stumbled upon the same problem and needed a quick fix, so I dug a bit deeper. It seems more complicated than just removing this if statement. TL;DR: I wrote a quick-and-dirty fix that allows instantiating the Dice class with average set to "weighted" or "none" without any errors. However, I did not check whether the computed score is actually correct. Feel free to check it out:
https://github.com/blazejdolicki/metrics/commit/5a226431f6beee606b1db9a45bab1f19fdd449d9
Furthermore, I don't understand some of the changes made by #1252 in __init__ of Dice. For instance, right now it assigns kwargs["reduce"] only after calling super().__init__(**kwargs), so the new value of kwargs["reduce"] is essentially never used. I have very little knowledge of the library's internals, but this doesn't seem right. Looking at that PR, it seems that the "weighted" and "none" averages weren't taken into account at all in the changes.
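To illustrate the ordering problem with a standalone toy example (hypothetical classes, not the torchmetrics source): once super().__init__(**kwargs) has returned, mutating kwargs has no effect on the parent.

# Toy reproduction of the ordering issue described above.
class Parent:
    def __init__(self, reduce="micro", **kwargs):
        self.reduce = reduce  # the value is consumed here

class Child(Parent):
    def __init__(self, average="micro", **kwargs):
        super().__init__(**kwargs)   # Parent reads kwargs (or its default) now
        kwargs["reduce"] = average   # too late: Parent never sees this value

print(Child(average="macro").reduce)  # prints "micro", not "macro"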
Please let me know if I'm missing something.
Update: when updating the metric state, I ran into a new error about a shape mismatch of self.tp. It seems I needed to change how self.reduce was initialized. My guess is that self.reduce used to be initialized in the superclass (Metric), but in #1252 it was moved to the Dice class with an incorrect initialization. Here is a commit with the update: https://github.com/blazejdolicki/metrics/commit/f94b9e1ee16b9f162d64733ac7de3fe0663058c8
Now the training script finishes without errors.
Hi everyone, thanks for reporting this issue.
We are actually in the process of completely deprecating Dice, as it is mathematically the same as F1Score, so please use that metric instead. There is some work going on in PR https://github.com/Lightning-AI/metrics/pull/1090 (sadly a bit stagnant at the moment) that will deprecate Dice and instead add GeneralizedDice as a metric.
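For reference, the per-class identity behind "mathematically the same": with TP, FP and FN counted per class, Dice = 2*TP / (2*TP + FP + FN), which is exactly the per-class F1 score, so the two metrics coincide class by class.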
I cannot get the expected Dice result using MulticlassF1Score. See the toy example below, where I calculate the Dice for classes 1 and 2 (ignoring 0). My implementation and the MONAI implementation agree, but the F1 score does not (and it also still returns a tensor whose length is the number of classes, including class 0).
from monai.metrics import DiceMetric as dice
from torchmetrics.classification import MulticlassF1Score as F1_score
import torch
from torch.nn.functional import one_hot

calc = dice(include_background=False, reduction='mean_batch')
F1 = F1_score(num_classes=3, multidim_average='global', average='none', ignore_index=0)

def with_monai(pred, gt, calc):
    # MONAI expects one-hot encoded tensors in channel-first layout (N, C, H, W)
    pred = one_hot(pred, num_classes=3)
    gt = one_hot(gt, num_classes=3)
    pred = torch.permute(pred, (0, 3, 1, 2))
    gt = torch.permute(gt, (0, 3, 1, 2))
    return calc(pred, gt)

def my_dice(pred, gt):
    # per-class Dice = 2 * |pred ∩ gt| / (|pred| + |gt|) for classes 1 and 2
    pred1map = (pred == 1).type(torch.int64)
    pred2map = (pred == 2).type(torch.int64)
    gt1map = (gt == 1).type(torch.int64)
    gt2map = (gt == 2).type(torch.int64)
    c1 = 2 * torch.sum(pred1map * gt1map) / (torch.sum(pred1map) + torch.sum(gt1map))
    c2 = 2 * torch.sum(pred2map * gt2map) / (torch.sum(pred2map) + torch.sum(gt2map))
    return torch.tensor([c1, c2])

test_pred = torch.tensor([[[1, 1], [2, 0]]]).type(torch.int64)
test_gt = torch.tensor([[[1, 0], [2, 1]]]).type(torch.int64)

print(test_pred.shape)
# torch.Size([1, 2, 2])
print(test_pred)
# tensor([[[1, 1],
#          [2, 0]]])
print(test_gt)
# tensor([[[1, 0],
#          [2, 1]]])
print(F1(test_pred, test_gt))
# tensor([0.0000, 0.6667, 1.0000])
print(with_monai(test_pred, test_gt, calc))
# tensor([[0.5000, 1.0000]])
print(my_dice(test_pred, test_gt))
# tensor([0.5000, 1.0000])
Okay, I have figured it out!
The issue is due to my misunderstanding of the ignore_index argument.
Ignoring index 0 actually changes the results for indices 1 and 2, because the interpretation is "if pred OR gt is 0, do not consider this voxel", so false positives in the background and false negatives where we predict background are not counted in the Dice score. I would have expected the ignore_index option to simply not calculate metrics for the ignored index (and maybe output a tensor with one fewer element, e.g. [index1=x, index2=y] rather than [index0=0, index1=x, index2=y]).
On Fri, Jul 28, 2023 at 8:46 AM Jan Karwowski wrote:
I think that you should set multidim_average="samplewise" to achieve the same results.
The issue is due to my misunderstanding of the ignore_index argument.
Would you think that some rewording is needed?
Personally, I can't think of any situation where the current behaviour is useful (but perhaps that is just bias from my own field of work). I think this would not be the expected behaviour for most people, and unless they carefully tested the function they wouldn't notice what is actually happening and would end up with inflated metrics.
My vote would be to change the computation, but if not, then yes, I think rewording would be helpful.
I can confirm the same behaviour of ignore_index for other multiclass classification metrics (precision, recall, F1) on v1.2.0.
When using the ignore_index value, that class is simply filtered out from the labels, so the corresponding row of the confusion matrix is zero. Filtering is done here: stat_scores.py
When calculating with macro averaging, the value is biased towards 0 because the TP count of the ignored class is 0, so that class contributes a score of 0 to the average. The averaging is performed here: compute.py
The weighted averaging removes this problem because the weight of the ignored class is set to 0, but then the other classes are re-weighted as well according to their frequency.
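As a quick check of the macro-averaging point (my own sketch, reusing the toy tensors from the example earlier in the thread; the behaviour described is as observed on v1.2.0 per the comment above and may differ in other versions):

import torch
from torchmetrics.classification import MulticlassF1Score

preds = torch.tensor([[[1, 1], [2, 0]]])
target = torch.tensor([[[1, 0], [2, 1]]])

# Per-class scores with the background ignored: the ignored class's row of the
# confusion matrix is zeroed out, so its score shows up as 0.
per_class = MulticlassF1Score(num_classes=3, average="none", ignore_index=0)(preds, target)
# Macro average with the same settings: per the analysis above, that 0 is still
# averaged in, pulling the result below per_class[1:].mean().
macro = MulticlassF1Score(num_classes=3, average="macro", ignore_index=0)(preds, target)

print(per_class)  # tensor([0.0000, 0.6667, 1.0000]) in the example above
print(macro)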
I agree with @Jesse-Phitidis: the current implementation either accounts for all classes (bias towards class imbalance) or doesn't account for false positives (promotes trivial classifiers). The parameter ignore_index sounds like it ignores that class in the final averaging for macro or weighted (which would give the more complete result), but it actually just ignores any TPs and FPs for that class: if y=0 and the model says 1, that is fine (but not vice versa).
My suggestion is to keep the current ignore_index (with more elaborate docs on its effect) for backwards compatibility with any unforeseen usages people might have for it, and to add a new parameter background_index that stops the background class from contributing to the final macro or weighted averaging while keeping its influence on the FPs. Combining the two parameters would also fix the bias towards 0 of the current implementation.
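In the meantime, one possible workaround that mimics the proposed background_index (a hypothetical parameter, not something torchmetrics currently has): compute per-class scores without ignore_index, so background false positives and false negatives still count, then average only over the foreground classes yourself.

import torch
from torchmetrics.classification import MulticlassF1Score

background_index = 0  # hypothetical: the class to exclude from the averaging only
metric = MulticlassF1Score(num_classes=3, average="none")  # no ignore_index, so background FPs/FNs are counted

preds = torch.tensor([[[1, 1], [2, 0]]])
target = torch.tensor([[[1, 0], [2, 1]]])

per_class = metric(preds, target)  # per-class F1 equals per-class Dice
foreground = [c for c in range(3) if c != background_index]
print(per_class[foreground])         # tensor([0.5000, 1.0000]) on the toy example, matching the MONAI result
print(per_class[foreground].mean())  # macro average over foreground classes only

This keeps the background's influence on the per-class counts while dropping it from the final average, which is the behaviour the proposed parameter describes.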