
`binary_auroc` has various bugs on MPS device

Open · GabrielDamsholt opened this issue · 6 comments

🐛 Bug

The function torchmetrics.functional.classification.binary_auroc has at least two bugs when run on an MPS device. Bug 1 seems more serious than bug 2, and I suspect the two are related. (A third issue, a wrong return type hint, is noted at the end.)

To demonstrate the two bugs, we will use the following code.

import torch
from torchmetrics.functional.classification import binary_auroc

torch.manual_seed(42)

device_cpu = torch.device("cpu")
device_mps = torch.device("mps")

def test_auroc(n, thresholds_cpu, thresholds_mps):
    # Random predictions and random binary targets of length n on the CPU.
    preds_cpu = torch.rand(n, device=device_cpu)
    target_cpu = torch.rand(n, device=device_cpu).round().int()

    # Identical copies of the same data on the MPS device.
    preds_mps = preds_cpu.clone().to(device_mps)
    target_mps = target_cpu.clone().to(device_mps)

    auroc_cpu = binary_auroc(preds_cpu, target_cpu, thresholds=thresholds_cpu)
    auroc_mps = binary_auroc(preds_mps, target_mps, thresholds=thresholds_mps)

    print("CPU AUROC: ", auroc_cpu.item())
    print("MPS AUROC: ", auroc_mps.item())

Bug 1: with no thresholds and enough data, the AUROC score on an MPS device is always 0

Calling

test_auroc(2**16, None, None)

prints

CPU AUROC:  0.4983136057853699
MPS AUROC:  0.4983136057853699

which seems reasonable. However, calling

test_auroc(2**17, None, None)

prints

CPU AUROC:  0.49919775128364563
MPS AUROC:  0.0

which seems wrong. I would have expected the MPS AUROC score to be identical to the CPU score, and, with even greater certainty, I would have expected both scores to be much closer to 0.5 than to 0.
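As a rough sanity check of that expectation (my own back-of-the-envelope estimate, not something torchmetrics computes): under purely random scores, the AUROC follows the null distribution of the Mann-Whitney U statistic, which is approximately normal around 0.5 with standard deviation sqrt((n1 + n2 + 1) / (12 * n1 * n2)).

n1 = n2 = 2**16  # approximate positive/negative counts for n = 2**17
std = ((n1 + n2 + 1) / (12 * n1 * n2)) ** 0.5
print(std)  # ~0.0016, so 0.4992 is unremarkable while 0.0 is hundreds of stds away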

Bug 2: with thresholds and sufficiently small data, the AUROC score on an MPS device deviates from the CPU score

Calling

test_auroc(2**17, 128, 128)

prints

CPU AUROC:  0.4991825520992279
MPS AUROC:  0.4991825520992279

which seems reasonable. However, calling

test_auroc(2**16, 128, 128)

prints

CPU AUROC:  0.4983268976211548
MPS AUROC:  0.4983268678188324

which seems wrong. I would have expected the MPS AUROC score to be identical to the CPU score, but it deviates slightly. I am not sure whether to expect such a deviation, but I am surprised that it deviates for small inputs and not for large ones. This could be because the deviations cancel out for sufficiently large input, but since the change happens at exactly the same input size as the one triggering bug 1, I suspect bug 2 is related to bug 1.

Bug 3: wrong type hint in return type in binary_auroc

The return type hint for torchmetrics.functional.classification.binary_auroc says it returns a Tuple[Tensor, Tensor, Tensor], but as seen in the code above it really returns just a Tensor. I believe the type hint should be corrected to Union[Tensor, Tuple[Tensor, Tensor, Tensor]].
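For illustration, the corrected annotation would look something like this (a sketch only; the parameter list is abbreviated, and the real signature in torchmetrics has additional arguments such as max_fpr, ignore_index and validate_args):

from typing import List, Optional, Tuple, Union

from torch import Tensor

# Sketch: abbreviated signature with the suggested return annotation.
def binary_auroc(
    preds: Tensor,
    target: Tensor,
    thresholds: Optional[Union[int, List[float], Tensor]] = None,
) -> Union[Tensor, Tuple[Tensor, Tensor, Tensor]]:
    ...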

Environment

  • TorchMetrics version (from pip): 0.11.4
  • Python version: 3.11.2
  • PyTorch Version: 2.0.0
  • OS: macOS

GabrielDamsholt · Apr 23 '23

Hi @GabrielDamsholt, thanks for reporting the issues. Sadly I do not have access to an MPS-accelerated device, so I will have a hard time debugging this. Maybe @justusschock can help out with issue 1, which seems to be the most severe.

Here is some input on the rest:

Bug 1

When you report:

test_auroc(2**16, None, None)
CPU AUROC:  0.4983136057853699
test_auroc(2**17, 128, 128)
CPU AUROC:  0.4991825520992279

I can confidently say that this is within the range of values you can expect, and that you should not necessarily expect a value of exactly 0.5 even for very large random input. Just try adding the following code inside test_auroc:

from sklearn.metrics import roc_auc_score
print("sklearn AUROC: ", roc_auc_score(target_cpu.cpu(), preds_cpu.cpu()))

and you will see that sklearn returns the exact same values as we do. Calculating the area under the ROC curve is an approximation, so it is never exact.
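To make the approximation point concrete, here is a minimal sketch (variable names are illustrative) that computes the same quantity by trapezoidal integration of the ROC curve returned by binary_roc, which should agree with binary_auroc on the same inputs:

import torch
from torchmetrics.functional.classification import binary_roc

# With thresholds=None, every unique prediction is used as a threshold, and the
# AUROC is the trapezoidal area under the resulting (fpr, tpr) curve.
preds = torch.rand(2**16)
target = torch.rand(2**16).round().int()
fpr, tpr, _ = binary_roc(preds, target, thresholds=None)
print(torch.trapezoid(tpr, fpr).abs().item())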

Bug 2

You report:

CPU AUROC:  0.4983268976211548
MPS AUROC:  0.4983268678188324

which is a difference on the scale of ~1e-8. Again, I think this is within the precision that can be expected, especially between different device architectures. This is also the precision at which we run almost all of our correctness tests, so I will not guarantee anything beyond it. I wonder what your application is if you need this kind of precision?
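Incidentally, the two numbers you report are adjacent float32 values, exactly one ulp apart, which a quick check confirms (assuming both metrics are computed in single precision):

import torch

a = torch.tensor(0.4983268976211548)  # reported CPU result, as float32
b = torch.tensor(0.4983268678188324)  # reported MPS result, as float32
print((a - b).item())  # ~2.98e-08, i.e. 2**-25, one ulp at ~0.498
print(torch.nextafter(b, torch.tensor(1.0)).item() == a.item())  # True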

Bug 3

You are completely right here. I will send a PR with the fix soon :]

SkafteNicki · Apr 24 '23

Hello @SkafteNicki, thanks for looking into this. You have misunderstood some of my remarks; I will attempt to clarify below.

Regarding your remark on bug 1: the fact that AUROC on CPU returns different results for n = 2**16 and n = 2**17, regardless of the thresholds used, and that both results differ from 0.5, is not an issue at all and is to be expected, as you write. The sole issue here is that AUROC on MPS returns 0 for n >= 2**17.

Regarding your remark on bug 2: I still think this is a bug, but you are right that it is of no practical consequence to me. The reason I think it is a bug is, as explained in my original post, that the AUROC scores differ between CPU and MPS only when n <= 2**16; had they also differed for n >= 2**17, I would not believe it to be a bug. I have a feeling bug 2 is a direct consequence of bug 1, and it might help in tracking bug 1 down.

GabrielDamsholt · Apr 24 '23

> Hello @SkafteNicki, thanks for looking into this. You have misunderstood some of my remarks; I will attempt to clarify below.

I am so sorry, it was never my intention to misunderstand your remarks :]

> Regarding your remark on bug 1: the fact that AUROC on CPU returns different results for n = 2**16 and n = 2**17, regardless of the thresholds used, and that both results differ from 0.5, is not an issue at all and is to be expected, as you write. The sole issue here is that AUROC on MPS returns 0 for n >= 2**17.

Alright, we can completely agree that this is an issue and it should be looked into. I wonder whether it is something we do in torchmetrics or some torch operator that has a bug for large inputs (when thresholds=None).

> Regarding your remark on bug 2: I still think this is a bug, but you are right that it is of no practical consequence to me. The reason I think it is a bug is, as explained in my original post, that the AUROC scores differ between CPU and MPS only when n <= 2**16; had they also differed for n >= 2**17, I would not believe it to be a bug. I have a feeling bug 2 is a direct consequence of bug 1, and it might help in tracking bug 1 down.

Alright, I get your point. Let's try to get bug 1 solved first and then see whether bug 2 is solved with it. If it still persists, I am not sure we will look further into it given how small the difference is, but it should at least be explored in case there is an obvious problem.

SkafteNicki · Apr 25 '23

> Alright, I get your point. Let's try to get bug 1 solved first and then see whether bug 2 is solved with it. If it still persists, I am not sure we will look further into it given how small the difference is, but it should at least be explored in case there is an obvious problem.

@justusschock could you pls help here?

Borda · Aug 24 '23

@Borda @SkafteNicki is there any plan to fix this?

mvpatel2000 · Mar 14 '24

I could reproduce the error on an Apple M3 Pro chip with torch==2.2.1. The issue boils down to an error in torch at this line.

It turns out that padding large vectors is not well supported on Apple Silicon chips. The following snippet summarises the issue:

import torch
import torch.nn.functional as F

vec_cpu = F.pad(torch.arange(2**15, device="cpu"), [0, 1], value=2)
print(vec_cpu)
# tensor([    0,     1,     2,  ..., 32766, 32767,     2])

vec_mps = F.pad(torch.arange(2**15, device="mps"), [0, 1], value=2)
print(vec_mps)
# tensor([    0,     1,     2,  ..., 32766, 32767,     2], device='mps:0')

vec_cpu_large = F.pad(torch.arange(2**17, device="cpu"), [0, 1], value=2)
print(vec_cpu_large)
# tensor([     0,      1,      2,  ..., 131070, 131071,      2])

vec_mps_large = F.pad(torch.arange(2**17, device="mps"), [0, 1], value=2)
print(vec_mps_large)
# tensor([2, 0, 0,  ..., 0, 0, 0], device='mps:0')

I will report the issue back to pytorch, but in the meantime a simple fix could be to overload the pad operation so that it is performed on CPU when the input tensor is on an MPS device?
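A minimal sketch of that idea (safe_pad is a hypothetical helper, not an existing torch or torchmetrics function):

import torch
import torch.nn.functional as F

# Hypothetical workaround: route the pad through the CPU whenever the input
# tensor lives on an MPS device, then move the result back.
def safe_pad(t: torch.Tensor, pad, value=0.0) -> torch.Tensor:
    if t.device.type == "mps":
        return F.pad(t.cpu(), pad, value=value).to(t.device)
    return F.pad(t, pad, value=value)

if torch.backends.mps.is_available():
    vec = safe_pad(torch.arange(2**17, device="mps"), [0, 1], value=2)
    print(vec[-3:])  # ends in ..., 131070, 131071, 2 as expected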

hyenal · Mar 15 '24