nDCG calculations on GPU are 2x slower than on CPU
🐛 Bug
Hi TorchMetrics Team,
In the following example, nDCG calculation with GPU tensors takes about twice as long as with CPU tensors or a numpy array.
To Reproduce
The code was tested on both Google Colab and a Slurm cluster.
Code sample
import timeit
import numpy as np
import torch
from sklearn.metrics import ndcg_score
from torchmetrics.functional.retrieval import retrieval_normalized_dcg
# p and t are examples given by both sklearn and torchmetrics
p = [.1, .2, .3, 4, 70] * 100
t = [10, 0, 0, 1, 5] * 100
number = int(1e4)
# 1. BENCHMARK: numpy array
preds = np.asarray([p])
target = np.asarray([t])
def a():
    return ndcg_score(target, preds)
print(f'numpy array: {timeit.timeit("a()", setup="from __main__ import a", number=number):.4f}')
# 2. cpu tensor
preds_cpu = torch.tensor(p)
target_cpu = torch.tensor(t)
assert preds_cpu.device == torch.device("cpu")
def b():
    retrieval_normalized_dcg(preds_cpu, target_cpu)
print(f'CPU tensor: {timeit.timeit("b()", setup="from __main__ import b", number=number):.4f}')
# 3. gpu tensor
preds_gpu = torch.tensor(p, device="cuda")
target_gpu = torch.tensor(t, device="cuda")
assert preds_gpu.device == torch.device("cuda:0")
def c():
    retrieval_normalized_dcg(preds_gpu, target_gpu)
print(f'GPU tensor: {timeit.timeit("c()", setup="from __main__ import c", number=number):.4f}')
Results:
# Tesla T4
numpy array: 6.4896
CPU tensor: 5.8501
GPU tensor: 10.4120
I also tested the code on the Slurm cluster I'm currently using; the GPU there is an A100.
# A100
numpy array: 3.8700
CPU tensor: 2.9305
GPU tensor: 7.7575
Expected behavior
The performance with GPU tensors, if not superior, should at least be close to that with CPU tensors.
Environment
- TorchMetrics version (and how you installed TM, e.g. conda, pip, build from source): 1.2.1 (pip)
- Python & PyTorch Version (e.g., 1.0): Python 3.10.12 and 3.10.13, Torch 2.1.0 and 2.1.1
- Any other relevant information such as OS (e.g., Linux): Ubuntu 22.04.3 LTS and Linux 5.4.204-ql-generic-12.0-19 x86_64
Additional context
nDCG calculation with GPU tensors takes about twice as long as with CPU tensors or a numpy array.
Thank you for bringing this up. Have you also observed it with other metrics than NDCG?
I only tested NDCG at the time I submitted the issue, but now I think I understand the cause of the issue.
The inferior performance with GPU tensors results from the fact that the current implementation of NDCG does not utilize the parallel computation provided by the GPU: TorchMetrics NDCG only accepts 1D tensors as inputs.
To support this observation, I tried another metric, multilabel_precision. The results show that calculation on GPU is faster than on CPU when there are hundreds of instances; however, when there is only one instance, calculation on CPU is faster than on GPU.
Script for the multilabel_precision performance test
import timeit
import torch
from torchmetrics.functional.classification import multilabel_precision
number = int(1e3)
# change 400 to 1 for comparison experiments
y_true = torch.randint(2, (400, 300))
y_pred = torch.randint(2, (400, 300))
# CPU tensor
target_cpu = y_true.clone().detach()
preds_cpu = y_pred.clone().detach()
assert target_cpu.device == torch.device("cpu")
def cpu():
    return multilabel_precision(preds_cpu, target_cpu, num_labels=300)
print(f'CPU tensor: {timeit.timeit("cpu()", setup="from __main__ import cpu", number=number):.4f}')
# GPU tensor
target_gpu = y_true.clone().detach().to(device="cuda")
preds_gpu = y_pred.clone().detach().to(device="cuda")
assert target_gpu.device == torch.device("cuda:0")
def gpu():
    return multilabel_precision(preds_gpu, target_gpu, num_labels=300)
print(f'GPU tensor: {timeit.timeit("gpu()", setup="from __main__ import gpu", number=number):.4f}')
# Results with 400 instances:
CPU tensor: 3.6518
GPU tensor: 0.8089
# Results with 1 instance:
CPU tensor: 0.1848
GPU tensor: 0.6217
Is there any particular reason that torchmetrics NDCG only accepts a single instance instead of a batch? If not, I suggest that NDCG accept batched inputs.
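For reference, with the current 1D-only interface a batch of queries has to be handled with a Python-level loop, roughly like this (the batch shape and the final averaging are my own illustration, not part of the torchmetrics API):

import torch
from torchmetrics.functional.retrieval import retrieval_normalized_dcg

preds = torch.rand(32, 100, device="cuda")           # 32 queries, 100 items each
target = torch.randint(5, (32, 100), device="cuda")  # graded relevance labels

# one call per query, since the functional metric works on a single 1D query
scores = torch.stack([retrieval_normalized_dcg(p, t) for p, t in zip(preds, target)])
mean_ndcg = scores.mean()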
@donglihe-hub thanks for reporting this issue. Sorry for the long reply time from my side.
I have been looking at the implementation of our metric for some time now, and it is not correct that the implementation does not use parallel computation on GPU. Just because the input is 1D does not mean that the computations cannot be parallelized.
For example, doing a simple sum is equally fast regardless of whether the input is a 1D or 2D tensor.
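For illustration, a comparison along those lines could look like this (sizes and timing setup are mine, not from the original comment):

import timeit
import torch

# the same one million elements, laid out as a 1D tensor and as a 2D batch
x_1d = torch.randn(1_000_000, device="cuda")
x_2d = torch.randn(1_000, 1_000, device="cuda")

def sum_1d():
    torch.sum(x_1d)
    torch.cuda.synchronize()  # wait for the kernel before the timer stops

def sum_2d():
    torch.sum(x_2d)
    torch.cuda.synchronize()

print(f"1D sum: {timeit.timeit(sum_1d, number=1000):.4f}")
print(f"2D sum: {timeit.timeit(sum_2d, number=1000):.4f}")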
Looking at the code, it seems that the operation that takes up the most computational time is the torch.unique call used in the metric. From small experiments, it seems that this operation alone is a bottleneck: the torch GPU implementation is ~15 times slower than CPU for large arrays.
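A small comparison along those lines could be set up like this (tensor size and repeat count are my own choices for illustration; the ~15x figure above comes from the maintainer's own experiments, not from this sketch):

import timeit
import torch

# one large integer tensor, on CPU and on GPU
x_cpu = torch.randint(0, 100, (1_000_000,))
x_gpu = x_cpu.to("cuda")

def unique_cpu():
    torch.unique(x_cpu)

def unique_gpu():
    torch.unique(x_gpu)
    torch.cuda.synchronize()  # wait for the GPU kernel so the host-side timing is meaningful

print(f"CPU unique: {timeit.timeit(unique_cpu, number=100):.4f}")
print(f"GPU unique: {timeit.timeit(unique_gpu, number=100):.4f}")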
I am not sure whether we can actually optimize the code, or whether the operations used in the NDCG metric simply do not parallelize that well on GPU. I will try to investigate further.
Hi!
I'm running into the same issue: the NDCG metric calculation takes so long that it becomes impractical to use during training. Calculating the NDCG metric at every step with a tensor of size around (8000, 40) [batch_size, list_size] takes about 2 s to complete, far longer than the model's forward pass.
After looking into the metric class implementation, I believe it is not because of the torch.unique function, but because of a fundamental design flaw in RetrievalMetric: the class splits the input tensor by the indexes into a list of tensors and iterates sequentially over that list, which is very slow when the number of query groups is high (see the sketch below).
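A rough schematic of that split-and-loop pattern, not the actual torchmetrics code (shapes follow the (8000, 40) example above; the grouping via argsort/bincount/split is my own reconstruction): the splitting itself is cheap, but the Python-level loop launches a separate batch of small GPU kernels per query group.

import torch
from torchmetrics.functional.retrieval import retrieval_normalized_dcg

num_queries, list_size = 8000, 40  # illustrative sizes, matching the example above
preds = torch.rand(num_queries * list_size, device="cuda")
target = torch.randint(2, (num_queries * list_size,), device="cuda")
indexes = torch.arange(num_queries, device="cuda").repeat_interleave(list_size)

# group the flattened inputs by query index ...
order = torch.argsort(indexes)
counts = torch.bincount(indexes).tolist()
preds_groups = torch.split(preds[order], counts)
target_groups = torch.split(target[order], counts)

# ... then evaluate each query group sequentially; this Python loop over
# 8000 tiny 1D tensors is where the time goes
scores = [retrieval_normalized_dcg(p, t) for p, t in zip(preds_groups, target_groups)]
result = torch.stack(scores).mean()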
The TensorFlow Ranking implementation of the nDCG metric with the same inputs takes only about 50 ms to complete.