torchmetrics
Add Normalized Discounted Cumulative Gain (nDCG) Score for Classification
🚀 Feature
Normalized Discounted Cumulative Gain (nDCG) score for classification.
Motivation
We are using torchmetrics nDCG score in multi-label classification. (Reference)
from torchmetrics import RetrievalNormalizedDCG
At the end of each validation step (that is, after one data batch; see the toy example below), we call update() on the nDCG metric. (Reference)
"""
Multi-label Example (batch_size = 2, num_classes = 3):
    >>> indexes = tensor([0, 0, 0, 1, 1, 1])
    >>> preds = tensor([0.2, 0.3, 0.5, 0.1, 0.3, 0.5])
    >>> target = tensor([0, 0, 1, 0, 1, 0])
"""
def _shared_eval_step_end(self, batch_parts):
    batch_size, num_classes = batch_parts['target'].shape
    # `indexes` indicates which sample each prediction belongs to.
    # `RetrievalNormalizedDCG` computes the mean of the nDCG scores over all samples.
    indexes = torch.arange(
        batch_size*batch_parts['batch_idx'], batch_size*(batch_parts['batch_idx']+1))
    indexes = indexes.unsqueeze(1).repeat(1, num_classes)
    return self.eval_metric.update(
        preds=batch_parts['pred_scores'],
        target=batch_parts['target'],
        indexes=indexes
    )
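To make the layout of `indexes` concrete, here is a minimal plain-Python sketch that mirrors the `torch.arange(...).unsqueeze(1).repeat(1, num_classes)` construction above. The helper name `make_indexes` is hypothetical and used only for illustration, and the sketch assumes every batch has the same size.

```python
def make_indexes(batch_idx, batch_size, num_classes):
    """Return one row per sample, with the sample's global id repeated per class.

    Hypothetical helper: a list-based stand-in for the tensor construction
    in `_shared_eval_step_end`, assuming all batches share `batch_size`.
    """
    start = batch_size * batch_idx
    return [[sample_id] * num_classes for sample_id in range(start, start + batch_size)]


# First batch of the toy example (batch_size = 2, num_classes = 3).
first_batch = make_indexes(batch_idx=0, batch_size=2, num_classes=3)
# Second batch continues the numbering, so ids stay unique across batches.
second_batch = make_indexes(batch_idx=1, batch_size=2, num_classes=3)

# Flattening the first batch gives the `indexes` shown in the toy example.
flattened = [i for row in first_batch for i in row]
print(flattened)
print(second_batch)
```

Each sample id appears once per class, so the retrieval metric treats every sample as one query with `num_classes` candidates.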
At the end of the validation epoch (that is, after all data batches), we call compute() on the nDCG metric. (Reference)
def _shared_eval_epoch_end(self, step_outputs, split):
    """Get scores such as `Micro-F1`, `Macro-F1`, and monitor metrics defined
    in the configuration file at the end of an epoch.

    Args:
        step_outputs (list): List of the return values from the val or test step end.
        split (str): One of `val` or `test`.

    Returns:
        metric_dict (dict): Scores for all metrics in the dictionary format.
    """
    metric_dict = self.eval_metric.compute()
    self.log_dict(metric_dict)
    for k, v in metric_dict.items():
        metric_dict[k] = v.item()
    if self.log_path:
        dump_log(metrics=metric_dict, split=split, log_path=self.log_path)
    self.print(tabulate_metrics(metric_dict, split))
    self.eval_metric.reset()
    return metric_dict
The function update() is based on the following implementation in torchmetrics. (Reference)
def update(self, preds: Tensor, target: Tensor, indexes: Tensor) -> None:  # type: ignore
    """Check shape, check and convert dtypes, flatten and add to accumulators."""
    if indexes is None:
        raise ValueError("Argument `indexes` cannot be None")
    indexes, preds, target = _check_retrieval_inputs(
        indexes, preds, target, allow_non_binary_target=self.allow_non_binary_target, ignore_index=self.ignore_index
    )
    self.indexes.append(indexes)
    self.preds.append(preds)
    self.target.append(target)
In our scenario, the metric must hold 3 * #data * #classes values for the three tensors (indexes, preds, target) before compute() is called. Take the popular benchmark AmazonCat-13K, with #validation_data = 237K and #classes = 13K, as an example: evaluation needs at least 37GB of memory (3 * 237K * 13K * 4B per float). Our evaluation runs on a 16GB GPU, so this leads to a CUDA out-of-memory error. For larger datasets, the problem becomes even less tractable.
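The estimate above can be reproduced with a few lines of arithmetic. This is a back-of-the-envelope sketch, not a measurement: the function name `retrieval_state_bytes` is hypothetical, and it assumes 4 bytes per stored value for all three tensors, as in the figure quoted above.

```python
def retrieval_state_bytes(num_samples, num_classes, bytes_per_value=4, num_tensors=3):
    """Bytes needed to keep `num_tensors` dense (num_samples x num_classes) tensors.

    Hypothetical helper; assumes every value takes `bytes_per_value` bytes.
    """
    return num_tensors * num_samples * num_classes * bytes_per_value


# AmazonCat-13K validation figures quoted in this issue: 237K samples, 13K classes.
stored_bytes = retrieval_state_bytes(num_samples=237_000, num_classes=13_000)
print(f"accumulated state: {stored_bytes / 1e9:.1f} GB")

# For comparison: one 4-byte nDCG score per sample, as in a per-sample approach.
per_sample_bytes = 237_000 * 4
print(f"one score per sample: {per_sample_bytes / 1e6:.2f} MB")
```

Under these assumptions the accumulated state is about 37 GB, while keeping a single score per sample stays under 1 MB.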
Pitch
We are not sure whether update() is implemented this way to keep evaluation flexible for retrieval tasks. In our case, we do not need to store all of the results (indexes, preds, target) before computing nDCG. We can compute nDCG batch-wise and average the results at the end of the epoch (by calling compute()).
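The batch-wise idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the torchmetrics implementation: `ndcg` and `RunningMeanNDCG` are hypothetical names, there is no top-k cutoff, tie handling may differ from the library, and a sample without any positive label is scored as 0. It keeps only a running sum and a count, so memory use does not grow with the number of samples.

```python
import math


def ndcg(preds, target):
    """nDCG for one sample: DCG of the predicted ranking divided by the ideal DCG."""
    def dcg(relevances):
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

    # Relevance labels ordered by descending prediction score.
    ranked = [t for _, t in sorted(zip(preds, target), key=lambda pt: pt[0], reverse=True)]
    ideal = dcg(sorted(target, reverse=True))
    return dcg(ranked) / ideal if ideal > 0 else 0.0


class RunningMeanNDCG:
    """Hypothetical accumulator that stores only a sum and a sample count."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, batch_preds, batch_target):
        # Score each sample as soon as its batch arrives; nothing else is kept.
        for p, t in zip(batch_preds, batch_target):
            self.total += ndcg(p, t)
            self.count += 1

    def compute(self):
        return self.total / self.count


# The two samples of the toy example, fed as two separate batches.
metric = RunningMeanNDCG()
metric.update([[0.2, 0.3, 0.5]], [[0, 0, 1]])
metric.update([[0.1, 0.3, 0.5]], [[0, 1, 0]])
batchwise = metric.compute()

# Reference: score both samples after collecting everything.
all_at_once = (ndcg([0.2, 0.3, 0.5], [0, 0, 1]) + ndcg([0.1, 0.3, 0.5], [0, 1, 0])) / 2
print(batchwise, all_at_once)
```

Because nDCG is defined per sample and the final score is a mean over samples, accumulating per batch gives the same result as scoring everything at the end.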
Alternatives
If the batch-wise implementation is not feasible for retrieval, would it be possible to add a new implementation under classification? We provide one possible implementation below.
import torch
from torchmetrics import Metric
from torchmetrics.functional.retrieval.ndcg import retrieval_normalized_dcg

class nDCG(Metric):
    def __init__(
        self,
        top_k
    ):
        super().__init__()
        self.top_k = top_k
        # One nDCG score per sample; the list is concatenated across processes.
        self.add_state("ndcg_sum", default=[], dist_reduce_fx="cat")

    def update(self, preds, target):
        assert preds.shape == target.shape
        # Score each sample (row) right away instead of storing preds and target.
        self.ndcg_sum += [self._metric(p, t) for p, t in zip(preds, target)]

    def compute(self):
        # Average the per-sample nDCG scores.
        return torch.stack(self.ndcg_sum).mean()

    def _metric(self, preds, target):
        return retrieval_normalized_dcg(preds, target, k=self.top_k)
Additional context
We compare our above implementation with the current implementation in torchmetrics. The results of nDCG are the same, and our implementation is more time-efficient.
| Data / Method | nDCG | Time (s) |
|---|---|---|
| A / torchmetrics | 80.4379 | 154.70 |
| A / ours | 80.4379 | 52.96 |
| B / torchmetrics | 72.0053 | 414.94 |
| B / ours | 72.0053 | 227.77 |
Hi! Thanks for your contribution! Great first issue!
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi, is there any timeline to resolve this problem? The solution provided above works well. This is important for extreme multi-label experiments.