
Add Normalized Discounted Cumulative Gain (nDCG) Score for Classification

Open JamesLYC88 opened this issue 2 years ago • 4 comments

🚀 Feature

Normalized Discounted Cumulative Gain (nDCG) score for classification.

Motivation

We are using the torchmetrics nDCG score for multi-label classification. (Reference)

from torchmetrics import RetrievalNormalizedDCG

At the end of the validation step (i.e., after finishing one data batch; see the toy example below), we call update() on the nDCG metric. (Reference)

"""
Multi-label Example (batch_size = 2, num_classes = 3):
        >>> indexes = tensor([0, 0, 0, 1, 1, 1])
        >>> preds = tensor([0.2, 0.3, 0.5, 0.1, 0.3, 0.5])
        >>> target = tensor([0, 0, 1, 0, 1, 0])
"""
def _shared_eval_step_end(self, batch_parts):
    batch_size, num_classes = batch_parts['target'].shape
    # `indexes` indicates which query (sample) each prediction belongs to.
    # `RetrievalNormalizedDCG` computes the mean of the nDCG scores over all queries.
    indexes = torch.arange(
        batch_size*batch_parts['batch_idx'], batch_size*(batch_parts['batch_idx']+1))
    indexes = indexes.unsqueeze(1).repeat(1, num_classes)
    return self.eval_metric.update(
        preds=batch_parts['pred_scores'],
        target=batch_parts['target'],
        indexes=indexes
    )
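
For reference, this is roughly what the `indexes` tensor looks like for the toy example above (batch_size = 2, num_classes = 3, batch_idx = 0); `RetrievalNormalizedDCG` flattens it internally, so each row of the batch is treated as one query:

import torch

batch_size, num_classes, batch_idx = 2, 3, 0
indexes = torch.arange(batch_size * batch_idx, batch_size * (batch_idx + 1))
indexes = indexes.unsqueeze(1).repeat(1, num_classes)
print(indexes)
# tensor([[0, 0, 0],
#         [1, 1, 1]])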

At the end of the validation epoch (after all data batches have been processed), we call compute() on the nDCG metric. (Reference)

def _shared_eval_epoch_end(self, step_outputs, split):
    """Get scores such as `Micro-F1`, `Macro-F1`, and monitor metrics defined
    in the configuration file in the end of an epoch.
    Args:
        step_outputs (list): List of the return values from the val or test step end.
        split (str): One of the `val` or `test`.
    Returns:
        metric_dict (dict): Scores for all metrics in the dictionary format.
    """
    metric_dict = self.eval_metric.compute()
    self.log_dict(metric_dict)
    for k, v in metric_dict.items():
        metric_dict[k] = v.item()
    if self.log_path:
        dump_log(metrics=metric_dict, split=split, log_path=self.log_path)
    self.print(tabulate_metrics(metric_dict, split))
    self.eval_metric.reset()
    return metric_dict

The update() method has the following implementation in torchmetrics. (Reference)

def update(self, preds: Tensor, target: Tensor, indexes: Tensor) -> None:  # type: ignore
    """Check shape, check and convert dtypes, flatten and add to accumulators."""
    if indexes is None:
        raise ValueError("Argument `indexes` cannot be None")

    indexes, preds, target = _check_retrieval_inputs(
        indexes, preds, target, allow_non_binary_target=self.allow_non_binary_target, ignore_index=self.ignore_index
    )

    self.indexes.append(indexes)
    self.preds.append(preds)
    self.target.append(target)

In our scenario, before compute() is called, the metric needs to store 3 * #data * #classes values for its three accumulators (indexes, preds, target). Take the popular benchmark AmazonCat-13K, with #validation_data = 237K and #classes = 13K, for example: the evaluation requires at least 37 GB of memory (3 * 237K * 13K * 4-byte floats). Our evaluation runs on a 16 GB GPU, so this leads to CUDA out of memory. For larger datasets, the problem becomes even more intractable.
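
For concreteness, the back-of-the-envelope arithmetic (assuming 4 bytes per stored value; in practice `indexes` is typically int64, which would only increase the footprint):

num_samples = 237_000    # #validation_data in AmazonCat-13K
num_classes = 13_000     # #classes
num_states = 3           # indexes, preds, target accumulators
bytes_per_value = 4      # assumed 4-byte values

total_gb = num_states * num_samples * num_classes * bytes_per_value / 1e9
print(f"~{total_gb:.0f} GB")  # ~37 GB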

Pitch

We are not sure whether storing all inputs in update() is needed for flexibility in retrieval evaluation. In our case, we do not need to keep all of the results (indexes, preds, target) before computing nDCG: we can compute nDCG batch-wise and average the results at the end of the epoch (when compute() is called).
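
As an illustration of this batch-wise idea, a metric could keep only a running score sum and a sample count, so memory stays constant regardless of dataset size. This is a minimal sketch (the class name RunningNDCG and its states are ours, not part of torchmetrics), assuming the same retrieval_normalized_dcg signature used in the implementation further below:

import torch
from torchmetrics import Metric
from torchmetrics.functional.retrieval.ndcg import retrieval_normalized_dcg

class RunningNDCG(Metric):
    def __init__(self, top_k):
        super().__init__()
        self.top_k = top_k
        # Only two scalar states are kept, so memory does not grow with #data.
        self.add_state("score_sum", default=torch.tensor(0.0), dist_reduce_fx="sum")
        self.add_state("num_samples", default=torch.tensor(0), dist_reduce_fx="sum")

    def update(self, preds, target):
        # One nDCG score per sample (one row of the batch).
        for p, t in zip(preds, target):
            self.score_sum += retrieval_normalized_dcg(p, t, k=self.top_k)
            self.num_samples += 1

    def compute(self):
        return self.score_sum / self.num_samples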

Alternatives

If a batch-wise implementation cannot be realized in the retrieval package, would it be possible to add a new implementation under classification? We provide a possible implementation as follows.

import torch
from torchmetrics import Metric
from torchmetrics.functional.retrieval.ndcg import retrieval_normalized_dcg

class nDCG(Metric):
    def __init__(self, top_k):
        super().__init__()
        self.top_k = top_k
        # One nDCG score per sample; scores are concatenated across processes.
        self.add_state("ndcg_sum", default=[], dist_reduce_fx="cat")

    def update(self, preds, target):
        assert preds.shape == target.shape
        # Compute nDCG per sample (row) instead of accumulating raw predictions.
        self.ndcg_sum += [self._metric(p, t) for p, t in zip(preds, target)]

    def compute(self):
        # Average the per-sample scores collected over the epoch.
        return torch.stack(self.ndcg_sum).mean()

    def _metric(self, preds, target):
        return retrieval_normalized_dcg(preds, target, k=self.top_k)
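
For completeness, a hypothetical usage of the class above, mirroring the toy example from the motivation section (the values are illustrative):

import torch

metric = nDCG(top_k=3)
preds = torch.tensor([[0.2, 0.3, 0.5], [0.1, 0.3, 0.5]])
target = torch.tensor([[0, 0, 1], [0, 1, 0]])
metric.update(preds, target)
print(metric.compute())  # mean nDCG over the two samples
metric.reset()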

Additional context

We compared our implementation above with the current implementation in torchmetrics. The nDCG results are identical, and our implementation is more time-efficient.

| Data | Method       | nDCG    | time (s) |
|------|--------------|---------|----------|
| A    | torchmetrics | 80.4379 | 154.70   |
| A    | ours         | 80.4379 | 52.96    |
| B    | torchmetrics | 72.0053 | 414.94   |
| B    | ours         | 72.0053 | 227.77   |

JamesLYC88 avatar Apr 04 '22 17:04 JamesLYC88

Hi! thanks for your contribution!, great first issue!

github-actions[bot] avatar Apr 04 '22 17:04 github-actions[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jun 05 '22 20:06 stale[bot]

Hi, is there any timeline to resolve this problem? The solution provided above works well. This is important for extreme multi-label experiments.

sian-chen avatar Jun 09 '22 08:06 sian-chen

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Aug 13 '22 04:08 stale[bot]