:books: Add Documentation for OOD Metrics Computation and Interpretation
I'd like to contribute documentation for the OOD metrics used in ClassificationRoutine, starting with a METRICS.md file.
As a first-time user, I found the metrics difficult to interpret without understanding their implementation (#139). I reviewed ClassificationRoutine and its use of torchmetrics binary classification metrics for OOD evaluation. Below is a summary of how these metrics work—I'd appreciate your feedback before drafting the documentation.
Summary
- **Input to Metrics:** `ood_scores` are passed as `preds` to binary classification metrics. These scores depend on the `ood_criterion` (e.g., `"msp"`: `-confs`).
- **Normalization:** Binary metrics expect `preds` in [0, 1]; values outside this range are treated as logits and passed through a sigmoid. For `"msp"` (range ~[-1, -1/C]), this maps to roughly (0.27, 0.5).
- **Thresholding:** Metrics like AUROC/AUPR threshold `preds` dynamically and label predictions above the threshold as OOD.
- **Interpretation:** These metrics assess how well OOD samples can be separated from ID samples based on model confidence. Implicitly, this assumes a fixed rejection threshold during deployment: predictions below it are treated as OOD.
Let me know if I should proceed with a full draft or make adjustments to this summary first.
Hello @tonyzamyatin
Thank you for this very relevant comment! I'll have a look later today.
One point is that we should gather most of the documentation on the website that we're completely revamping (#142). We could include a specific page for the metrics.
However, the question of having an additional duplicate of the website's documentation in markdown, at least for the most critical parts, is interesting. Do you have an opinion on this @alafage ?
Agree, it's probably best to include it on the website rather than cluttering the working tree with markdown files.
Also, it's probably only worth documenting the OOD metrics, because they use binary classification metrics in a non-standard way. The other classification metrics are used normally and are already documented by torchmetrics.
I could get started on documenting the OOD metrics, since I'm writing a summary of all metrics for my thesis anyway.