
question about AdaptiveThreshold, possible bug?

Open davidebaltieri31 opened this issue 3 years ago • 5 comments

I'm running anomalib on a custom dataset (around 20k samples) and I've noticed constant crashes after a few epochs/validation steps (`DefaultCPUAllocator: can't allocate memory...`), almost independently of the model I choose.

It seems to be related to AdaptiveThreshold. It uses PrecisionRecallCurve internally, and after every validation step new data is added to it; old data is never deleted. Why? Shouldn't the data from previous validation steps be deleted?

davidebaltieri31 avatar Jul 27 '22 11:07 davidebaltieri31

Since PrecisionRecallCurve is computed on the complete data by default in torchmetrics, you will have to hold all your datapoints in memory. For large datasets, you can look at the binned PR curve, which torchmetrics implements with a constant memory requirement.
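The binned idea can be sketched without torchmetrics at all: instead of storing every prediction, accumulate TP/FP/FN counters on a fixed threshold grid. The class and method names below are hypothetical, a minimal illustration of the constant-memory trade-off rather than the actual torchmetrics implementation:

```python
import torch

class BinnedPRCurve:
    """Constant-memory PR curve sketch (hypothetical): accumulate TP/FP/FN
    counts on a fixed threshold grid instead of storing every prediction."""

    def __init__(self, num_thresholds: int = 100):
        self.thresholds = torch.linspace(0.0, 1.0, num_thresholds)
        self.tp = torch.zeros(num_thresholds)
        self.fp = torch.zeros(num_thresholds)
        self.fn = torch.zeros(num_thresholds)

    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:
        # (num_thresholds, batch) boolean matrix of predicted positives.
        pred_pos = preds.unsqueeze(0) >= self.thresholds.unsqueeze(1)
        actual_pos = target.bool().unsqueeze(0)
        self.tp += (pred_pos & actual_pos).sum(dim=1)
        self.fp += (pred_pos & ~actual_pos).sum(dim=1)
        self.fn += (~pred_pos & actual_pos).sum(dim=1)

    def compute(self):
        precision = self.tp / (self.tp + self.fp).clamp(min=1e-10)
        recall = self.tp / (self.tp + self.fn).clamp(min=1e-10)
        return precision, recall, self.thresholds
```

The state is three tensors of size `num_thresholds`, no matter how many batches you feed in, at the cost of an approximate curve (thresholds are fixed up front instead of taken from the data).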

ORippler avatar Aug 02 '22 07:08 ORippler

I think my post wasn't clear and I used the term "step" incorrectly. I understand that PrecisionRecallCurve works on all data, but why does it keep data from previous epochs? I.e. I run training and validation after each epoch: epoch 1 -> validation 1 -> epoch 2 -> validation 2 -> etc., which I think are the default settings in the repo config files. My question is: why does AdaptiveThreshold use PrecisionRecallCurve in a way that keeps data from both validation 1 and validation 2? Shouldn't it use the data from validation 2 only (i.e. validation on the latest trained model)?

davidebaltieri31 avatar Aug 03 '22 11:08 davidebaltieri31

> I think my post wasn't clear and I used the term "step" incorrectly. I understand that PrecisionRecallCurve works on all data, but why does it keep data from previous epochs? I.e. I run training and validation after each epoch: epoch 1 -> validation 1 -> epoch 2 -> validation 2 -> etc., which I think are the default settings in the repo config files. My question is: why does AdaptiveThreshold use PrecisionRecallCurve in a way that keeps data from both validation 1 and validation 2? Shouldn't it use the data from validation 2 only (i.e. validation on the latest trained model)?

Ahh I see. At a quick glance it seems as if the reset method is missing from AdaptiveThreshold.
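To illustrate why a missing reset matters, here is a toy stand-in for how such a metric holds state (the class and names are made up for the illustration): `update()` appends raw predictions, `compute()` uses all of them, and only `reset()` frees them.

```python
import torch

class CurveLikeMetric:
    """Toy stand-in for a curve metric: update() appends raw predictions,
    and only reset() ever frees them."""

    def __init__(self):
        self.preds = []  # grows with every update() until reset()

    def update(self, batch_preds: torch.Tensor) -> None:
        self.preds.append(batch_preds)

    def compute(self) -> int:
        # Stand-in for the real computation; just count held samples.
        return sum(p.numel() for p in self.preds)

    def reset(self) -> None:
        self.preds = []

metric = CurveLikeMetric()

# Epoch 1 validation: 3 batches of 1000 scores.
for _ in range(3):
    metric.update(torch.rand(1000))
print(metric.compute())  # 3000 samples held in memory

# Epoch 2 validation without reset(): epoch 1 data is still there.
for _ in range(3):
    metric.update(torch.rand(1000))
print(metric.compute())  # 6000 -- memory grows every epoch
```

With 20k samples per epoch (and pixel-level predictions on top), this unbounded growth is consistent with the allocator errors you are seeing.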

Can you try to add:

def reset(self):
    self.precision_recall_curve.reset()

to the class and see if that fixes the bug/problem? The class can be found here; overall, it should look as follows:

import torch
from torchmetrics import Metric, PrecisionRecallCurve


class AdaptiveThreshold(Metric):
    """Optimal F1 Metric.

    Compute the optimal F1 score at the adaptive threshold, based on the F1 metric of the true labels and the
    predicted anomaly scores.
    """

    def __init__(self, default_value: float = 0.5, **kwargs):
        super().__init__(**kwargs)

        self.precision_recall_curve = PrecisionRecallCurve(num_classes=1)
        self.add_state("value", default=torch.tensor(default_value), persistent=True)  # pylint: disable=not-callable
        self.value = torch.tensor(default_value)  # pylint: disable=not-callable

    # pylint: disable=arguments-differ
    def update(self, preds: torch.Tensor, target: torch.Tensor) -> None:  # type: ignore
        """Update the precision-recall curve metric."""
        self.precision_recall_curve.update(preds, target)

    def compute(self) -> torch.Tensor:
        """Compute the threshold that yields the optimal F1 score.

        Compute the F1 scores while varying the threshold. Store the optimal
        threshold as attribute and return the maximum value of the F1 score.

        Returns:
            Value of the F1 score at the optimal threshold.
        """
        precision: torch.Tensor
        recall: torch.Tensor
        thresholds: torch.Tensor

        precision, recall, thresholds = self.precision_recall_curve.compute()
        f1_score = (2 * precision * recall) / (precision + recall + 1e-10)
        if thresholds.dim() == 0:
            # special case where recall is 1.0 even for the highest threshold.
            # In this case 'thresholds' will be scalar.
            self.value = thresholds
        else:
            self.value = thresholds[torch.argmax(f1_score)]
        return self.value

    # ADDED METHOD
    def reset(self) -> None:
        """Clear the accumulated predictions so the next epoch starts fresh."""
        self.precision_recall_curve.reset()

Note: The same will be necessary for OptimalF1 as well.
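For the patch to take effect, something also has to call `reset()` between validation epochs. A hypothetical sketch of such a call site, using stand-in classes (the real hook location in anomalib would still need to be found):

```python
import torch

class FakeThreshold:
    """Minimal stand-in for AdaptiveThreshold (hypothetical), just enough
    to show the reset-between-epochs pattern."""

    def __init__(self):
        self.scores = []

    def update(self, preds: torch.Tensor) -> None:
        self.scores.append(preds)

    def compute(self) -> torch.Tensor:
        return torch.cat(self.scores).median()

    def reset(self) -> None:
        self.scores = []

class FakeModel:
    """Stand-in for an anomaly module holding a threshold metric."""

    def __init__(self):
        self.image_threshold = FakeThreshold()

# Hypothetical hook: compute this epoch's threshold, then clear the state
# so the next validation epoch starts from scratch.
def on_validation_epoch_end(model):
    threshold = model.image_threshold.compute()  # uses this epoch's data only
    model.image_threshold.reset()                # free it before next epoch
    return threshold

model = FakeModel()
model.image_threshold.update(torch.tensor([0.1, 0.5, 0.9]))
print(on_validation_epoch_end(model))
```

The key design point is the ordering: `compute()` must run before `reset()`, so the threshold reflects the current epoch's validation data and nothing else.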

ORippler avatar Aug 03 '22 11:08 ORippler

Hi, I added the reset code to both, but it's never called. I think there are probably other changes that need to be made, maybe in anomaly_module or somewhere else in anomalib?

davidebaltieri31 avatar Aug 03 '22 13:08 davidebaltieri31

> Hi, I added the reset code to both, but it's never called. I think there are probably other changes that need to be made, maybe in anomaly_module or somewhere else in anomalib?

Did you install anomalib in editable mode, i.e. via `pip install -e`?

If so, please provide an MWE (minimal working example) so that the bug can be localized.

ORippler avatar Aug 04 '22 06:08 ORippler

Shouldn't this one be closed?

jpcbertoldo avatar Aug 28 '22 09:08 jpcbertoldo

I believe that the original issue is not yet resolved. While the reset method has been added to the AdaptiveThreshold metric class, the method is never called. This means that for models that need to be trained for multiple epochs, the data from the previous epochs is never cleared, which in turn may lead to out-of-memory errors. It could also lead to inaccurate computation of the adaptive threshold. We will investigate this a bit further and work on a fix.

djdameln avatar Aug 29 '22 14:08 djdameln