
[Bug]: EfficientAd - CUDA out of memory.

Open leemorton opened this issue 11 months ago • 7 comments

Describe the bug

Training an EfficientAd (small) model with all other parameters at default produces a CUDA out-of-memory error.

Dataset

N/A

Model

N/A

Steps to reproduce the behavior

41 good training images, 30 good validation images, and 170 bad validation images

EfficientAd(small) model with all other parameters at default
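
For reference, a minimal sketch of how such a run can be launched through the Python API. The dataset name, paths, and directory names below are placeholders for my custom dataset, and the Folder/Engine arguments are what I believe the defaults expect:

    from anomalib.data import Folder
    from anomalib.engine import Engine
    from anomalib.models import EfficientAd

    # Placeholder layout: 41 good training images, 30 good / 170 bad validation images.
    datamodule = Folder(
        name="custom_parts",             # hypothetical dataset name
        root="./datasets/custom_parts",  # hypothetical root path
        normal_dir="good",
        abnormal_dir="bad",
        train_batch_size=1,              # EfficientAd expects a train batch size of 1
    )

    model = EfficientAd(model_size="small")  # all other parameters at default
    engine = Engine(max_epochs=300)
    engine.fit(model=model, datamodule=datamodule)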

At around epoch 30 of 300 (no callbacks), roughly 8 minutes in on an RTX A5000 (24 GB), I hit this issue. The hardware should be more than sufficient. Watching the memory usage via the nvidia-smi CLI, it swings back and forth but still climbs steadily throughout training.

Setting the suggested environment variable PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True also did not help.
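
For anyone trying to narrow this down, a small sketch of how the growth could also be logged from inside training (the callback itself is just a made-up helper; torch.cuda.memory_allocated / memory_reserved are the counters that matter). Note that the allocator setting has to be in the environment before CUDA is first used:

    import os

    # Must be set before the first CUDA allocation to have any effect.
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

    import torch
    from lightning.pytorch import Callback


    class CudaMemoryLogger(Callback):
        """Hypothetical helper: print CUDA memory stats after every validation epoch."""

        def on_validation_epoch_end(self, trainer, pl_module) -> None:
            allocated = torch.cuda.memory_allocated() / 1024**2
            reserved = torch.cuda.memory_reserved() / 1024**2
            print(f"epoch {trainer.current_epoch}: allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB")

Passing this to the engine, e.g. Engine(callbacks=[CudaMemoryLogger()], max_epochs=300), should show whether the allocated memory steps up at the validation boundaries.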

OS information

OS information:

  • OS: Ubuntu 24.04.1 LTS
  • Python version: 3.10.10
  • Anomalib version: 2.0.0b2
  • PyTorch version: 2.5.0
  • CUDA/cuDNN version: 12.4
  • GPU models and configuration: RTX A5000 (24GB)
  • Using a custom dataset

Expected behavior

Training to complete

Screenshots

[screenshot of the CUDA out-of-memory error attached]

Pip/GitHub

pip

What version/branch did you use?

2.0.0b2

Configuration YAML

[screenshot of the configuration attached]

Logs

N/A

Code of Conduct

  • [x] I agree to follow this project's Code of Conduct

leemorton avatar Jan 21 '25 15:01 leemorton

I am running into the exact same issue while using Fastflow.

The jumps in allocated memory happen at the end of every validation epoch. I think I was able to track it down to BinaryPrecisionRecallCurve(Metric): something in compute() is leaking memory. I have been tinkering around a bit but have not been able to make it go away so far. Maybe someone else has a good idea?
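
A minimal sketch, outside of anomalib, that shows the accumulation I mean (with the default thresholds=None, the metric caches every batch of preds/target in Python lists until reset() is called):

    import torch
    from torchmetrics.classification import BinaryPrecisionRecallCurve

    metric = BinaryPrecisionRecallCurve()  # thresholds=None -> non-binned, list-based state

    for epoch in range(3):
        for _ in range(10):  # pretend validation batches
            metric.update(torch.rand(4096), torch.randint(0, 2, (4096,)))
        metric.compute()
        # Without a metric.reset() here, the cached batches keep piling up:
        print(f"epoch {epoch}: {len(metric.preds)} cached prediction batches")

This prints 10, 20, 30 cached batches, which is the same pattern as the memory jumps at the end of each validation epoch.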

suahelen avatar Jan 30 '25 00:01 suahelen

I think I found it: since the preds and targets are always appended in the update call, self.preds and self.target grow continuously.

Adjusting the compute call as shown below solved it for me. After the concatenation, the lists can be cleared.

    def compute(self) -> tuple[Tensor, Tensor, Tensor]:
        """Compute metric."""
        if self.thresholds is None:
            if not self.preds or not self.target:
                return torch.tensor([]), torch.tensor([]), torch.tensor([])
            # Concatenate the cached per-batch predictions and targets into one state tuple.
            state = (torch.cat(self.preds), torch.cat(self.target))

            # Workaround: clear the cached lists so they do not keep growing across epochs.
            self.preds.clear()
            self.target.clear()
        else:
            # Binned path (thresholds given): reset the confusion-matrix state instead.
            state = self.confmat
            self.confmat.zero_()

        precision, recall, thresholds = _binary_precision_recall_curve_compute(state, self.thresholds)
        return precision, recall, thresholds if thresholds is not None else torch.tensor([])

suahelen avatar Jan 30 '25 08:01 suahelen

I just saw that this is a class from torchmetrics and not anomalib, but I'll create an issue there and link it to this one.

suahelen avatar Jan 30 '25 08:01 suahelen

FYI https://github.com/Lightning-AI/torchmetrics/issues/2921

suahelen avatar Jan 30 '25 08:01 suahelen

Thanks for sharing @suahelen

samet-akcay avatar Jan 30 '25 14:01 samet-akcay

There is an error in the implementation of the OneClassPostProcessor. See this answer, which also shows the solution.

This results in an accumulation of cached preds and targets in the metrics used in the PostProcessor. The solution, as mentioned in the answer, is to call reset() after the compute() of the metrics, as sketched below.
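
A rough sketch of that pattern with a plain torchmetrics metric (generic usage, not the actual OneClassPostProcessor code):

    import torch
    from torchmetrics.classification import BinaryPrecisionRecallCurve

    metric = BinaryPrecisionRecallCurve()

    # update() is called for every validation batch ...
    metric.update(torch.rand(16), torch.randint(0, 2, (16,)))

    precision, recall, thresholds = metric.compute()
    metric.reset()  # drop the cached preds/target so state does not accumulate across epochs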

I was also able to verify that this works in my case.

suahelen avatar Feb 28 '25 07:02 suahelen

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jun 11 '25 05:06 github-actions[bot]

This issue was closed because it has been stalled for 14 days with no activity.

github-actions[bot] avatar Sep 17 '25 05:09 github-actions[bot]