[Bug]: EfficientAd - CUDA out of memory.
Describe the bug
Training an EfficientAd (small) model with all other parameters at their defaults produces a CUDA out-of-memory error.
Dataset
N/A
Model
N/A
Steps to reproduce the behavior
41 good training images, 30 good validation images, and 170 bad validation images.
EfficientAd (small) model with all other parameters at their defaults.
At around epoch 30 of 300 (no callbacks), roughly 8 minutes in on an RTX A5000 (24 GB), I hit this issue. It feels like the hardware should be sufficient. Watching memory usage via the nvidia-smi CLI, it swings back and forth but still climbs throughout training.
Setting the suggested environment variable PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True also did not help.
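For completeness, one way to apply that variable from Python (it has to be set before the CUDA caching allocator is initialised, i.e. before any CUDA work; this is only a sketch of that setup, not the exact training script):

import os

# Must be set before the CUDA caching allocator is initialised,
# i.e. before the first CUDA allocation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch  # noqa: E402

print(torch.cuda.is_available())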
OS information
- OS: Ubuntu 24.04.1 LTS
- Python version: 3.10.10
- Anomalib version: 2.0.0b2
- PyTorch version: 2.5.0
- CUDA/cuDNN version: 12.4
- GPU models and configuration: RTX A5000 (24GB)
- Using a custom dataset
Expected behavior
Training to complete
Screenshots
Pip/GitHub
pip
What version/branch did you use?
2.0.0b2
Configuration YAML
Logs
N/A
Code of Conduct
- [x] I agree to follow this project's Code of Conduct
I am running into the exact same issue while using Fastflow.
The jumps in allocated memory happen at the end of every validation epoch. I think I was able to track it down to BinaryPrecisionRecallCurve(Metric).
Something in compute() is leaking memory. I have been tinkering around a bit but have not been able to make it go away so far. Maybe someone else has a good idea?
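Something like the following callback makes the per-validation-epoch jump visible (rough sketch only; CudaMemoryLogger is just an illustrative name, and it assumes the lightning.pytorch Callback API that the trainer accepts):

import torch
from lightning.pytorch import Callback


class CudaMemoryLogger(Callback):
    """Print allocated/reserved CUDA memory after every validation epoch."""

    def on_validation_epoch_end(self, trainer, pl_module) -> None:
        allocated = torch.cuda.memory_allocated() / 1024**2
        reserved = torch.cuda.memory_reserved() / 1024**2
        print(
            f"epoch {trainer.current_epoch}: "
            f"allocated={allocated:.0f} MiB, reserved={reserved:.0f} MiB"
        )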
I think I found it:
Since the preds and targets are always appended in the update() call, self.preds and self.target grow continuously.
Adjusting compute() as follows solved it for me: after the concatenation, the lists can be cleared.
def compute(self) -> tuple[Tensor, Tensor, Tensor]:
    """Compute metric."""
    if self.thresholds is None:
        if not self.preds or not self.target:
            return torch.tensor([]), torch.tensor([]), torch.tensor([])
        state = (torch.cat(self.preds), torch.cat(self.target))
        # Clear the accumulated lists so they do not keep growing
        # across validation epochs.
        self.preds.clear()
        self.target.clear()
    else:
        state = self.confmat
        self.confmat.zero_()
    precision, recall, thresholds = _binary_precision_recall_curve_compute(state, self.thresholds)
    return precision, recall, thresholds if thresholds is not None else torch.tensor([])
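For reference, the accumulation itself can be reproduced outside anomalib in a few lines (minimal sketch against torchmetrics' BinaryPrecisionRecallCurve with thresholds=None; the numbers are arbitrary dummy data):

import torch
from torchmetrics.classification import BinaryPrecisionRecallCurve

metric = BinaryPrecisionRecallCurve(thresholds=None)

for epoch in range(3):
    # one "validation epoch" worth of updates
    preds = torch.rand(1000)
    target = torch.randint(0, 2, (1000,))
    metric.update(preds, target)
    metric.compute()
    # the cached list state keeps growing because compute() never clears it
    print(f"epoch {epoch}: cached batches = {len(metric.preds)}")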
I just saw that this is a class from torchmetrics and not anomalib, but I'll create an issue there and link it to this one.
FYI https://github.com/Lightning-AI/torchmetrics/issues/2921
Thanks for sharing @suahelen
There is an error in the implementation of the OneClassPostProcessor. See this answer, which also shows the solution.
This results in an accumulation of cached preds and targets in the metrics used by the post-processor. The solution, as mentioned in the answer, is to call reset() after compute() on the metrics.
I was also able to verify that this works in my case.
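For anyone hitting the same thing, the workaround boils down to something like this (sketch only; compute_and_reset is a made-up helper, not an anomalib or torchmetrics API):

import torch
from torchmetrics.classification import BinaryPrecisionRecallCurve


def compute_and_reset(metric: BinaryPrecisionRecallCurve) -> tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
    """Compute the PR curve, then reset the metric so its cached
    preds/target lists do not grow across validation epochs."""
    precision, recall, thresholds = metric.compute()
    metric.reset()  # drops the accumulated state
    return precision, recall, thresholds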
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue was closed because it has been stalled for 14 days with no activity.