[Bug]: Tiling with PADiM crashes during training
Describe the bug
Here is the code for training:
```python
from pathlib import Path

import numpy as np
from lightning.pytorch.callbacks import EarlyStopping, ModelCheckpoint
from PIL import Image
from torchvision import transforms
import torch

from anomalib.data import Folder
from anomalib.engine import Engine
from anomalib.models import Padim
from anomalib import TaskType
from anomalib.callbacks import TilerConfigurationCallback

dataset_root = Path.cwd() / "ad-half-data" / "up"
task = TaskType.SEGMENTATION

datamodule = Folder(
    root=dataset_root,
    name="phone-half",
    normal_dir="good-1024-s",
    abnormal_dir="flaw-1024",
    mask_dir="mask/flaw-1024",
    train_batch_size=1,
    eval_batch_size=1,
    num_workers=30,
    image_size=(1024, 1024),
    task=task,
)

model = Padim(backbone="wide_resnet50_2", pre_trained=True, n_features=550)

callbacks = [
    ModelCheckpoint(
        mode="max",
        monitor="pixel_F1Score",
    ),
    EarlyStopping(
        monitor="pixel_F1Score",
        mode="max",
        patience=3,
    ),
    TilerConfigurationCallback(
        enable=True,
        tile_size=256,
        stride=256,
    ),
]

engine = Engine(
    callbacks=callbacks,
    pixel_metrics=["F1Score", "AUROC"],
    accelerator="auto",  # <"cpu", "gpu", "tpu", "ipu", "hpu", "auto">
    devices=1,
    logger=False,
)

engine.train(datamodule=datamodule, model=model)
```
I have 150 good images in good-1024-s for training. After I run this script, the SSH connection drops and the machine appears to crash for some reason, without any error message.
Dataset
Folder
Model
PADiM
Steps to reproduce the behavior
Run the code above with the same 150 images at 1024x1024.
OS information
OS information:
- OS: [e.g. Ubuntu 20.04]
- Python version: [e.g. 3.10.0]
- Anomalib version: [e.g. 0.3.6]
- PyTorch version: [e.g. 2.3.0]
- CUDA/cuDNN version: [e.g. 11.8]
- GPU models and configuration: GeForce RTX 4090
- Any other relevant information: I'm using a custom dataset with 1024x1024 images
Expected behavior
Training is expected to complete without crashing.
Screenshots
No response
Pip/GitHub
GitHub
What version/branch did you use?
No response
Configuration YAML
None
Logs
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Can you try it with a lower resolution or fewer images? It can be an out-of-memory error.
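For example, a quick way to test this with the same Folder datamodule as above (the 512x512 value below is only an illustration, not a recommendation):

```python
# Sketch: same datamodule as in the report, but with a lower input resolution
# to reduce the memory footprint. The 512x512 value is only an example.
datamodule = Folder(
    root=dataset_root,
    name="phone-half",
    normal_dir="good-1024-s",
    abnormal_dir="flaw-1024",
    mask_dir="mask/flaw-1024",
    train_batch_size=1,
    eval_batch_size=1,
    num_workers=30,
    image_size=(512, 512),  # halved from the original (1024, 1024)
    task=task,
)
```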
> Can you try it with a lower resolution or fewer images? It can be an out-of-memory error.
Yeah, I tried training with 30 images (1024x1024); the SSH connection didn't crash, and I got the following logs:
```
Traceback (most recent call last):
  File "/home/lzd/patchcore-inspection/anomalib/train-padim.py", line 56, in <module>
    engine.train(datamodule=datamodule, model=model)
  File "/home/lzd/patchcore-inspection/anomalib/anomalib-src/src/anomalib/engine/engine.py", line 863, in train
    self.trainer.fit(model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 987, in _run
    results = self._run_stage()
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/trainer/trainer.py", line 1033, in _run_stage
    self.fit_loop.run()
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 205, in run
    self.advance()
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/fit_loop.py", line 363, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 141, in run
    self.on_advance_end(data_fetcher)
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/training_epoch_loop.py", line 295, in on_advance_end
    self.val_loop.run()
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 114, in run
    self.on_run_start()
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 244, in on_run_start
    self._on_evaluation_start()
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/loops/evaluation_loop.py", line 290, in _on_evaluation_start
    call._call_lightning_module_hook(trainer, hook_name, *args, **kwargs)
  File "/home/lzd/miniconda3/envs/anomalib/lib/python3.10/site-packages/lightning/pytorch/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/lzd/patchcore-inspection/anomalib/anomalib-src/src/anomalib/models/components/base/memory_bank_module.py", line 37, in on_validation_start
    self.fit()
  File "/home/lzd/patchcore-inspection/anomalib/anomalib-src/src/anomalib/models/image/padim/lightning_model.py", line 86, in fit
    self.stats = self.model.gaussian.fit(embeddings)
  File "/home/lzd/patchcore-inspection/anomalib/anomalib-src/src/anomalib/models/components/stats/multi_variate_gaussian.py", line 136, in fit
    return self.forward(embedding)
  File "/home/lzd/patchcore-inspection/anomalib/anomalib-src/src/anomalib/models/components/stats/multi_variate_gaussian.py", line 117, in forward
    covariance = torch.zeros(size=(channel, channel, height * width), device=device)
RuntimeError: [enforce fail at alloc_cpu.cpp:117] err == 0. DefaultCPUAllocator: can't allocate memory: you tried to allocate 79298560000 bytes. Error code 12 (Cannot allocate memory)
```
I'm using tiling because the model doesn't perform well on high-resolution images, but tiling doesn't seem to be well supported for PADiM; I can tile successfully with PatchCore.
Does Padim work if you use it without tiling? It could be just different memory requirements for Padim and PatchCore.
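For example, a minimal way to test this is to run the same script with the tiler callback simply left out:

```python
# Sketch: same callbacks as in the report, minus TilerConfigurationCallback,
# so no tiling is configured and Padim runs on the full-resolution images.
callbacks = [
    ModelCheckpoint(mode="max", monitor="pixel_F1Score"),
    EarlyStopping(monitor="pixel_F1Score", mode="max", patience=3),
]
```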
I think this is indeed an out-of-memory issue, but it's rather unusual that PatchCore works and Padim doesn't.
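For reference, the failed allocation in the traceback is consistent with the covariance buffer Padim builds at this resolution (the 256x256 feature-map size below is an assumption based on the 1024x1024 input and the wide_resnet50_2 backbone):

```python
# Back-of-the-envelope check of the allocation reported in the traceback.
# Assumes a 256x256 feature map for a 1024x1024 input and float32 storage.
n_features = 550                 # n_features passed to Padim()
height_times_width = 256 * 256   # assumed feature-map resolution
bytes_per_float = 4              # float32

covariance_bytes = n_features * n_features * height_times_width * bytes_per_float
print(covariance_bytes)          # 79298560000 bytes (~74 GiB), matching the error
```

Note that this buffer scales with the feature-map area and the square of n_features, not with batch size, which is why a batch size of 1 doesn't help here.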
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue was closed because it has been stalled for 14 days with no activity.