PatchCore_anomaly_detection

it can train, but it can't test

Open leolv131 opened this issue 3 years ago • 7 comments

After training, testing shows the error below. How can I solve this problem? (I trained with the default parameters of the code.)

RuntimeError: CUDA out of memory. Tried to allocate 19.70 GiB (GPU 0; 8.00 GiB total capacity; 301.95 MiB already allocated; 6.18 GiB free; 326.00 MiB reserved in total by PyTorch)

leolv131 avatar Aug 23 '21 07:08 leolv131

@leolv131 I met the same issue.

Traceback (most recent call last):
  File "train.py", line 452, in <module>
    trainer.test(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 579, in test
    results = self._run(model)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 793, in dispatch
    self.accelerator.start_evaluating(self)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 99, in start_evaluating
    self.training_type_plugin.start_evaluating(trainer)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 148, in start_evaluating
    self._results = trainer.run_stage()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 804, in run_stage
    return self.run_evaluate()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1044, in run_evaluate
    eval_loop_results = self.run_evaluation()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 962, in run_evaluation
    output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 170, in evaluation_step
    output = self.trainer.accelerator.test_step(args)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/accelerators/accelerator.py", line 245, in test_step
    return self.training_type_plugin.test_step(*args)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 164, in test_step
    return self.lightning_module.test_step(*args, **kwargs)
  File "train.py", line 376, in test_step
    score_patches = knn(torch.from_numpy(embedding_test).cuda())[0].cpu().detach().numpy()
  File "train.py", line 51, in __call__
    return self.predict(x)
  File "train.py", line 76, in predict
    dist = distance_matrix(x, self.train_pts, self.p) ** (1 / self.p)
  File "train.py", line 35, in distance_matrix
    dist = torch.pow(x - y, p).sum(2)
RuntimeError: CUDA out of memory. Tried to allocate 4.82 GiB (GPU 0; 10.73 GiB total capacity; 5.10 GiB already allocated; 4.47 GiB free; 5.15 GiB reserved in total by PyTorch)

It seems that the distance matrix loaded onto CUDA causes the problem, but I don't know how to tackle it. @hcw-00 have you got any advice?

letmejoin avatar Aug 24 '21 08:08 letmejoin
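For anyone hitting the same trace: the allocation comes from distance_matrix, which broadcasts the test patches against the whole memory bank into an (n, m, d) tensor before .sum(2) reduces it. A chunked variant that walks over the test patches in blocks keeps the peak allocation bounded. This is only a sketch of the idea, not the repository's code; the chunk size and the default p=2 are illustrative.

import torch

def chunked_distance_matrix(x, y, p=2, chunk_size=512):
    # x: (n, d) test patch embeddings, y: (m, d) memory bank points.
    # Same result as broadcasting torch.pow(x - y, p).sum(2) ** (1 / p) for p = 2,
    # but only a (chunk_size, m, d) block is ever materialised at once.
    out = torch.empty(x.size(0), y.size(0), device=x.device)
    for start in range(0, x.size(0), chunk_size):
        end = min(start + chunk_size, x.size(0))
        diff = x[start:end].unsqueeze(1) - y.unsqueeze(0)  # (chunk, m, d)
        out[start:end] = diff.abs().pow(p).sum(2)
    return out.pow(1.0 / p)

Even chunked, a very large memory bank can still exceed an 8 GB card, which is where the coreset sampling ratio discussed below comes in.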

@leolv131 I met the same issue. [...] It seems that the distance matrix loaded onto CUDA causes the problem, but I don't know how to tackle it. @hcw-00 have you got any advice?

What's your input size? My input size is 224, and it needs 19 GB of GPU memory.

leolv131 avatar Aug 24 '21 08:08 leolv131

@leolv131 I found a solution: set --coreset_sampling_ratio very small, like the 0.0001 the author used. My input is 256x512.

letmejoin avatar Aug 25 '21 06:08 letmejoin
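A rough back-of-the-envelope check of why the ratio matters so much: the broadcasted tensor in distance_matrix costs about n_test_patches × n_memory_bank × embedding_dim × 4 bytes, and coreset_sampling_ratio directly controls n_memory_bank. The 784-patch grid and 1536-dim embedding below are assumptions (roughly what a WideResNet-50 backbone gives at 224×224 input), not values from this thread:

def knn_broadcast_bytes(n_test_patches, n_memory_bank, embed_dim, dtype_bytes=4):
    # Peak size of the (n, m, d) difference tensor built inside distance_matrix.
    return n_test_patches * n_memory_bank * embed_dim * dtype_bytes

# Hypothetical sizes: 28 x 28 = 784 patches per test image, 1536-dim embeddings.
for bank_size in (50_000, 5_000, 500):
    gib = knn_broadcast_bytes(784, bank_size, 1536) / 2**30
    print(f"memory bank of {bank_size:>6} points -> about {gib:.1f} GiB")

Under these assumed sizes, a bank of only a few thousand points already lands in the 10-20 GiB range, which is consistent with the allocations reported above.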

Dear all, how do I run the test after training?

NguyenDangBinh avatar Aug 27 '21 16:08 NguyenDangBinh

@leolv131 I found a solution: set --coreset_sampling_ratio very small, like the 0.0001 the author used. My input is 256x512.

Hi everyone, this method doesn't solve my problem. I think it is caused by a big pickle file: mine is 16 MB, while the pickle file for the MVTec AD dataset is about 1 MB.

I have tried setting the batch size to 1, but nothing changed. So how can I solve this problem?

XiaoPengZong avatar Sep 08 '21 06:09 XiaoPengZong
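On why batch size 1 does not help: the kNN step compares each test image's patches against the entire pickled memory bank, so its cost is independent of the dataloader batch size. One workaround is to load the bank, check its size, and run the nearest-neighbour search with torch.cdist, which never builds the (n, m, d) tensor, optionally on CPU. This is a sketch; the file name embedding.pickle, the 784-patch placeholder and k = 9 are assumptions to adjust to your own run:

import pickle
import numpy as np
import torch

# Assumed file name; point this at wherever your run saved the coreset.
with open("embedding.pickle", "rb") as f:
    memory_bank = torch.from_numpy(np.asarray(pickle.load(f))).float()
print("memory bank shape:", tuple(memory_bank.shape))

# Placeholder for one test image's patch embeddings (embedding_test in train.py).
test_patches = torch.randn(784, memory_bank.shape[1])

dists = torch.cdist(test_patches, memory_bank)       # (n, m), no (n, m, d) blow-up
score_patches = dists.topk(9, largest=False).values  # k = 9 is illustrative; use your n_neighbors setting

A 16 MB pickle also gives a sense of scale: at 4 bytes per value it holds only a few million floats, so the bank itself is small; it is the pairwise broadcast, not the pickle, that exhausts the GPU.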

Hi everyone, this method doesn't solve my problem. [...] I have tried setting the batch size to 1, but nothing changed. So how can I solve this problem?

I modified coreset_sampling_ratio when training.

leolv131 avatar Oct 12 '21 09:10 leolv131
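For later readers, the connection between the training-time flag and the test-time memory is direct: coreset_sampling_ratio is the fraction of all training patch embeddings kept in the memory bank, so a smaller ratio means a smaller distance matrix at test time. A tiny illustration (the 200 images and 28 x 28 patch grid are made-up numbers):

# Illustrative only: how coreset_sampling_ratio scales the memory bank.
total_patches = 200 * 28 * 28  # e.g. 200 training images, 28 x 28 patch grid
for ratio in (0.01, 0.001, 0.0001):
    kept = int(total_patches * ratio)
    print(f"coreset_sampling_ratio {ratio} -> about {kept} memory-bank points")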