
torch.cuda.OutOfMemoryError: CUDA out of memory.

Open · revanb88 opened this issue 1 year ago · 2 comments

```
2023-09-23 02:12:42.767428: Epoch 0
2023-09-23 02:12:42.767778: Current learning rate: 0.01
using pin_memory on device 0
Traceback (most recent call last):
  File "/data/revan/miniconda3/envs/nnUNet/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/data/revan/experiments/CMR_experiments/nnUNetFrame/nnUNet/nnunetv2/run/run_training.py", line 268, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/data/revan/experiments/CMR_experiments/nnUNetFrame/nnUNet/nnunetv2/run/run_training.py", line 204, in run_training
    nnunet_trainer.run_training()
  File "/data/revan/experiments/CMR_experiments/nnUNetFrame/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1240, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
  File "/data/revan/experiments/CMR_experiments/nnUNetFrame/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 881, in train_step
    output = self.network(data)
  File "/data/revan/miniconda3/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/revan/miniconda3/envs/nnUNet/lib/python3.9/site-packages/dynamic_network_architectures/architectures/unet.py", line 60, in forward
    return self.decoder(skips)
  File "/data/revan/miniconda3/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/revan/miniconda3/envs/nnUNet/lib/python3.9/site-packages/dynamic_network_architectures/building_blocks/unet_decoder.py", line 84, in forward
    x = self.stages[s](x)
  File "/data/revan/miniconda3/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/revan/miniconda3/envs/nnUNet/lib/python3.9/site-packages/dynamic_network_architectures/building_blocks/simple_conv_blocks.py", line 137, in forward
    return self.convs(x)
  File "/data/revan/miniconda3/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/revan/miniconda3/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/data/revan/miniconda3/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/revan/miniconda3/envs/nnUNet/lib/python3.9/site-packages/dynamic_network_architectures/building_blocks/simple_conv_blocks.py", line 71, in forward
    return self.all_modules(x)
  File "/data/revan/miniconda3/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/revan/miniconda3/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/data/revan/miniconda3/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data/revan/miniconda3/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 463, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/data/revan/miniconda3/envs/nnUNet/lib/python3.9/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 14.76 GiB total capacity; 1.02 GiB already allocated; 20.75 MiB free; 1.02 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
```

Could anyone help resolve the above issue?

revanb88 · Sep 23 '23
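
A note on the allocator hint at the end of the error message: `PYTORCH_CUDA_ALLOC_CONF` is an environment variable read by PyTorch's caching allocator, so it can be set before launching training. A minimal sketch follows; the value 128 and the dataset/configuration/fold arguments are placeholder examples, and this only helps when the problem is fragmentation (reserved memory >> allocated memory), which does not appear to be the case in the log above, where reserved and allocated are both ~1 GiB.

```bash
# Cap the size of blocks the caching allocator may split, to reduce
# fragmentation. 128 MiB is an arbitrary example value.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

# Placeholder arguments: dataset name/ID, configuration, fold.
nnUNetv2_train DATASET_NAME_OR_ID 2d 0
```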

I also encountered this error, but during inference. I tried to use 2 GPUs, but it does not appear to be working. What can I do?

Overflowu7 · Oct 26 '23
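
Regarding the two-GPU question: nnU-Net does not pool the memory of two GPUs; each prediction process must fit the model on a single device. The usual workaround is to split the cases between the cards. A sketch, assuming the `-num_parts`/`-part_id` options of `nnUNetv2_predict` (the folder and dataset names are placeholders):

```bash
# Run two independent predictors, each pinned to one GPU and each
# handling half of the input cases.
CUDA_VISIBLE_DEVICES=0 nnUNetv2_predict -i INPUT_FOLDER -o OUTPUT_FOLDER \
    -d DATASET_ID -c 2d -num_parts 2 -part_id 0 &
CUDA_VISIBLE_DEVICES=1 nnUNetv2_predict -i INPUT_FOLDER -o OUTPUT_FOLDER \
    -d DATASET_ID -c 2d -num_parts 2 -part_id 1 &
wait  # block until both halves are done
```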

Hi revanb88, it is hard to say without knowing more details, but could this issue solve your problem? https://github.com/MIC-DKFZ/nnUNet/issues/337

Kobalt93 · Jan 24 '24
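
One detail in the traceback worth checking before anything else: it reports 14.76 GiB total capacity but only 20.75 MiB free, while PyTorch itself has reserved just ~1 GiB. That combination usually means another process is occupying the card. A quick check (the device index 1 below is only an example; the training arguments are placeholders):

```bash
# List per-GPU memory usage and the processes holding it; free the card
# or pick an idle one before retrying.
nvidia-smi

# Pin nnU-Net training to a specific free GPU.
CUDA_VISIBLE_DEVICES=1 nnUNetv2_train DATASET_NAME_OR_ID 2d 0
```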