nnUNet
Prediction on device was unsuccessful, probably due to a lack of memory.
Hi Fabian, hi everybody,
I am experiencing some issues I have never seen before. I am training nnUNet on VerSe2020 right now. Training seems to work perfectly fine, but during the validation at the end of the training I noticed that I run out of memory. I have 187 GB of RAM and have never had issues before. I then manually ran a prediction on just one image and got the following output:
nnUNetv2_predict -i /data/test -o /data/test_out -d 556 -c 3d_fullres --verbose
#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################
There are 1 cases in the source folder
I am process 0 out of 1 (max process ID is 0, we start counting with 0!)
There are 1 cases that I would like to predict
old shape: (379, 512, 512), new_shape: [1236 767 767], old_spacing: [3.259999990463257, 1.3671879768371582, 1.3671879768371582], new_spacing: [1.0, 0.912109375, 0.912109375], fn_data: functools.partial(<function resample_data_or_seg_to_shape at 0x7ff2a6562de0>, is_seg=False, order=3, order_z=0, force_separate_z=None)
Predicting MM256_276_0:
perform_everything_on_device: True
Input shape: torch.Size([1, 1236, 767, 767])
step_size: 0.5
mirror_axes: (0, 1, 2)
n_steps 2299, image size is torch.Size([1236, 767, 767]), tile_size [128, 128, 128], tile_step_size 0.5
steps:
[[0, 62, 123, 185, 246, 308, 369, 431, 492, 554, 616, 677, 739, 800, 862, 923, 985, 1046, 1108], [0, 64, 128, 192, 256, 320, 383, 447, 511, 575, 639], [0, 64, 128, 192, 256, 320, 383, 447, 511, 575, 639]]
move image to device cuda
preallocating results arrays on device cuda
Prediction on device was unsuccessful, probably due to a lack of memory. Moving results arrays to CPU
move image to device cpu
preallocating results arrays on device cpu
running prediction
0%| | 0/2299 [00:00<?, ?it/s]
/data/an55321/anaconda3/env/nnunetV2/lib/python3.12/site-packages/torch/nn/modules/conv.py:605: UserWarning: Plan failed with a cudnnException: CUDNN_BACKEND_EXECUTION_PLAN_DESCRIPTOR: cudnnFinalize Descriptor Failed cudnn_status: CUDNN_STATUS_NOT_SUPPORTED (Triggered internally at ../aten/src/ATen/native/cudnn/Conv_v8.cpp:919.)
return F.conv3d(
I ran several trainings on the same machine before and never had issues. I thought it might be related to a recent Nvidia driver update we did, so I also updated PyTorch and nnUNet:
...
nnunetv2 2.3.1
...
torch 2.3.0
torchaudio 2.3.0
torchvision 0.18.0
Any help would be appreciated.
Thanks, André
Hi André @elpequeno,
did you happen to run watch -n 0.1 nvidia-smi during the inference of this (rather large) volume? It indeed seems to be quite memory-intensive, given the large number of steps required with the 128³ tiles. This might be a case where it would make sense to reduce the tile_step_size. Please let me know whether the behaviour changes with a different overlap!
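For reference, here is a rough sketch (not nnUNet's actual code, just an approximation of the step computation shown in the log) of how the 2299 tiles follow from the 128³ tile size and the relative step size of 0.5 that -step_size controls:

```python
import math

def sliding_window_steps(image_size, tile_size, tile_step_size):
    """Per-axis tile start positions for a given relative step size.
    Approximates the sliding-window step computation reported in the log;
    this is not nnUNet's exact implementation."""
    steps = []
    for img, tile in zip(image_size, tile_size):
        target_step = tile * tile_step_size            # step in voxels
        n = math.ceil((img - tile) / target_step) + 1  # tiles along this axis
        max_start = img - tile
        actual_step = max_start / (n - 1) if n > 1 else 0
        steps.append([round(actual_step * i) for i in range(n)])
    return steps

# The case from the log: 1236 x 767 x 767 image, 128^3 tiles, step size 0.5
steps = sliding_window_steps((1236, 767, 767), (128, 128, 128), 0.5)
n_tiles = math.prod(len(s) for s in steps)
print([len(s) for s in steps], n_tiles)   # [19, 11, 11] -> 2299 tiles
```

Per axis, a new tile starts roughly every tile_size * tile_step_size voxels, which is why the tile count grows so quickly for a volume resampled to 1236 × 767 × 767.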
Hi Gregor,
thank you for your response. Yes, I watched nvidia-smi during the inference. The inference usually takes a long time (several minutes) between
There are 1 cases that I would like to predict
and
old shape: (379, 512, 512), new_shape: [1236 767 767], old_spacing: [3.259999990463257, 1.3671879768371582, 1.3671879768371582], new_spacing: [1.0, 0.912109375, 0.912109375], fn_data: functools.partial(<function resample_data_or_seg_to_shape at 0x7ff2a6562de0>, is_seg=False, order=3, order_z=0, force_separate_z=None)
Only after that do I see any activity on the GPU (I guess that is normal). Then what I see looks similar to the following. Memory usage is usually around 5000 MiB and GPU-Util jumps up and down, but is close to 100% most of the time.
Every 0.1s: nvidia-smi itr-gpu01: Fri May 3 04:33:59 2024
Fri May 3 04:33:59 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 Tesla V100S-PCIE-32GB On | 00000000:3B:00.0 Off | 0 |
| N/A 38C P0 219W / 250W | 5033MiB / 32768MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 Tesla V100S-PCIE-32GB On | 00000000:D8:00.0 Off | 0 |
| N/A 24C P0 24W / 250W | 3MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 0 N/A N/A 3564877 C ...1/anaconda3/env/nnunetV2/bin/python 5022MiB |
+-----------------------------------------------------------------------------------------+
I tried -step_size 0.1 and -step_size 0.3 but did not see much difference. However, I did see this line in the output: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 39.28 GiB. GPU ... So I think the behaviour of nnUNet is correct, I am just not sure how to deal with it.
Hi @elpequeno,
I think your best bet to reduce memory consumption is to divide this large volume into chunks and then predict those separately with nnUNet before merging them. Luckily, @Karol-G has written a nice tool that helps in such cases: https://github.com/MIC-DKFZ/patchly. You can take a look at the README for examples of how to use it!
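In case it helps to see the general idea: below is a minimal, hypothetical sketch of chunking along one axis with an overlap margin and stitching the per-chunk predictions back together. It is not patchly's API (the README linked above shows the real, much more robust usage), and predict_fn is a placeholder for whatever runs nnUNet on a sub-volume and returns a label map of the same spatial shape.

```python
import numpy as np

def predict_in_chunks(volume, predict_fn, chunk=256, margin=32):
    """Illustrative chunk-and-merge along the first axis only.
    predict_fn stands in for running nnUNet on a sub-volume and must
    return a label map with the same spatial shape as its input."""
    z = volume.shape[0]
    out = np.zeros(volume.shape, dtype=np.uint8)
    for start in range(0, z, chunk):
        lo = max(0, start - margin)          # padded chunk bounds
        hi = min(z, start + chunk + margin)
        pred = predict_fn(volume[lo:hi])     # segment the padded chunk
        keep_lo = start - lo                 # strip the padding again
        keep_hi = keep_lo + min(chunk, z - start)
        out[start:start + keep_hi - keep_lo] = pred[keep_lo:keep_hi]
    return out
```

The overlap margin is only there so that voxels near chunk borders still see enough context; patchly takes care of this kind of bookkeeping for you on arbitrary patch grids.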
Hi André @elpequeno,
just checking in. Did the patchly recommendation help resolve the OOM error?
Hi @HussainAlasmawi, please move your case to its own issue in order to keep issues clean and readable. If there's a connection between issues, you can always add a link to a potentially related issue.
Since this issue has been stale for a while, I'll close it for now. Feel free to re-open it if you still face this problem.