[Bug] Validation fails with OOM on GPU and ArrayMemoryError on CPU fallback (nnU-Net v2, Windows 10, RTX 3060, 16GB RAM)
Description
When running validation (nnUNetv2_train 11 3d_fullres 0 --val --npz) on my dataset, the process fails due to memory issues:
- First, GPU (RTX 3060, 12GB VRAM) runs out of memory.
- Then nnU-Net automatically falls back to CPU, but CPU RAM (16 GB) also runs out and crashes with numpy._core._exceptions._ArrayMemoryError.
This happens during prediction/validation, not during training; training completes successfully. The relevant part of the log:
Prediction on device was unsuccessful, probably due to a lack of memory. Moving results arrays to CPU
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB ...
numpy._core._exceptions._ArrayMemoryError: Unable to allocate 738. MiB for an array with shape (3, 123, 512, 512) and data type int64
Environment
- OS: Windows 10
- Python: 3.12
- nnU-Net: v2 (installed via pip install nnunetv2)
- GPU: NVIDIA RTX 3060 (12GB VRAM)
- CPU RAM: 16 GB
- CUDA/cuDNN: CUDA 12.6, cuDNN 9.1.0.2
- PyTorch: 2.8.0+cu126
Dataset Info
- Dataset: Custom liver tumor dataset (LiTS + CRLM + others merged)
- Typical volume size: ~512 × 512 × N voxels
- Plans used: 3d_fullres, patch_size = [64, 192, 192], batch_size = 2
Command
nnUNetv2_train 11 3d_fullres 0 --val --npz
Hi ayusakoc,
This is my first GitHub post and I am a newbie to medical image segmentation - be nice!
I'm seeing similar issues to you. I am using paid runtime in Google Colab (1x A100 GPU with 40GB GPU RAM and 80GB system CPU RAM).
I've already tried the threadpool_limits fix proposed in Pull Request #2910, made the same changes for threadpool_limits in batchgenerators_data_loader.py, and applied the fixes from Issues #133 and #2881. My A100 variant has 12 CPU cores and runs OK with the settings below (the default value is 1 if not explicitly set by the user):
- OMP_NUM_THREADS=2
- MKL_NUM_THREADS=2
- OPENBLAS_NUM_THREADS=2

I set nnUNet_n_proc_DA to 12, which is what the default would be anyway for my runtime; reducing it slows things down but doesn't stop the OOM issues. Prediction args are set at -npp=1 and -nps=4 to work for my GPU runtime; if these are too high you will get other OOM issues.
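For anyone reproducing this on Colab, here is a minimal notebook-cell sketch of the setup described above, assuming the environment variables are set before torch/nnU-Net are first imported so worker processes inherit them. The dataset id, fold, and paths are placeholders.

```python
import os

# Limit BLAS/OpenMP thread pools; set these before torch / nnU-Net are imported
# so the background worker processes inherit them.
os.environ["OMP_NUM_THREADS"] = "2"
os.environ["MKL_NUM_THREADS"] = "2"
os.environ["OPENBLAS_NUM_THREADS"] = "2"
os.environ["nnUNet_n_proc_DA"] = "12"   # default for a 12-core A100 runtime

# Prediction with conservative worker counts (-npp preprocessing workers,
# -nps segmentation export workers). Dataset id 11, fold 0 and paths are placeholders.
os.system(
    "nnUNetv2_predict -i /content/imagesTs -o /content/predictions "
    "-d 11 -c 3d_fullres -f 0 -npp 1 -nps 4"
)
```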
Clearly there's more to this story.
So far, this is what I think is happening:
- The general problem seems to be related to how workers are released for new tasks when they reach the end of prediction, especially when running on CPU. System CPU RAM fills up and crashes because the garbage collection calls focus on clearing GPU RAM only. The background workers are not being reassigned and just sit there: changing from threadpool_limits to a global mutex stops the "background workers no longer alive" check from throwing an error, but the core problem of workers not being reassigned to downstream functions may still be there when running on system RAM.
- When running prediction, prediction itself completes, but the CPU RAM resource usage graph shows a big spike, then flatlines at about 25% capacity and sits there, never returning results. I tried leaving this running for more than an hour and still got nothing. Pursuing the idea that the problem lies with worker reassignment, I tried adding timeouts at data_iterators.py line 410 and predict_from_raw_data.py line 412, but still didn't get a prediction because that kills workers too early and returns blank results. The resource graph does stop flatlining and prediction DOES complete - that's progress for a newbie like me who actually tried to run validation and inference on system RAM only, but it's not a fix yet.
- When running validation, there could be 2 separate problems interacting with each other. Prediction doesn't complete on GPU and moves to CPU. The next validation case starts prediction regardless and falls into the same trap, filling up the system RAM until it crashes. If I set device = 'cpu', it fills up and crashes on 3 abdominal scans faster than with device = 'cuda', because that bypasses attempting to predict on GPU first, failing, then moving the arrays to the CPU to continue. So my next things to look at will be why 40GB of GPU RAM runs out of memory for 1 smallish scan, and whether moving the timeout block downstream of resampling makes a difference to whether any prediction gets saved.
If I find any solutions I'll report back.
After more than 1 week of some crazy logging and debugging efforts, my conclusions: 1) It's not just about GPU RAM - CPU RAM is also important for prediction. 2) There's still a separate problem with the training validation loop, even when performing validation separately from training.
As I said, I am a newbie to programming and medical image analysis, so I was guaranteed to make some newbie mistakes. I'm posting about my experiences so at least others can avoid wasting time.
- Dataset: PANCREAS-CT
- Scan size: 512 x 512 pixels x 260-310 slices
- Voxel size: 0.8mm x 0.8mm x 1.5mm
- Encoder: nnUNetResEncUNetMPlan
- Config: 3d_fullres
- Patch size: 64 x 128 x 96 - this is smaller than the default patch size calculated by planning and preprocessing; I tried it to see if it could reduce RAM spikes before I went to the A100 80GB GPU, and it didn't.
- Compute credit source: Google Colab Pay-As-You-Go credits, no Pro subscription initially
- GPU for training: A100 40GB GPU variant, vanilla nnU-Net; I only started monkeying with environment variables once I couldn't get validation or prediction to work
- GPU for prediction: NVIDIA A100-SXM4-80GB, CUDA compilation tools release 12.5, V12.5.82, Build cuda_12.5.r12.5/compiler.34385749_0
- Environment: vanilla nnUNetv2 git clone from master with no mods, Python notebooks
- Environment variables / prediction flags:
  - os.environ["nnUNet_n_proc_DA"] = str(os.cpu_count()) # this defaults to 12 for an A100 runtime
  - os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
  - nnUNet_def_n_proc=2
  - OMP_NUM_THREADS=1
  - MKL_NUM_THREADS=1
  - OPENBLAS_NUM_THREADS=1
  - in the manual prediction command: num_processes_preprocessing=2 and num_processes_segmentation_export=1
Prediction prints the "done with prediction filename.nii.gz" message, but the code never stops running, not even if you leave it for hours. If your compute environment doesn't have enough CPU RAM, the OS will kill workers in an attempt to prevent the runtime crashing entirely. However, if the OS-killed worker happens to be in the export_pool defined in predict_from_raw_data.py, it zombifies and never gets past line 412 (ret = [i.get()[0] for i in r]). It's a silent hang that I discovered through insane levels of logging. See the attached example of one of my 40GB A100 log files. Sadly, the log (and the graphs) cut out before the period where case 2 begins preallocating the results arrays simultaneously with case 1 file export and resampling - my code broke when I upgraded it for more automatic logging and I have to fix it before I can post any definitive results. Zombies happen when preprocessing for case 2 and the export pool for exporting a prediction for case 1 activate simultaneously - there's no pool-to-pool coordination. I even tried prediction with -npp 1 and -nps 1 on the A100 40GB and zombification still happens.
trace_log13_manual_nnUNetlabel_handling4.txt
What you can see from the CPU RAM usage graph is that PID 1710 is the Python notebook kernel, and it's holding onto data until it stops running. This isn't helpful if the second case hangs, as BOTH predictions are lost - including the first prediction, which is likely to be workable. I tried adding a timeout at line 412 - the longest I tried was about 5 minutes. Once the process terminates it dumps most of the CPU RAM - including the predictions, leaving me with exactly zero results.
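A minimal sketch of such a timeout guard around the quoted line, assuming r is a list of multiprocessing AsyncResult objects from the export pool (the helper name is hypothetical). It turns the silent hang into a visible skip, but as noted above it does not rescue the predictions that were pending when the pool is torn down.

```python
import multiprocessing

def collect_export_results(async_results, timeout_s=300):
    """Collect export-pool results with a per-case timeout (hypothetical helper).

    Mirrors the quoted `ret = [i.get()[0] for i in r]`: if an export worker was
    OOM-killed by the OS, AsyncResult.get(timeout=...) raises
    multiprocessing.TimeoutError instead of hanging forever.
    """
    ret = []
    for res in async_results:
        try:
            ret.append(res.get(timeout=timeout_s)[0])
        except multiprocessing.TimeoutError:
            print("Export worker did not return in time - likely OOM-killed; skipping this case.")
            ret.append(None)
    return ret
```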
Prediction shows heaps of CPU RAM activity with GPU RAM nearly idle, despite setting -device 'cuda'. That's normal, unfortunately. As a newbie I wondered why I was paying for a large GPU and not using GPU RAM - it's offensive to my inner cheapskate. The key point is that many preprocessing and export operations are confined to CPU RAM because of numpy and SimpleITK. I also learned that clear_cache() only works on GPU RAM while I was challenging the code to clean up system RAM quicker. Having only used the A100 40GB, it looked like I was just triggering OS worker-killing at CPU RAM spikes of about 80GB.

One of the ways I tried to get rid of the CPU RAM spikes was by splitting up, and adding aggressive gc.collect() calls around, anywhere the code was creating this type of thing: prediction = self.predict_logits_from_preprocessed_data(torch.from_numpy(data)).cpu().numpy(). I did manage to get rid of some CPU RAM spikes by separating the above into separate lines for .cpu() and .numpy(), then deleting the one that wasn't needed. It made a mess of the lovely nnU-Net code, but it was worth trying based on the info I had at the time. However, the RAM spike that produces a zombie worker during case 2 is probably from write_seg, called in the export_prediction_from_logits() function in export_predictions.py.
This was confirmed when I upgraded to Colab Pro and ran prediction on 2 cases in about 40mins with no problems - see resource use trace below. Note that the resource graph tops out at about 140GB during the phase when preprocessing case 2 overlaps saving files for case 1.
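For reference, here is a minimal sketch of the .cpu()/.numpy() split described above, wrapped in a hypothetical helper (predict_fn stands in for predict_logits_from_preprocessed_data); the exact placement in the nnU-Net code differs between versions.

```python
import gc
import torch

def logits_to_numpy(predict_fn, data_np):
    """Move logits to CPU in explicit steps so the GPU tensor can be released
    before the numpy copy sticks around (hypothetical helper)."""
    logits_gpu = predict_fn(torch.from_numpy(data_np))
    logits_cpu = logits_gpu.detach().cpu()
    del logits_gpu                   # drop the last reference to the GPU tensor
    if torch.cuda.is_available():
        torch.cuda.empty_cache()     # return cached VRAM to the driver
    out = logits_cpu.numpy()         # note: shares memory with logits_cpu
    del logits_cpu                   # drop the torch wrapper; the storage lives on via `out`
    gc.collect()                     # nudge Python to release anything now unreferenced
    return out
```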
I'll post again if I get my memory logger working again and can give better traces. Most of my work was done profiling the functions activated by variant 2 here: nnUNet/nnunetv2/inference/examples.py lines 10-26 and 34-43. As part of my efforts to jailbreak results I also tried the run_sequential option by setting -npp 0 and -nps 0 - it uses some different functions than variant 2, and I also didn't have time to add garbage collection beyond predict_from_sequential, so it tended to hit RAM spikes sooner.
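For anyone who wants to profile the same path, variant 2 of examples.py boils down to roughly the following. This is a sketch only: constructor and method parameter names (e.g. perform_everything_on_device) can differ between nnU-Net v2 releases, and the dataset id, fold, and all paths are placeholders.

```python
import torch
from nnunetv2.inference.predict_from_raw_data import nnUNetPredictor

predictor = nnUNetPredictor(
    tile_step_size=0.5,
    use_gaussian=True,
    use_mirroring=True,
    perform_everything_on_device=True,   # older releases name this perform_everything_on_gpu
    device=torch.device('cuda', 0),
    verbose=False,
)
predictor.initialize_from_trained_model_folder(
    '/content/nnUNet_results/Dataset011_Liver/nnUNetTrainer__nnUNetPlans__3d_fullres',  # placeholder
    use_folds=(0,),
    checkpoint_name='checkpoint_final.pth',
)
predictor.predict_from_files(
    '/content/imagesTs',                    # placeholder input folder
    '/content/predictions',                 # placeholder output folder
    save_probabilities=False,
    overwrite=True,
    num_processes_preprocessing=2,          # CLI equivalent: -npp
    num_processes_segmentation_export=1,    # CLI equivalent: -nps
)
```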
Hi everyone, after reading the thread from the top again: I tried validation on the A100 80GB GPU once yesterday and can confirm it still crashed my runtime. If you want me to open a separate issue for crashing during prediction, let me know.
I have some insanely large 3D images (10-60 GB in size). I run into the same issue with these on an RTX Pro 6000 with 96GB of VRAM.
I've just assumed it's my insanely large files as these are (~2000, ~2000, ~2000) in shape.
I am certainly interested in this issue.
Edit: Next week, I think I am going to try to copy @o84339 with the same datasets. I have two systems with 500GB and 750GB of system RAM, and I run into these exact issues with my current dataset. If I test with the MSD dataset, I'm curious whether I can provide any other useful info.