
Background workers died and EOFError


Hi @FabianIsensee!

I have been troubleshooting nnUNet this weekend, trying to figure out why it blows up system memory. For context, I am working on our HPC cluster, where the maximum memory we are allowed to request is 64 GB. I am predicting on quite a large volume (735, 512, 512), so I have to make do with the resources I have.

The standard function predict_from_files() does not work: even with -nps and -npp set to 1, system memory blows up and the background worker gets killed by the cluster for exceeding its RAM allocation. The error I am getting is the same as in https://github.com/MIC-DKFZ/nnUNet/issues/441

I have modified the entry script for nnUNet_predict in predict_from_raw_data.py to choose either predict_from_files() or predict_from_files_sequential() depending on the input args (see the file attached and the sketch below; please let me know if I should open a PR for this).

predict_from_raw_data.py.txt
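For reference, a minimal sketch of the dispatch idea (this is not the attached file; the --sequential flag is hypothetical, and the exact keyword arguments of the nnUNetPredictor methods may differ between nnU-Net versions):

```python
# Sketch: fall back to the sequential, single-process prediction path when RAM
# is tight. The --sequential flag is hypothetical; keyword arguments may differ
# between nnU-Net versions.
import argparse
from nnunetv2.inference.predict_from_raw_data import nnUNetPredictor


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-i', dest='input_folder', required=True)
    parser.add_argument('-o', dest='output_folder', required=True)
    parser.add_argument('--sequential', action='store_true',
                        help='process one case at a time to limit RAM usage')
    args = parser.parse_args()

    predictor = nnUNetPredictor()
    # model loading omitted here, see predictor.initialize_from_trained_model_folder(...)

    if args.sequential:
        # no background workers: preprocess, predict and export case by case
        predictor.predict_from_files_sequential(args.input_folder, args.output_folder)
    else:
        # default multiprocessing path, with worker counts kept at 1 (-npp / -nps)
        predictor.predict_from_files(args.input_folder, args.output_folder,
                                     num_processes_preprocessing=1,
                                     num_processes_segmentation_export=1)


if __name__ == '__main__':
    main()
```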

This works in some cases, but for really large volumes (like this one, with around 700 slices), it still crashes.

I've narrowed the problem down to two places:

  1. This line here: https://github.com/MIC-DKFZ/nnUNet/blob/58a3b121a6d1846a978306f6c79a7c005b7d669b/nnunetv2/preprocessing/resampling/default_resampling.py#L144

I checked that the input dtype of data is already float; casting it to float again blows up the memory, and it works fine if I comment this line out (see the dtype sketch after the image below).

  2. This line here: https://github.com/MIC-DKFZ/nnUNet/blob/58a3b121a6d1846a978306f6c79a7c005b7d669b/nnunetv2/utilities/label_handling/label_handling.py#L178

The memory spikes here. I think it's a Python race condition; I need to put in a time.sleep(1) and gc.collect() (maybe just the garbage collect is needed?). I realised it runs fine in a debugger, but when I start the execution on the command line, the memory spikes here and the cluster kills the process.

[Image: memory usage traces; red box = command line run, green box = debugger run.]
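To make the scale of the cast concrete, here is a quick numpy sketch with an array of the shape above (not nnU-Net code), showing when a float cast allocates a second copy and when it does not:

```python
import numpy as np

# (735, 512, 512) float32 is already ~0.7 GB; the preprocessed, possibly
# resampled array inside nnU-Net can be considerably larger than this.
data = np.zeros((735, 512, 512), dtype=np.float32)

# the default astype always copies, so this briefly doubles the array's footprint
dup = data.astype(np.float32)
print(dup is data)             # False -> a second ~0.7 GB allocation

# with copy=False and a matching dtype, numpy returns the input array unchanged
same = data.astype(np.float32, copy=False)
print(same is data)            # True -> no new memory
```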

Please let me know your thoughts on this.

Thanks!

farrell236 avatar May 04 '25 21:05 farrell236

Hi @farrell236, thank you for looking into this. Since the images we are working with have large spatial resolution, RAM is always an issue, so your proposals are welcome here!

I am currently in contact with Fabian regarding this behavior.

sten2lu avatar May 20 '25 14:05 sten2lu

Hey @farrell236, thanks for your detailed explanation. I am a bit puzzled by your findings:

  1. https://github.com/MIC-DKFZ/nnUNet/blob/58a3b121a6d1846a978306f6c79a7c005b7d669b/nnunetv2/preprocessing/resampling/default_resampling.py#L144: astype with copy=False will not allocate new memory if the dtype is already what it is supposed to be, so you shouldn't see any increase in memory consumption here. What is really consuming the RAM in preprocessing is the resizing, due to the skimage resize implementation. There is a way to swap this out in nnU-Net, and I recommend you do so: use nnUNetPlanner_torchres for experiment planning (there are also resenc variants of this).

  2. This is just a plain numpy argmax call, which should not use that much RAM. Where do you see the race condition? When testing this line with an array shaped like your image, I couldn't find anything suspicious. Again, the big memory hog in export is the resizing, and I suspect your debugger is misattributing the line where the rise in memory occurs (a quick check is sketched below). Using the _torchres planner suggested above will also greatly reduce the memory footprint of the segmentation export.
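If you want to verify where the RAM actually rises, a quick sanity check (just a sketch using psutil, not nnU-Net code) is to print the process RSS right before and after an argmax on an array shaped like your softmax output:

```python
# Sketch: log resident memory around an argmax on an array shaped like the
# softmax output for a (735, 512, 512) volume, here assuming 3 classes.
import os
import numpy as np
import psutil


def rss_gb() -> float:
    return psutil.Process(os.getpid()).memory_info().rss / 1024**3


num_classes = 3
rng = np.random.default_rng()
probs = rng.random((num_classes, 735, 512, 512), dtype=np.float32)  # ~2.2 GB

print(f"before argmax: {rss_gb():.2f} GB")
seg = probs.argmax(0)                       # int64 result, ~1.5 GB
print(f"after argmax:  {rss_gb():.2f} GB")  # modest compared to the resizing step
```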

Best, Fabian

FabianIsensee avatar May 26 '25 07:05 FabianIsensee