
`batchgenerators` worker dies silently during `nnUNetv2_train` on Slurm HPC (Persists with Low Workers / High Memory)

mark-rustad opened this issue 8 months ago

Environment:

nnUNet version: 2.6.0

batchgenerators version: 0.25.1

PyTorch version: Custom build from source (2.8.0a0+gitf6c1cf0 - built with Conda GCC 13.3, Conda binutils)

Python version: 3.11 (via Conda environment pytorch_build_v2)

CUDA version: 12.6 (via system module)

OS: Linux (Rocky 9.2 based)

torch.compile: Enabled

Context:

I am running nnUNetv2_train on a Slurm-managed HPC cluster, in a job submitted via sbatch with the following resources allocated:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --partition=gpu
#SBATCH --gres=gpu:A100:1
#SBATCH --array=0-4

i.e., 1 A100 GPU, 16 CPUs, and 256 GiB RAM (--mem=256G) per array task.

The number of data augmentation worker processes was explicitly set low for this run:

export nnUNet_n_proc_DA=2

Command Run:

nnUNetv2_train -p nnUNetResEncUNetLPlans --npz -device cuda 511 2d  $SLURM_ARRAY_TASK_ID

(Note: The same error pattern was previously observed with 3d_fullres and nnUNet_n_proc_DA=32, and also with nnUNet_n_proc_DA=4.)

Observed Behavior:

The training script starts, the data loaders initialize, and torch.compile is reported as active. However, the process fails shortly after training begins (Epoch 0) with a RuntimeError originating from batchgenerators, indicating that background workers have died.

Crucially, no other error messages (OOM, segfault, CUDA errors, other exceptions) appear in the complete log file before this traceback, despite the error message's suggestion to look for them. Some torch._inductor UserWarnings related to online softmax appear after the initial exception is caught in the background thread, but they seem unrelated to the worker crash itself.

# --- Log Output & Anonymized Traceback ---
Using device: cuda:0
#######################################################################
# ... (citation message) ...
#######################################################################

2025-04-20 17:27:03.848707: Using torch.compile...
2025-04-20 17:27:06.744956: do_dummy_2d_data_aug: False
2025-04-20 17:27:06.745615: Using splits from existing split file: /path/to/nnUNet_preprocessed/Dataset511_FerretLiverMR/splits_final.json
2025-04-20 17:27:06.746291: The split file contains 5 splits.
2025-04-20 17:27:06.746355: Desired fold for training: 0
2025-04-20 17:27:06.746407: This split has 29 training and 9 validation cases.
using pin_memory on device 0
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
  File "/path/to/conda/envs/pytorch_build_v2/lib/python3.11/threading.py", line 1045, in _bootstrap_inner
    self.run()
  File "/path/to/conda/envs/pytorch_build_v2/lib/python3.11/threading.py", line 982, in run
    self._target(*self._args, **self._kwargs)
  File "/path/to/conda/envs/pytorch_build_v2/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/path/to/conda/envs/pytorch_build_v2/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
/path/to/dmist-nnunet/libs/pytorch/torch/_inductor/lowering.py:7057: UserWarning:
Online softmax is disabled on the fly since Inductor decides to
split the reduction. Cut an issue to PyTorch if this is an
important use case and you want to speed it up with online
softmax.

  warnings.warn(
# ... (Similar inductor warnings repeated) ...
using pin_memory on device 0

# ... (nnU-Net configuration and plan details printed) ...

2025-04-20 17:27:08.974215:
2025-04-20 17:27:08.974327: Epoch 0
2025-04-20 17:27:08.974483: Current learning rate: 0.01
Traceback (most recent call last):
  File "/path/to/conda/envs/pytorch_build_v2/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
             ^^^^^^^^^^^^^^^^^^^^
  File "/path/to/dmist-nnunet/external/nnUNet/nnunetv2/run/run_training.py", line 267, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/path/to/dmist-nnunet/external/nnUNet/nnunetv2/run/run_training.py", line 207, in run_training
    nnunet_trainer.run_training()
  File "/path/to/dmist-nnunet/external/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1371, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/conda/envs/pytorch_build_v2/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
    item = self.__get_next_item()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/path/to/conda/envs/pytorch_build_v2/lib/python3.11/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
# --- End Log Output & Anonymized Traceback ---

Expected Behavior:

The training process should continue without the data loader workers crashing.

Troubleshooting Steps Taken:

Verified that no specific error messages from the worker processes appear in the captured log before the RuntimeError.

The error persists even when reducing the number of data augmentation workers significantly (e.g., setting export nnUNet_n_proc_DA=2).

The error persists even when allocating significantly more memory (256 GiB) to the job.

Additional Information:

The silent nature of the worker failure, even with ample memory and few workers, makes a simple OOM kill seem unlikely. It may instead point to a low-level segmentation fault in a data loading/augmentation library (SimpleITK, numpy, etc.) or to a more complex interaction, possibly involving torch.compile or multiprocessing handling under Slurm.

Are there any known issues or further debugging strategies recommended for batchgenerators workers dying silently under these conditions (Slurm, torch.compile active, custom PyTorch build)?
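
One thing I plan to try next is enabling Python's faulthandler so that a native crash in a worker at least dumps a traceback instead of dying silently. Below is a minimal, nnUNet-independent sketch of the idea; the worker function and the deliberate null-pointer read are purely illustrative:

import ctypes
import faulthandler
import multiprocessing as mp

def worker():
    # With the default fork start method the handler installed in the parent is
    # inherited; enabling it again here is harmless and also covers spawn/forkserver.
    faulthandler.enable()
    ctypes.string_at(0)  # deliberate segfault, standing in for a crash in SimpleITK/numpy

if __name__ == '__main__':
    faulthandler.enable()
    p = mp.Process(target=worker)
    p.start()
    p.join()
    print('worker exit code:', p.exitcode)  # negative exit code = killed by a signal

In the Slurm job itself, the same effect should be achievable by putting export PYTHONFAULTHANDLER=1 in the sbatch script before the nnUNetv2_train call, so every Python process enables faulthandler at startup.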

Thanks!

mark-rustad, Apr 20 '25 21:04

Hi, try running it with compilation deactivated; compilation can cause these errors. You can deactivate it by setting the environment variable nnUNet_compile to False before starting the training: nnUNet_compile=False nnUNetv2_train

Originally posted by @seziegler in https://github.com/MIC-DKFZ/nnUNet/issues/2712#issuecomment-2678601281
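
(For this run that would be, e.g., `nnUNet_compile=False nnUNetv2_train -p nnUNetResEncUNetLPlans --npz -device cuda 511 2d $SLURM_ARRAY_TASK_ID`, or equivalently an `export nnUNet_compile=False` line in the sbatch script before the training command.)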

mark-rustad, Apr 20 '25 22:04

run_training.py#L277: os.environ['TORCHINDUCTOR_COMPILE_THREADS'] = 1 raises a TypeError

if __name__ == '__main__':
    os.environ['OMP_NUM_THREADS'] = '1'
    os.environ['MKL_NUM_THREADS'] = '1'
    os.environ['OPENBLAS_NUM_THREADS'] = '1'
    # reduces the number of threads used for compiling. More threads don't help and can cause problems
    os.environ['TORCHINDUCTOR_COMPILE_THREADS'] = 1
    # multiprocessing.set_start_method("spawn")
    run_training_entry()


Exception has occurred: TypeError
str expected, not int
  File "./nnunetv2/run/run_training.py", line 277, in <module>
    os.environ['TORCHINDUCTOR_COMPILE_THREADS'] = 1
    ~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: str expected, not int
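
The fix appears to be trivial: os.environ only accepts string values, so the thread count has to be assigned as a string:

os.environ['TORCHINDUCTOR_COMPILE_THREADS'] = '1'  # must be a str, not an int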

mark-rustad, Apr 22 '25 21:04

I'm also getting the same error with plenty of system memory available

zndr27, Apr 29 '25 21:04

If you look through recently opened issues, lots of people are having the same problem.

zndr27, Apr 30 '25 16:04

I have the same problem. Have you solved it? Thanks a lot.

JiahaoHuang99, Jul 14 '25 16:07

Update: Issue resolved after commit 0d04234

I can confirm that after updating my environment to include commit 0d04234, I no longer experience the immediate death of background workers on SLURM.

@JiahaoHuang99 @zndr27 - Yes, this issue appears to be resolved with the latest updates. Try pulling the latest changes and see if that resolves the problem for you as well.

mark-rustad, Jul 14 '25 22:07

Unfortunately this fix did not work for me. I'm running 2.6.2, which includes the commit mentioned. On a cluster node with 128 CPU cores and 512 GB RAM I cannot run more than nnUNet_n_proc_DA=4.

marcoduering, Aug 04 '25 15:08

I think I was able to identify the root cause of my problem. It seems related to multiprocessing issues on RHEL, which are already mentioned in #2749. When running on Ubuntu with an identical setup (Apptainer image based on Ubuntu with PyTorch and nnunetv2), there was no issue at all.

I changed run_training.py, line 277, from the commented-out # multiprocessing.set_start_method("spawn") to multiprocessing.set_start_method("forkserver", force=True) (see the snippet below).

Just uncommenting the line as described in #2749 did not have an effect; I had to use forkserver.
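
For reference, the end of my run_training.py now looks roughly like this (exact line numbers may differ between nnUNet versions):

if __name__ == '__main__':
    os.environ['OMP_NUM_THREADS'] = '1'
    os.environ['MKL_NUM_THREADS'] = '1'
    os.environ['OPENBLAS_NUM_THREADS'] = '1'
    # reduces the number of threads used for compiling. More threads don't help and can cause problems
    os.environ['TORCHINDUCTOR_COMPILE_THREADS'] = '1'  # as a string, see the TypeError above
    # was: # multiprocessing.set_start_method("spawn")
    multiprocessing.set_start_method("forkserver", force=True)
    run_training_entry()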

This improves the situation slightly, but I still need to reduce the number of workers in most situations (only with high VRAM, when the GPU is the bottleneck, is there no longer an issue on RHEL).

Any other ideas? Other than to avoid RHEL as "host" OS?

marcoduering, Aug 05 '25 12:08