RuntimeError: One or more background workers are no longer alive.
Hi! I'm facing "RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message." when trying to run training. To train the model on the cluster, I run the following command in a Linux terminal:
CUDA_VISIBLE_DEVICES=0 nnUNetv2_train 300 3d_fullres 0 -tr nnUNetTrainer_250epochs
The error I get when running this command is shown below:
############################
INFO: You are using the old nnU-Net default plans. We have updated our recommendations. Please consider using those instead! Read more here: https://github.com/MIC-DKFZ/nnUNet/blob/master/documentation/resenc_presets.md
############################
Using device: cuda:0
#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################
2024-06-17 16:37:10.754485: do_dummy_2d_data_aug: True
2024-06-17 16:37:10.754877: Creating new 5-fold cross-validation split...
2024-06-17 16:37:10.755827: Desired fold for training: 0
2024-06-17 16:37:10.755879: This split has 39 training and 10 validation cases.
using pin_memory on device 0
Exception in thread Thread-1:
Traceback (most recent call last):
File "/usr/lib64/python3.9/threading.py", line 980, in _bootstrap_inner
self.run()
File "/usr/lib64/python3.9/threading.py", line 917, in run
self._target(*self._args, **self._kwargs)
File "/mnt/nvme0n1p1/scratch/env_nnunetv2/lib64/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/mnt/nvme0n1p1/scratch/env_nnunetv2/lib64/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Traceback (most recent call last):
File "/mnt/nvme0n1p1/scratch/env_nnunetv2/bin/nnUNetv2_train", line 33, in <module>
sys.exit(load_entry_point('nnunetv2', 'console_scripts', 'nnUNetv2_train')())
File "/mnt/nvme0n1p1/scratch/nnUNetFrame/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/mnt/nvme0n1p1/scratch/nnUNetFrame/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
nnunet_trainer.run_training()
File "/mnt/nvme0n1p1/scratchnnUNetFrame/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1362, in run_training
self.on_train_start()
File "/mnt/nvme0n1p1/scratch/nnUNetFrame/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 903, in on_train_start
self.dataloader_train, self.dataloader_val = self.get_dataloaders()
File "/mnt/nvme0n1p1/scratch/nnUNetFrame/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 696, in get_dataloaders
_ = next(mt_gen_train)
File "/mnt/nvme0n1p1/scratch/env_nnunetv2/lib64/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
item = self.__get_next_item()
File "/mnt/nvme0n1p1/scratch/env_nnunetv2/lib64/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Any assistance to solve this error would be greatly appreciated.
Hi @TaWald,
I wanted to follow up on an inquiry I made last week. Since you receive many emails daily, I thought it might have been missed. Could you please revisit my question and assist? Any help would be appreciated. Warm regards.
Same for me. I have an A100 cluster with 28 cores and 119 GB of RAM. It works with the number of processes set to 4, but dies if I set it higher. RAM usage with 4 processes is about 17-20 GB. I also use GeeseFS to mount the drive holding the data.
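For reference, the number of data augmentation workers can be capped via the nnUNet_n_proc_DA environment variable (assuming that is still the variable your nnU-Net version reads), for example:

export nnUNet_n_proc_DA=4

before launching the training command, so the augmenter never spawns more workers than the node's RAM can sustain.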
Same issue here. After changing

def _do_i_compile(self): return False

it can start training properly, but it keeps printing 'num_processes x' to the terminal, and I don't know how to fix that properly.
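A cleaner variant of this workaround (just a sketch; the class name and file placement are my own choice, and it assumes the method name matches your nnunetv2 version) is a small custom trainer that overrides the method instead of editing the installed package:

from nnunetv2.training.nnUNetTrainer.nnUNetTrainer import nnUNetTrainer


# Hypothetical trainer: put this file somewhere under nnunetv2/training/nnUNetTrainer/
# so nnUNetv2_train can discover it, then launch training with -tr nnUNetTrainerNoCompile.
class nnUNetTrainerNoCompile(nnUNetTrainer):
    def _do_i_compile(self) -> bool:
        # Skip torch.compile for this trainer only, without patching the library in place.
        return False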
Sorry for the late answer.
Whenever you run into dataloading issues like this, you should disable multiprocessing to get a clearer error message than this obfuscated stack trace.
If you check the associated code in nnU-Net (see below), you can see that you need to set the number of processes to 0 to achieve this. This will give you a proper stack trace and let you narrow down the issue.
On a more general note, this is almost always caused by a few training cases being corrupted, which kills a preprocessing worker. I would therefore recommend re-running your preprocessing with the --clean flag to remove all previous files.
if allowed_num_processes == 0:
    mt_gen_train = SingleThreadedAugmenter(dl_tr, None)
    mt_gen_val = SingleThreadedAugmenter(dl_val, None)
else:
    mt_gen_train = NonDetMultiThreadedAugmenter(data_loader=dl_tr, transform=None,
                                                num_processes=allowed_num_processes,
                                                num_cached=max(6, allowed_num_processes // 2), seeds=None,
                                                pin_memory=self.device.type == 'cuda', wait_time=0.002)
    mt_gen_val = NonDetMultiThreadedAugmenter(data_loader=dl_val,
                                              transform=None, num_processes=max(1, allowed_num_processes // 2),
                                              num_cached=max(3, allowed_num_processes // 4), seeds=None,
                                              pin_memory=self.device.type == 'cuda',
                                              wait_time=0.002)
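In practice, allowed_num_processes can be overridden with the nnUNet_n_proc_DA environment variable, so (assuming that variable name matches your nnU-Net version) a single-process debugging run can be started like this:

nnUNet_n_proc_DA=0 CUDA_VISIBLE_DEVICES=0 nnUNetv2_train 300 3d_fullres 0 -tr nnUNetTrainer_250epochs

Once the underlying error is fixed, drop the override (or set it to a sensible worker count) and re-run the preprocessing with --clean, e.g. something along the lines of nnUNetv2_plan_and_preprocess -d 300 --clean --verify_dataset_integrity (check nnUNetv2_plan_and_preprocess -h for the exact flags in your version).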