
Error while running nnUNet training: RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message

Open · elena-mulero opened this issue 1 year ago · 8 comments

Hello everyone and thank you for providing the nnUNet code,

I am having problems when running the training. I am using an external server running Rocky Linux 8.9 with Python 3.12.3 and CUDA 12.1.0. I installed the environment as described: I first installed a compatible PyTorch package and then set up the nnU-Net paths in the environment. The preprocessing command ran without problems on both custom datasets I am using. For training I tried different GPUs (A100, A40, V100 and T4), but I always get the same error:

"/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py:164: FutureWarning: torch.cuda.amp.GradScaler(args...) is deprecated. Please use torch.amp.GradScaler('cuda', args...) instead.
  self.grad_scaler = GradScaler() if self.device.type == 'cuda' else None
/path/nnUNetv2/nnunetv2-venv/lib/python3.12/site-packages/torch/optim/lr_scheduler.py:62: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
  warnings.warn(
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
  File "/apps/Arch/software/Python/3.12.3-GCCcore-13.3.0/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
Traceback (most recent call last):
  File "/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/run/run_training.py", line 285, in <module>
    self.run()
  File "/apps/Arch/software/Python/3.12.3-GCCcore-13.3.0/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/path/nnUNetv2/nnunetv2-venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    run_training_entry()
  File "/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/run/run_training.py", line 275, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/run/run_training.py", line 211, in run_training
    nnunet_trainer.run_training()
  File "/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1377, in run_training
    val_outputs.append(self.validation_step(next(self.dataloader_val)))
                                            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/path/nnUNetv2/nnunetv2-venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 196, in __next__
    item = self.__get_next_item()
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/path/nnUNetv2/nnunetv2-venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 181, in __get_next_item
    raise e
  File "/path/nnUNetv2/nnunetv2-venv/lib/python3.12/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message"

Sometimes the error appears immediately, but other times training runs for several epochs before crashing. I tried all configurations and different folds, and in every case I got the error. I also tried reducing the number of data-augmentation workers with export nnUNet_n_proc_DA=X and set export nnUNet_compile=f as recommended in some other issues, but the problem persists.
With the 2d configuration on a T4 and a single worker I got as far as epoch 914, but training then crashed again with the same error.
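
For reference, this is roughly how I launch the runs, with those environment variables set before training starts. This is only a minimal sketch; the dataset ID, configuration and fold below are placeholders, not my actual values:

```python
import os
import subprocess

# Minimal sketch: set the environment variables mentioned above, then call the
# nnUNetv2_train entry point. Dataset ID ("137"), configuration ("3d_fullres")
# and fold ("0") are placeholders.
env = os.environ.copy()
env["nnUNet_n_proc_DA"] = "4"   # reduce the number of data-augmentation workers
env["nnUNet_compile"] = "f"     # disable compilation, as suggested in other issues

# nnUNetv2_train takes dataset name/ID, configuration and fold as positional arguments
subprocess.run(["nnUNetv2_train", "137", "3d_fullres", "0"], env=env, check=True)
```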

Could you please help me understand what the problem is or where it comes from? I would appreciate any recommendations.

Thank you!

elena-mulero · Dec 10 '24 17:12

+1

sunyan1024 · Dec 19 '24 08:12

+1

tjhendrickson · Dec 19 '24 15:12

+1

MauroLeidi · Jan 22 '25 14:01

Try nnUNet_compile=False nnUNetv2_train --val
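
In case the one-liner is unclear, here is the same idea as a rough sketch (the dataset ID, configuration and fold are placeholders and have to match the run you want to validate):

```python
import os
import subprocess

# Rough sketch of the suggestion above: disable nnUNet_compile and re-run with --val.
# Dataset ID ("137"), configuration ("2d") and fold ("0") are placeholders.
env = os.environ.copy()
env["nnUNet_compile"] = "False"

subprocess.run(["nnUNetv2_train", "137", "2d", "0", "--val"], env=env, check=True)
```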

lukalafaye · Feb 12 '25 15:02

I already tried that, but I still get the error. Thank you!

elena-mulero · Feb 14 '25 10:02

+1. Frankly, the repo is hard to reproduce; I run into a lot of errors.

hizuka590 · Mar 30 '25 01:03

I have the same problem. Have you solved it? Thanks a lot

JiahaoHuang99 · Jul 14 '25 16:07

+1

shahzaib3120 · Sep 01 '25 00:09