Error while running nnUNet training: RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Hello everyone and thank you for providing the nnUNet code,
I am having problems when running the training. I am using an external server with Rocky Linux 8.9, Python 3.12.3 and CUDA 12.1.0. I installed the environment as described, first installing the compatible PyTorch package and then setting up the paths in the environment. The preprocessing command ran without problems on both custom datasets I am using.
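For context, this is roughly how I set up the paths (a sketch; the directories below are placeholders for my actual paths on the server):

```shell
# Placeholder directories; the real ones live on the server.
export nnUNet_raw=/path/nnUNetv2/nnUNet_raw
export nnUNet_preprocessed=/path/nnUNetv2/nnUNet_preprocessed
export nnUNet_results=/path/nnUNetv2/nnUNet_results
```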
I tried different GPUs for the training (A100, A40, V100 and T4), but I always get the same error:
"
/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py:164: FutureWarning: torch.cuda.amp.GradScaler(args...) is deprecated. Please use torch.amp.GradScaler('cuda', args...) instead.
self.grad_scaler = GradScaler() if self.device.type == 'cuda' else None
/path/nnUNetv2/nnunetv2-venv/lib/python3.12/site-packages/torch/optim/lr_scheduler.py:62: UserWarning: The verbose parameter is deprecated. Please use get_last_lr() to access the learning rate.
warnings.warn(
Exception in thread Thread-2 (results_loop):
Traceback (most recent call last):
  File "/apps/Arch/software/Python/3.12.3-GCCcore-13.3.0/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
Traceback (most recent call last):
  File "/path/nnUNetv2/nnunetv2-venv/nnUNet/nnunetv2/run/run_training.py", line 285, in
"
The traceback is truncated at that point in my log. I set export nnUNet_n_proc_DA=X and export nnUNet_compile=f as recommended in some other issues, but I still have this problem.
I could run the training for the 2d configuration up to epoch 914, but it crashed again with the same error, using a T4 and 1 worker.
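For completeness, this is the shape of what I ran (a sketch; the worker count is just an example value, and the training arguments are placeholders for my actual dataset/configuration/fold):

```shell
# Fewer data-augmentation workers (the X above; 1 here only as an example):
export nnUNet_n_proc_DA=1
# Disable torch.compile, as recommended in other issues:
export nnUNet_compile=f
# Then launch training (DATASET_ID CONFIG FOLD are placeholders):
# nnUNetv2_train DATASET_ID CONFIG FOLD
```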
Could you please help me understand what the problem is, or where it comes from? I would appreciate any recommendations.
Thank you!
+1
+1
+1
Try nnUNet_compile=False nnUNetv2_train --val
I already tried that, but I still get the error. Thank you!
+1. Frankly, the repo is hard to reproduce; I ran into a lot of errors.
I have the same problem. Have you solved it? Thanks a lot!
+1