Parallel training processes randomly crash
Hello,
I have been trying to run multiple folds in parallel, but the processes sometimes fail seemingly at random after a varying number of epochs (sometimes around epoch 90, sometimes around epoch 460, etc.).
Notes about the server
- 3 x H100 (80 GB memory each)
- 256 CPUs
- 1 TB RAM
Training facts
- NGC docker container used: nvcr.io/nvidia/pytorch:24.01-py3
- Driver Version: 545.23.08 CUDA Version: 12.3
- Each training process is pinned to one of the three GPUs; there is no distributed training (see the launch sketch below).
- 10 parallel training processes (folds 0-4 lowres, folds 0-4 fullres), i.e. 3-4 processes per GPU
- GPU memory is never exhausted
- CPUs are not oversubscribed
- No load alerts; during training the server runs at ~70% user, ~5% system, rest idle.
Training is very consistent until some of the processes die; at that point utilization drops, but everything else stays the same.
When running with CUDA_LAUNCH_BLOCKING=1, everything completes properly (much slower, of course).
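For reference, the launch looks roughly like the sketch below. This is a simplified reconstruction, not my exact script; the dataset ID and configuration names are placeholders.

```python
# Minimal sketch of the parallel launch: one nnUNetv2_train process per fold,
# GPUs assigned round-robin via CUDA_VISIBLE_DEVICES. Dataset ID is a placeholder.
import os
import subprocess
from itertools import cycle

DATASET = "001"                       # placeholder dataset id
CONFIGS = ["3d_lowres", "3d_fullres"] # the lowres/fullres configurations mentioned above
FOLDS = range(5)
gpus = cycle(["0", "1", "2"])         # 3 x H100

procs = []
for cfg in CONFIGS:
    for fold in FOLDS:
        env = os.environ.copy()
        env["CUDA_VISIBLE_DEVICES"] = next(gpus)
        # env["CUDA_LAUNCH_BLOCKING"] = "1"  # with this set, no crashes occur (but much slower)
        procs.append(subprocess.Popen(
            ["nnUNetv2_train", DATASET, cfg, str(fold)],
            env=env,
        ))

for p in procs:
    p.wait()
```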
I am hoping you might have some insights about what might be going on.
Traceback (most recent call last):
File "/usr/local/bin/nnUNetv2_train", line 8, in <module>
sys.exit(run_training_entry())
File "/usr/local/lib/python3.10/dist-packages/nnunetv2/run/run_training.py", line 268, in run_training_entry
run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
File "/usr/local/lib/python3.10/dist-packages/nnunetv2/run/run_training.py", line 204, in run_training
nnunet_trainer.run_training()
File "/usr/local/lib/python3.10/dist-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1242, in run_training
train_outputs.append(self.train_step(next(self.dataloader_train)))
File "/usr/local/lib/python3.10/dist-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 890, in train_step
torch.nn.utils.clip_grad_norm_(self.network.parameters(), 12)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/utils/clip_grad.py", line 76, in clip_grad_norm_
torch._foreach_mul_(grads, clip_coef_clamped.to(device)) # type: ignore[call-overload]
RuntimeError: CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception in thread Thread-4 (results_loop):
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/usr/local/lib/python3.10/dist-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Exception in thread Thread-5 (results_loop):
Traceback (most recent call last):
File "/usr/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/usr/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.10/dist-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
raise e
File "/usr/local/lib/python3.10/dist-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
2024-02-08 05:00:22.263358: train_loss -0.8898
2024-02-08 05:00:22.263632: val_loss -0.6569
2024-02-08 05:00:22.263725: Pseudo dice [0.7771, 0.6896, 0.7566]
2024-02-08 05:00:22.263822: Epoch time: 139.85 s
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: unspecified launch failure
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
Exception raised from c10_cuda_check_implementation at /opt/pytorch/pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xae (0x7f576e1b12ce in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f576e16798b in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3f2 (0x7f576ef93e72 in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10_cuda.so)
frame #3: <unknown function> + 0xd5688c (0x7f56e04b388c in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd58c00 (0x7f56e04b5c00 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x48333a (0x7f57265a833a in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0xd (0x7f576e18cedd in /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so)
frame #7: <unknown function> + 0x742d68 (0x7f5726867d68 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #8: THPVariable_subclass_dealloc(_object*) + 0x2e6 (0x7f57268680c6 in /usr/local/lib/python3.10/dist-packages/torch/lib/libtorch_python.so)
frame #9: <unknown function> + 0x13bf21 (0x55938507ef21 in /usr/bin/python)
frame #10: <unknown function> + 0x13bd1c (0x55938507ed1c in /usr/bin/python)
frame #11: <unknown function> + 0x1561d0 (0x5593850991d0 in /usr/bin/python)
frame #12: <unknown function> + 0x16a1c8 (0x5593850ad1c8 in /usr/bin/python)
frame #13: <unknown function> + 0x16a1f5 (0x5593850ad1f5 in /usr/bin/python)
frame #14: <unknown function> + 0x16a1f5 (0x5593850ad1f5 in /usr/bin/python)
frame #15: <unknown function> + 0x16a1f5 (0x5593850ad1f5 in /usr/bin/python)
frame #16: <unknown function> + 0x16a1f5 (0x5593850ad1f5 in /usr/bin/python)
frame #17: <unknown function> + 0x12d8ef (0x5593850708ef in /usr/bin/python)
frame #18: PyDict_SetItemString + 0xa3 (0x5593850749b3 in /usr/bin/python)
frame #19: <unknown function> + 0x267dd7 (0x5593851aadd7 in /usr/bin/python)
frame #20: Py_FinalizeEx + 0x148 (0x5593851a7b98 in /usr/bin/python)
frame #21: Py_RunMain + 0x173 (0x5593851992d3 in /usr/bin/python)
frame #22: Py_BytesMain + 0x2d (0x55938516fcad in /usr/bin/python)
frame #23: <unknown function> + 0x29d90 (0x7f576f8e4d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #24: __libc_start_main + 0x80 (0x7f576f8e4e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #25: _start + 0x25 (0x55938516fba5 in /usr/bin/python)
2024-02-08 05:00:23.462454:
Hey @che85, how is your RAM during training? Best, Sebastian
Hi @seziegler,
RAM usage is about 160 GB during training; the machine has 1 TB of RAM installed.
Unfortunately, we don't have experience with training multiple runs on one GPU simultaneously. However, when researching your error (`terminate called after throwing an instance of 'c10::Error'` with `CUDA error: unspecified launch failure`), the cause most often seems to be a CUDA version mismatch somewhere in the stack. Maybe you can try a different CUDA version (e.g. a different NGC container) and see if the error persists.
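If it helps, a quick way to check for such a mismatch from inside the container is something like the sketch below. It is just a rough check, not an official diagnostic, and it assumes `pynvml` is available (it ships with the NGC containers, otherwise `pip install nvidia-ml-py`).

```python
# Sketch: print the CUDA stack versions visible inside the container,
# so a driver/toolkit/PyTorch-build mismatch is easy to spot.
import torch
import pynvml

print("torch version:      ", torch.__version__)
print("built against CUDA: ", torch.version.cuda)            # e.g. '12.3'
print("cuDNN version:      ", torch.backends.cudnn.version())

pynvml.nvmlInit()
print("driver version:     ", pynvml.nvmlSystemGetDriverVersion())
for i in range(torch.cuda.device_count()):
    print(f"GPU {i}:", torch.cuda.get_device_name(i))
pynvml.nvmlShutdown()
```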