
RuntimeError for predicting validation cases

lolawang22 opened this issue 11 months ago · 7 comments

Hi,

I'm running nnUNet on a cluster with a GPU. After finishing training for 1000 epochs, it always hits a runtime error when predicting the 7th validation case. How can I solve this problem? Here's the full error message, thanks!

$ OMP_NUM_THREADS=1 nnUNetv2_train 220 3d_lowres 0 --val --npz
Using device: cuda:0

#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################

2024-03-06 19:35:32.706916: Using splits from existing split file: /nnUNet/nnunet_data/nnUNet_preprocessed/Dataset220_KiTS2023/splits_final.json
2024-03-06 19:35:32.710078: The split file contains 5 splits.
2024-03-06 19:35:32.710967: Desired fold for training: 0
2024-03-06 19:35:32.711860: This split has 391 training and 98 validation cases.
2024-03-06 19:35:32.714308: predicting case_00009
2024-03-06 19:35:32.727309: case_00009, shape torch.Size([1, 98, 225, 225]), rank 0
2024-03-06 19:36:06.597149: predicting case_00010
2024-03-06 19:36:06.648388: case_00010, shape torch.Size([1, 64, 211, 211]), rank 0
2024-03-06 19:36:12.849140: predicting case_00011
2024-03-06 19:36:13.042575: case_00011, shape torch.Size([1, 170, 196, 196]), rank 0
2024-03-06 19:36:28.421149: predicting case_00017
2024-03-06 19:36:28.498979: case_00017, shape torch.Size([1, 206, 185, 185]), rank 0
2024-03-06 19:36:45.562145: predicting case_00019
2024-03-06 19:36:45.847071: case_00019, shape torch.Size([1, 205, 261, 261]), rank 0
2024-03-06 19:37:28.563661: predicting case_00027
2024-03-06 19:37:28.635089: case_00027, shape torch.Size([1, 153, 249, 249]), rank 0
2024-03-06 19:37:58.268433: predicting case_00028
2024-03-06 19:37:58.282625: case_00028, shape torch.Size([1, 208, 234, 234]), rank 0
Traceback (most recent call last):
  File "/anaconda3/envs/nnunet/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/nnunet/nnunetv2/run/run_training.py", line 268, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/nnunet/nnunetv2/run/run_training.py", line 208, in run_training
    nnunet_trainer.perform_actual_validation(export_validation_probabilities)
  File "/nnunet/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1175, in perform_actual_validation
    proceed = not check_workers_alive_and_busy(segmentation_export_pool, worker_list, results,
  File "/nnunet/nnunetv2/utilities/file_path_utilities.py", line 103, in check_workers_alive_and_busy
    raise RuntimeError('Some background workers are no longer alive')
RuntimeError: Some background workers are no longer alive

lolawang22 avatar Mar 07 '24 00:03 lolawang22
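
For context on the error itself: the failing check, check_workers_alive_and_busy, follows a common multiprocessing pattern in which the main process verifies that its background export workers are still running and aborts if one of them has died (for example, because the operating system killed it when RAM ran out). Below is a minimal, simplified sketch of that pattern; it is not the actual nnUNet code, just an illustration of why this RuntimeError appears.

```python
import multiprocessing as mp

def _work():
    pass  # placeholder task; real workers would resample and export segmentations

def check_workers_alive(workers):
    # A worker that exited with a non-zero code (e.g. OOM-killed) will never
    # deliver its result, so fail fast instead of waiting forever.
    for w in workers:
        if not w.is_alive() and w.exitcode not in (None, 0):
            raise RuntimeError('Some background workers are no longer alive')

if __name__ == '__main__':
    workers = [mp.Process(target=_work) for _ in range(2)]
    for w in workers:
        w.start()
    check_workers_alive(workers)  # would raise if a worker had already crashed
    for w in workers:
        w.join()
```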

Hello, I have encountered a similar issue when running the validation prediction. The problem arises in the Python multiprocessing code used for parallel computation. Based on the error messages, there are two main issues: RuntimeError: Some background workers are no longer alive, and a multiprocessing.managers.RemoteError with a KeyError. How can this be resolved?

Saul62 avatar Mar 07 '24 07:03 Saul62

Hi lolawang22 and Saul62, could you please try this solution: https://github.com/MIC-DKFZ/nnUNet/issues/1546#issuecomment-1672911183 and, if that is not enough, this more drastic one: https://github.com/MIC-DKFZ/nnUNet/issues/1546#issuecomment-1731774808

Kobalt93 avatar Mar 08 '24 10:03 Kobalt93
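
For readers landing here from search: the linked workaround lowers the number of background workers used during prediction. A hedged example of what that invocation could look like (the folder names are placeholders, and the dataset ID and configuration are taken from this thread rather than from the linked issue):

```bash
# -npp limits the preprocessing workers, -nps the segmentation-export workers;
# fewer workers means lower peak RAM usage at the cost of speed.
nnUNetv2_predict -i INPUT_FOLDER -o OUTPUT_FOLDER -d 220 -c 3d_lowres -f 0 -npp 2 -nps 2
```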

Hi Kobalt93,

Thank you for the suggestion! I noticed that the linked solution uses nnUNetv2_predict, while my error occurs when running nnUNetv2_train for validation. Can I also append -npp 2 -nps 2 to the nnUNetv2_train command?

lolawang22 avatar Mar 09 '24 02:03 lolawang22
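
As a general tip, not specific to this thread: since the nnU-Net entry points are ordinary command-line tools, the quickest way to check which worker-related options a command actually accepts is its built-in help, for example:

```bash
# Lists the arguments supported by the training entry point; as noted further
# down in this thread, -npp/-nps are options of nnUNetv2_predict, not of
# nnUNetv2_train.
nnUNetv2_train -h
```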

Hi lolawang22, how much RAM do you have?

Saul62 avatar Mar 09 '24 02:03 Saul62

Hi Saul62, I think I have 20 GB of RAM with 1 GPU.

lolawang22 avatar Mar 09 '24 02:03 lolawang22

Hi Kobalt93, I noticed that nnUNetv2_train does not have the -npp and -nps arguments. Instead, I tried -num_gpus 1, but it does not solve my problem. I also tried increasing the RAM to 30 GB; the validation process then predicted 2 more cases but still ended with the RuntimeError. I wonder how much RAM is recommended for the validation process. Thanks!

lolawang22 avatar Mar 09 '24 20:03 lolawang22

Have you checked whether your RAM was full?

Kobalt93 avatar Mar 22 '24 09:03 Kobalt93
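
One way to answer that question is to watch memory usage on the node while the validation runs, for example with free -h or htop, or with a small script along the lines of the sketch below (it uses the third-party psutil package; the 2-second interval is an arbitrary choice):

```python
import time
import psutil  # third-party: pip install psutil

# Print RAM usage every few seconds while nnUNet validation runs in another
# terminal; available memory dropping toward zero just before the crash is a
# strong hint that the export workers are being killed by the OOM handler.
while True:
    mem = psutil.virtual_memory()
    print(f"used {mem.used / 1e9:.1f} GB of {mem.total / 1e9:.1f} GB "
          f"(available {mem.available / 1e9:.1f} GB)")
    time.sleep(2)
```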