nnUNet
RuntimeError for predicting validation cases
Hi,
I'm running nnUNet on a cluster with a GPU. After finishing training for 1000 epochs, it always hits a runtime error when predicting the 7th validation case. How can I solve this problem? Here's the full error message, thanks!
$ OMP_NUM_THREADS=1 nnUNetv2_train 220 3d_lowres 0 --val --npz
Using device: cuda:0
#######################################################################
Please cite the following paper when using nnU-Net:
Isensee, F., Jaeger, P. F., Kohl, S. A., Petersen, J., & Maier-Hein, K. H. (2021). nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation. Nature methods, 18(2), 203-211.
#######################################################################
2024-03-06 19:35:32.706916: Using splits from existing split file: /nnUNet/nnunet_data/nnUNet_preprocessed/Dataset220_KiTS2023/splits_final.json
2024-03-06 19:35:32.710078: The split file contains 5 splits.
2024-03-06 19:35:32.710967: Desired fold for training: 0
2024-03-06 19:35:32.711860: This split has 391 training and 98 validation cases.
2024-03-06 19:35:32.714308: predicting case_00009
2024-03-06 19:35:32.727309: case_00009, shape torch.Size([1, 98, 225, 225]), rank 0
2024-03-06 19:36:06.597149: predicting case_00010
2024-03-06 19:36:06.648388: case_00010, shape torch.Size([1, 64, 211, 211]), rank 0
2024-03-06 19:36:12.849140: predicting case_00011
2024-03-06 19:36:13.042575: case_00011, shape torch.Size([1, 170, 196, 196]), rank 0
2024-03-06 19:36:28.421149: predicting case_00017
2024-03-06 19:36:28.498979: case_00017, shape torch.Size([1, 206, 185, 185]), rank 0
2024-03-06 19:36:45.562145: predicting case_00019
2024-03-06 19:36:45.847071: case_00019, shape torch.Size([1, 205, 261, 261]), rank 0
2024-03-06 19:37:28.563661: predicting case_00027
2024-03-06 19:37:28.635089: case_00027, shape torch.Size([1, 153, 249, 249]), rank 0
2024-03-06 19:37:58.268433: predicting case_00028
2024-03-06 19:37:58.282625: case_00028, shape torch.Size([1, 208, 234, 234]), rank 0
Traceback (most recent call last):
File "/anaconda3/envs/nnunet/bin/nnUNetv2_train", line 8, in
Hello, I have encountered a similar issue when running validation during prediction. The problem arises when the Python multiprocessing module is used for parallel computation. Based on the error messages, there are two main issues: RuntimeError: Some background workers are no longer alive, and multiprocessing.managers.RemoteError with a KeyError. How can this be resolved?
Hi lolawang22 and Saul62, could you please try this solution: https://github.com/MIC-DKFZ/nnUNet/issues/1546#issuecomment-1672911183 In critical cases: https://github.com/MIC-DKFZ/nnUNet/issues/1546#issuecomment-1731774808
Hi Kobalt93,
Thank you for the suggestion! I noticed that it uses nnUNetv2_predict, while my error occurs when using nnUNetv2_train for validation. Can I also add -npp 2 -nps 2 to the end of the nnUNetv2_train command?
Hi lolawang22, how much RAM do you have?
Hi Saul62, I think I have 20 GB of RAM with 1 GPU.
Hi Kobalt93, I noticed that nnUNetv2_train does not have the -npp and -nps arguments. Instead, I tried -num_gpus 1, but it does not solve my problem. I also tried increasing the RAM to 30 GB, and the validation process did predict 2 more cases but still ended with a RuntimeError. I wonder how much RAM is recommended for the validation process. Thanks!
Have you checked whether your RAM was full?
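One quick way to check is to watch memory in a second terminal while validation runs (e.g. `free -h` or `htop`), or with a small Linux-only script that reads /proc/meminfo (field names are from the standard procfs layout; this is just a monitoring sketch, unrelated to nnU-Net itself):

```python
# Linux-only: report available vs. total RAM from /proc/meminfo
def meminfo_gib():
    fields = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            fields[key] = int(value.split()[0])  # values are in kB
    total = fields["MemTotal"] / 1024 ** 2
    avail = fields["MemAvailable"] / 1024 ** 2
    return total, avail

if __name__ == "__main__":
    total, avail = meminfo_gib()
    print(f"RAM: {avail:.1f} GiB available of {total:.1f} GiB total")
```

If available memory drops toward zero right before the crash, the background workers are almost certainly being killed by the OOM killer, and fewer workers or a larger memory allocation for the job should help.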