nnUNet

The program does not stop after 1000 epochs

Open Bigsealion opened this issue 2 years ago • 9 comments

Hi, I have finished training a 2d nnUNet (all 1000 epochs), but the program does not exit and the GPU memory is still occupied. (In fact, the process is no longer doing any work: `ps -aux` shows its STAT as S or Sl.) As a result, the only way I can stop the process is to kill it with the `kill` command.
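For reference, the `STAT` values mentioned above can be decoded with a small lookup. This is only an illustrative sketch based on the process state codes documented in the Linux `ps` man page; the helper name `describe_stat` is hypothetical and is not part of nnUNet or any library.

```python
# Process state codes as documented in the Linux ps(1) man page.
STATE_CODES = {
    "D": "uninterruptible sleep (usually IO)",
    "R": "running or runnable",
    "S": "interruptible sleep (waiting for an event)",
    "T": "stopped by job control signal",
    "Z": "defunct (zombie)",
}

# BSD-style modifiers that may follow the state letter.
MODIFIERS = {
    "<": "high-priority",
    "N": "low-priority",
    "L": "has pages locked into memory",
    "s": "session leader",
    "l": "multi-threaded",
    "+": "foreground process group",
}


def describe_stat(stat):
    """Hypothetical helper: decode a ps STAT string such as 'Sl'."""
    if not stat:
        return []
    # The first character is the state, the rest are modifiers.
    parts = [STATE_CODES.get(stat[0], f"unknown state {stat[0]!r}")]
    parts += [MODIFIERS.get(ch, f"unknown modifier {ch!r}") for ch in stat[1:]]
    return parts


print(describe_stat("S"))
print(describe_stat("Sl"))
```

Both values describe a process that is asleep and waiting for something rather than using the CPU, which matches a process that still exists but makes no progress.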

My training command is:

`nohup nnUNet_train 2d nnUNetTrainerV2 Task604_Seg 3 --npz &> ./task604_2d_f3.txt &`

Could `nohup` or the trailing `&` have caused this problem, or is it something else?

In the default log file (training_log_2021_11_4_15_55_53.txt), the last paragraph is as follows:

`
2021-11-05 13:25:22.656965: epoch: 999
2021-11-05 13:26:33.520940: train loss : -0.9004
2021-11-05 13:26:39.819033: validation loss: -0.8059
2021-11-05 13:26:39.820221: Average global foreground Dice: [0.9105, 0.7286]
2021-11-05 13:26:39.820441: (interpret this as an estimate for the Dice of the different classes. This is not exact.)
2021-11-05 13:26:40.250808: lr: 0.0
2021-11-05 13:26:40.251173: saving scheduled checkpoint file...
2021-11-05 13:26:40.286398: saving checkpoint...
2021-11-05 13:26:40.507148: done, saving took 0.26 seconds
2021-11-05 13:26:40.525447: done
2021-11-05 13:26:40.525670: This epoch took 77.868647 s

2021-11-05 13:26:40.556237: saving checkpoint...
2021-11-05 13:26:40.960068: done, saving took 0.43 seconds
2021-11-05 13:31:52.777114: finished prediction
2021-11-05 13:31:52.777730: evaluation of raw predictions
`

In my log file, the last paragraph is as follows:

SegData_743 (2, 91, 109, 91)
debug: mirroring True mirror_axes (0, 1)
2021-11-05 13:31:52.777114: finished prediction
2021-11-05 13:31:52.777730: evaluation of raw predictions
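To see where the run stalls, the timestamps in the log can be compared. Below is a minimal sketch, assuming every line has the form `timestamp: message` as in the excerpts above; `parse_log_line` is a hypothetical helper, not an nnUNet function.

```python
from datetime import datetime


def parse_log_line(line):
    """Split 'YYYY-MM-DD HH:MM:SS.ffffff: message' into (datetime, message)."""
    # The first ": " separates the timestamp from the message text.
    stamp, message = line.split(": ", 1)
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S.%f"), message


# Lines copied from the log excerpt above.
log_lines = [
    "2021-11-05 13:26:40.960068: done, saving took 0.43 seconds",
    "2021-11-05 13:31:52.777114: finished prediction",
    "2021-11-05 13:31:52.777730: evaluation of raw predictions",
]

entries = [parse_log_line(line) for line in log_lines]

# Time between saving the final checkpoint and the end of prediction.
prediction_seconds = (entries[1][0] - entries[0][0]).total_seconds()
last_time, last_message = entries[-1]

print(f"final checkpoint to 'finished prediction': {prediction_seconds:.1f} s")
print(f"last logged step: {last_message!r} at {last_time}")
```

On these lines, the last entry written is `evaluation of raw predictions`, about five minutes after the final checkpoint was saved, and nothing is logged after it.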

Thanks!

Bigsealion, Nov 05 '21 08:11