
error in training continuation

taborzbislaw opened this issue 3 years ago · 10 comments

Hi,

I am trying to continue training for the BrainTumour dataset of the Medical Segmentation Decathlon. After training for about 5000 epochs I reach the time limit of the queue system where I run my computations, so I have to restart training. I add the -c option to the command line. Training starts again, but I get the strange behaviour shown in the enclosed figure: it appears that training starts from scratch instead of resuming from the latest checkpoint. What could be the cause of the problem?

Best regards, Zbisław

[figure: training progress plot]

taborzbislaw avatar Jan 05 '22 19:01 taborzbislaw

The problem exists even if I run training for a smaller number of epochs (e.g. 400) and then restart training from a saved checkpoint. From the training logs it appears that the learning rate is restored incorrectly when resuming: in the enclosed training_log_2022_1_5_21_38_01.txt the last learning rate is 4.6e-05, but after resuming training (training_log_2022_1_6_08_32_46.txt) it is set to 0.007381. So I expect that the most likely reason for the jump in the quality metrics is this unexpected change of learning rate.

training_log_2022_1_5_21_38_01.txt training_log_2022_1_6_08_32_46.txt

[figure: training progress plot]

taborzbislaw avatar Jan 06 '22 09:01 taborzbislaw

It appears that I have found the source of the problem: one MUST NOT change the maximal number of epochs when resuming training, because the learning rate depends not only on the current epoch but also on the maximal number of epochs. Increasing the maximal number of epochs when resuming raises the learning rate above the last learning rate of the previous run, which causes the jumps in the training quality metrics. I am not sure if this is intended behaviour.

taborzbislaw avatar Jan 06 '22 20:01 taborzbislaw

Hello,

If what you said is correct, then suppose I finish training for 1000 epochs and have not achieved the desired result, and I want to increase the number of training epochs, or I want to add training data and continue training. How should I avoid the problem you mentioned?

Best, Crack

KIC-Crack avatar Jan 08 '22 11:01 KIC-Crack

Hi,

Changing the dataset and continuing training on the modified dataset is a separate problem. If you do that, you must rerun nnUNet_plan_and_preprocess on the new dataset and then manually modify (at least) the file "splits_final.pkl", which is created in the nnUNet_preprocessed/Task* folder after training starts, so that the final splits include your new data. Then you run nnUNet_train with the -c option and the model will be trained on the new data. It works; I tested it some time ago. I just do not remember whether "splits_final.pkl" was the only file I had to modify. It is possible that "plans.pkl" in /nnUNet_trained_models//Task/nnUNetTrainerV2__nnUNetPlansv2.1 also has to be modified.
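
In case it helps, here is a minimal sketch of what I mean by editing the splits file. It assumes splits_final.pkl is a pickled list with one entry per fold, each holding 'train' and 'val' lists of case identifiers (please verify against your own file); the path and case names below are placeholders:

```python
import pickle

# Placeholder path and case identifiers, purely for illustration
splits_path = "nnUNet_preprocessed/TaskXXX_MyTask/splits_final.pkl"
new_cases = ["case_101", "case_102"]  # identifiers of the newly added training images

with open(splits_path, "rb") as f:
    splits = pickle.load(f)  # one dict per fold: {'train': [...], 'val': [...]}

for fold in splits:
    # append the new cases to the training split of every fold
    fold["train"] = list(fold["train"]) + new_cases

with open(splits_path, "wb") as f:
    pickle.dump(splits, f)
```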

Concerning the learning rate: the simplest solution is to start training from scratch using a maximal number of epochs estimated from the first training. This solution does not require modifying nnUNet code (except some trainer class, e.g. https://github.com/MIC-DKFZ/nnUNet/blob/master/nnunet/training/network_training/nnUNetTrainerV2.py, line 48, where the maximal number of epochs is hardcoded), but it is clearly a waste of time. In the current implementation of nnUNet, in contrast to what is declared in the Nature Methods article (where a reduce-learning-rate-on-plateau strategy is mentioned), the learning rate decays with the epoch according to the formula (https://github.com/MIC-DKFZ/nnUNet/blob/96d44c2fc1ce5e18da4cd54bf52882047c37982e/nnunet/training/learning_rate/poly_lr.py#L16)

lr = initial_lr * (1 - epoch / max_epochs)**exponent

where initial_lr and max_epochs are set in lines 48 and 49 of https://github.com/MIC-DKFZ/nnUNet/blob/master/nnunet/training/network_training/nnUNetTrainerV2.py and exponent is equal to 0.9 (line 405 of nnUNetTrainerV2.py).
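
For reference, the linked poly_lr.py essentially boils down to this single function (copied here for convenience; the linked file is authoritative):

```python
def poly_lr(epoch, max_epochs, initial_lr, exponent=0.9):
    # polynomial decay: initial_lr at epoch 0, approaching 0 as epoch reaches max_epochs
    return initial_lr * (1 - epoch / max_epochs) ** exponent
```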

So, for example, suppose you set initial_lr to 0.01 and max_epochs to 1000; after 1000 epochs your learning rate is 0. Rather bad :( Now assume that after 500 epochs you see that your training must last for at least 5000 epochs, so you stop the training, change the maximal number of epochs to 5000 and restart. You finished the previous training with a learning rate of around 0.0054, but you will start the new training with an lr of around 0.0091, which must lead to effects like those in my picture. So, maybe to overcome the problem, the formula for the learning rate should be something like:

lr = lr_in_model_used_for_resuming_training * (1 - (epoch - epoch_of_saving_of_model_used_for_resuming_training) / max_epochs)**exponent
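
To make the jump concrete, and to sketch the rebased schedule I am suggesting (the rebased variant below is only my proposal, not anything implemented in nnUNet):

```python
def poly_lr(epoch, max_epochs, initial_lr=0.01, exponent=0.9):
    # current nnUNet schedule, as above
    return initial_lr * (1 - epoch / max_epochs) ** exponent

# stop a 1000-epoch run at epoch 500, then resume with max_epochs raised to 5000
print(round(poly_lr(500, 1000), 4))  # 0.0054 -> learning rate at the checkpoint
print(round(poly_lr(500, 5000), 4))  # 0.0091 -> learning rate right after resuming (the jump)

def rebased_poly_lr(epoch, resume_epoch, max_epochs, resume_lr, exponent=0.9):
    # proposed: decay from the checkpoint learning rate over the epochs still to come
    return resume_lr * (1 - (epoch - resume_epoch) / max_epochs) ** exponent

print(round(rebased_poly_lr(500, 500, 5000, poly_lr(500, 1000)), 4))  # 0.0054 -> no jump at resume
```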

Best, Zbisław

taborzbislaw avatar Jan 08 '22 18:01 taborzbislaw

Hello,

Best advice,

I think what you're saying is that if training has not yet reached the maximum number of epochs (1000), I can continue it the way you described, right? My training has already reached the maximum number of epochs (1000), and I want to continue training by adding more data. Can I use the solution you mentioned? Forgive my stupidity 😁

Best, Crack

KIC-Crack avatar Jan 09 '22 03:01 KIC-Crack

Hi,

yes, you can, but to continue training with an increased number of epochs and a changed dataset you must modify the nnUNet scripts (at least nnUNetTrainerV2.py) and the pickle files splits_final.pkl and (likely) plans.pkl.

Best regards, Zbisław

taborzbislaw avatar Jan 09 '22 15:01 taborzbislaw

Hi there, I am a bit confused about

In the current implementation of nnUNet, in contrast to what is declared in Nature Methods article (where reduce learning rate on plateau strategy is mentioned), learning rate decays with the epoch

The Nature Methods paper never says this. The reduce-on-plateau schedule is from the very first version of nnU-Net (2018) and has been outdated for a long time.

Also see:

[image attachment]

as well as Figure 2 in the paper.

You are right that modifying the number of epochs will cause problems with the LR, but there really is no good way around this with the current learning rate scheduler. This is not necessarily an inherent problem with nnU-Net: this schedule (much like cosine annealing) is designed to work with a fixed number of epochs. You can of course build workarounds like the one you proposed, but my fear is that none of them would be perfect. To me the easiest solution would be the following:

  • start some training for 1000 epochs, after XX epochs you notice that you need to train longer
  • create a new nnunet trainer with max_num_epochs equal to the remaining number of epochs you want to train (so if you stop at epoch 500 and want to train for 5000, then 4500 would be what you set); a sketch of such a trainer follows below this list
  • give this trainer a lower initial learning rate. You could set it to the last learning rate from the previous training, but that training may already have progressed so far that the lr is too low. In that case give it something that is lower than nnU-Net's standard initial lr (this may need tuning)
  • start a new training with your new trainer class, using the pretrained model weights from the unfinished training as initialization
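
A rough sketch of such a trainer (the class name and numbers are placeholders; the constructor signature should be checked against nnUNetTrainerV2 in your version of the repository):

```python
from nnunet.training.network_training.nnUNetTrainerV2 import nnUNetTrainerV2


class nnUNetTrainerV2_resumed(nnUNetTrainerV2):
    """Placeholder trainer for continuing an unfinished run with a fresh schedule."""

    def __init__(self, plans_file, fold, output_folder=None, dataset_directory=None,
                 batch_dice=True, stage=None, unpack_data=True, deterministic=True,
                 fp16=False):
        super().__init__(plans_file, fold, output_folder, dataset_directory, batch_dice,
                         stage, unpack_data, deterministic, fp16)
        self.max_num_epochs = 4500  # remaining epochs you still want to train
        self.initial_lr = 5e-3      # lower than the default 1e-2; may need tuning
```

Place the file somewhere under nnunet/training/network_training so the class can be found by name, then start a fresh training with it and initialize from the weights of the unfinished run (depending on your nnU-Net version, nnUNet_train offers a pretrained-weights argument for this; otherwise load the checkpoint manually).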

Best, Fabian

FabianIsensee avatar Jan 10 '22 13:01 FabianIsensee

Hi,

sorry for my mistake concerning the learning rate scheduler, and thank you for the explanations.

Best, Zbisław

taborzbislaw avatar Jan 10 '22 15:01 taborzbislaw

Hi, @FabianIsensee Could you please explain what was actually the reason for the change from reduce-on-plateau to poly? Does it lead to faster convergence, or was it due to performance issues?

dan-gut avatar Jan 13 '22 21:01 dan-gut

The new schedule gives better segmentation performance. Convergence takes longer though

FabianIsensee avatar Jan 14 '22 06:01 FabianIsensee