
Error while restarting training.

evilc3 opened this issue 3 years ago · 3 comments

Getting the following error: Early stopping conditioned on metric val_loss which is not available. Pass in or modify your EarlyStopping callback to use any of the following: ``

Using nemo-toolkit[asr]==1.10.0 with pytorch-lightning 1.7.3, torch 1.12.1, and torchmetrics 0.9.3.

This looks like a PyTorch Lightning issue. Is there any workaround for this?

trainer = ptl.Trainer(
    val_check_interval=FLAGS.val_check_interval,
    devices=-1,
    max_epochs=EPOCHS,
    accumulate_grad_batches=FLAGS.accumulate_grad_batches,
    enable_checkpointing=FLAGS.enable_checkpointing,
    logger=FLAGS.logger,
    log_every_n_steps=FLAGS.log_every_n_steps,  # value should be high to save memory
    callbacks=[early_stop],
    sync_batchnorm=FLAGS.sync_batchnorm,
    gradient_clip_val=FLAGS.gradient_clip_val,
    num_sanity_val_steps=0,
)
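
The early_stop callback itself is not shown in the issue; a minimal sketch of a definition that would reproduce the reported message (the monitor name and patience value are assumptions) is:

from pytorch_lightning.callbacks import EarlyStopping

# Hypothetical sketch of the early_stop callback referenced above. Monitoring
# "val_loss" triggers the reported error when the model never logs that metric.
early_stop = EarlyStopping(
    monitor="val_loss",  # metric to watch; must be logged during validation
    mode="min",          # stop once the monitored value stops decreasing
    patience=10,         # validation checks with no improvement before stopping
    verbose=True,
)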

@titu1994

evilc3 · Sep 09 '22 13:09

One way I found around this was to remove the callback.

evilc3 · Sep 09 '22 13:09

We don't track val_loss for RNNT. Switch to the val_wer metric instead.

titu1994 · Sep 09 '22 13:09
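
A minimal sketch of that change, using the standard EarlyStopping callback from PyTorch Lightning (the patience value is an assumption):

from pytorch_lightning.callbacks import EarlyStopping

# Monitor the word error rate that NeMo ASR models log during validation,
# instead of the unavailable val_loss.
early_stop = EarlyStopping(
    monitor="val_wer",  # logged by NeMo ASR models at validation time
    mode="min",         # lower WER is better
    patience=10,
    verbose=True,
)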

Hi @titu1994, I have successfully resumed training; the only change I made was adding a new training dataset.

When resuming from the previous .ckpt checkpoint, it says it is not an end-of-epoch checkpoint, re-runs the last few steps of the epoch it was supposed to resume from, and then moves on to the next epoch properly.

The problem is that my last recorded WER was 9% and the current WER is 15%. Other parameters, such as the learning rate, have resumed from where they left off. Is this much of an increase normal behaviour?

Should I just stop and fine-tune instead of resuming from the last checkpoint? And what learning rate should I use? In the fine-tuning notebook I have seen lr = 0.025.

evilc3 · Sep 14 '22 08:09
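
For context, a minimal sketch of the two options discussed above, assuming an RNNT BPE model and a checkpoint saved as last.ckpt (the model class, paths, epoch counts, and the 0.025 learning rate from the fine-tuning notebook are all assumptions):

import pytorch_lightning as ptl
import nemo.collections.asr as nemo_asr

# Option A: resume -- the optimizer, LR scheduler, and epoch/step counters are
# restored from the checkpoint, so training continues where it stopped.
model = nemo_asr.models.EncDecRNNTBPEModel.load_from_checkpoint("last.ckpt")  # hypothetical path
trainer = ptl.Trainer(accelerator="auto", devices=-1, max_epochs=100)
trainer.fit(model, ckpt_path="last.ckpt")  # ckpt_path restores the full training state

# Option B: fine-tune -- reuse only the weights and start a fresh optimizer
# with a small learning rate (0.025 is the value from the fine-tuning notebook).
model = nemo_asr.models.EncDecRNNTBPEModel.load_from_checkpoint("last.ckpt")
model.cfg.optim.lr = 0.025  # overwrite the learning rate in the restored config
trainer = ptl.Trainer(accelerator="auto", devices=-1, max_epochs=50)
trainer.fit(model)  # no ckpt_path: optimizer and scheduler start from scratch

In this sketch, resuming keeps the original learning-rate schedule, whereas fine-tuning restarts it with the new value.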

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] · Oct 15 '22 02:10

This issue was closed because it has been inactive for 7 days since being marked as stale.

github-actions[bot] · Oct 23 '22 02:10