NeMo
Error while restarting training.
I am getting the following error: Early stopping conditioned on metric val_loss which is not available. Pass in or modify your EarlyStopping callback to use any of the following: ``
using nemo-toolkit[asr]==1.10.0
pytorch-lightning 1.7.3, torch 1.12.1, torchmetrics 0.9.3
This looks like a PyTorch Lightning issue. Is there any workaround for this?
import pytorch_lightning as ptl
from pytorch_lightning.callbacks import EarlyStopping

# EarlyStopping monitors val_loss here, which is what triggers the error above.
early_stop = EarlyStopping(monitor="val_loss", mode="min")

# FLAGS and EPOCHS come from my own script configuration.
trainer = ptl.Trainer(
    val_check_interval=FLAGS.val_check_interval,
    devices=-1,
    max_epochs=EPOCHS,
    accumulate_grad_batches=FLAGS.accumulate_grad_batches,
    enable_checkpointing=FLAGS.enable_checkpointing,
    logger=FLAGS.logger,
    log_every_n_steps=FLAGS.log_every_n_steps,  # value should be high to save memory
    callbacks=[early_stop],
    sync_batchnorm=FLAGS.sync_batchnorm,
    gradient_clip_val=FLAGS.gradient_clip_val,
    num_sanity_val_steps=0,
)
@titu1994
One way I found around this was removing the EarlyStopping callback.
We don't track val_loss for RNNT. Switch to the val_wer metric instead.
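For anyone hitting the same error, a minimal sketch of that change, assuming the stock pytorch_lightning EarlyStopping callback and that the model logs val_wer during validation; the patience and min_delta values below are placeholders, not recommendations:

from pytorch_lightning.callbacks import EarlyStopping

# RNNT models log word error rate instead of a validation loss, so condition
# early stopping on "val_wer"; lower WER is better, hence mode="min".
early_stop = EarlyStopping(
    monitor="val_wer",
    mode="min",
    patience=10,       # placeholder value
    min_delta=0.001,   # placeholder value
    verbose=True,
)
# Pass it to the trainer exactly as before: callbacks=[early_stop]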
Hi @titu1994, I have successfully resumed training; the only change I made was adding a new training dataset.
When resuming from the previous .ckpt checkpoint, it says the checkpoint is not an end-of-epoch checkpoint, re-trains the last few steps of the epoch it was supposed to resume from, and then properly moves on to the next epoch.
The problem is that my last recorded WER was 9% and the current WER is 15%. Other parameters like the learning rate have resumed from where they left off. Is this much of an increase normal behaviour?
Should I just stop and fine-tune instead of resuming from the last checkpoint? And what learning rate should I use? In the fine-tuning notebook I have seen lr = 0.025.
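For reference, a rough sketch of the difference between the two options, assuming an RNNT model class such as EncDecRNNTBPEModel and a Lightning .ckpt file; the checkpoint path, epoch counts, and the cfg.optim location are my own assumptions, and 0.025 is only the value quoted from the fine-tuning notebook:

import pytorch_lightning as ptl
import nemo.collections.asr as nemo_asr

# Option 1: resume - Lightning restores the optimizer, LR scheduler and
# step/epoch counters from the checkpoint, so training continues where it
# left off (this is the "not an end-of-epoch checkpoint" behaviour above).
model = nemo_asr.models.EncDecRNNTBPEModel.load_from_checkpoint("last.ckpt")
trainer = ptl.Trainer(devices=-1, max_epochs=100)
trainer.fit(model, ckpt_path="last.ckpt")

# Option 2: fine-tune - load only the weights and start a fresh run with a
# lower learning rate, without passing ckpt_path.
model = nemo_asr.models.EncDecRNNTBPEModel.load_from_checkpoint("last.ckpt")
model.cfg.optim.lr = 0.025  # assumes the optimizer settings live under model.cfg.optim
trainer = ptl.Trainer(devices=-1, max_epochs=50)
trainer.fit(model)

Resuming keeps the optimizer and scheduler state from the checkpoint, while fine-tuning starts them fresh at the lower learning rate.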
This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.
This issue was closed because it has been inactive for 7 days since being marked as stale.