
[Bug] Crash after val epoch with ReduceLROnPlateau

Open MiXaiLL76 opened this issue 1 year ago • 3 comments

Prerequisite

  • [X] I have searched Issues and Discussions but cannot get the expected help.
  • [X] The bug has not been fixed in the latest version (https://github.com/open-mmlab/mmengine).

Environment

I'm training an MMOCR model, which does not compute a loss during the validation epoch. As a result, the scheduler raises an error here:

https://github.com/open-mmlab/mmengine/blob/main/mmengine/optim/scheduler/param_scheduler.py#L1488

File "/home/rdl/.local/lib/python3.8/site-packages/mmengine/hooks/param_scheduler_hook.py", line 120, in after_val_epoch
    step(runner.param_schedulers)
  File "/home/rdl/.local/lib/python3.8/site-packages/mmengine/hooks/param_scheduler_hook.py", line 117, in step
    scheduler.step(metrics)
  File "/home/rdl/.local/lib/python3.8/site-packages/mmengine/optim/scheduler/param_scheduler.py", line 1488, in step
    raise KeyError(f'Excepted key in {list(metrics.keys())},'

I think this error is worth revisiting, for example by adding an option that controls whether it is raised at all.

As I understand it, if the monitored value is not available, the scheduler should simply skip its calculation for that step.

The same thing happens the other way round: if you monitor a validation metric instead of the loss, errors are raised as well.

Reproduces the problem - code sample

dict(
    type='ReduceOnPlateauLR',
    monitor='loss',
    patience=5,
    factor=0.5,
    begin=int(EPOCH_COUNT * 0.1),
),

Reproduces the problem - command or script

python3 train.py any_config_with_scheduler

Reproduces the problem - error message

Traceback (most recent call last):
  File "./train.py", line 123, in <module>
    main()
  File "./train.py", line 119, in main
    runner.train()
  File "/home/rdl/.local/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1706, in train
    model = self.train_loop.run()  # type: ignore
  File "/home/rdl/.local/lib/python3.8/site-packages/mmengine/runner/loops.py", line 102, in run
    self.runner.val_loop.run()
  File "/home/rdl/.local/lib/python3.8/site-packages/mmengine/runner/loops.py", line 367, in run
    self.runner.call_hook('after_val_epoch', metrics=metrics)
  File "/home/rdl/.local/lib/python3.8/site-packages/mmengine/runner/runner.py", line 1768, in call_hook
    getattr(hook, fn_name)(self, **kwargs)
  File "/home/rdl/.local/lib/python3.8/site-packages/mmengine/hooks/param_scheduler_hook.py", line 120, in after_val_epoch
    step(runner.param_schedulers)
  File "/home/rdl/.local/lib/python3.8/site-packages/mmengine/hooks/param_scheduler_hook.py", line 117, in step
    scheduler.step(metrics)
  File "/home/rdl/.local/lib/python3.8/site-packages/mmengine/optim/scheduler/param_scheduler.py", line 1488, in step
    raise KeyError(f'Excepted key in {list(metrics.keys())},'
KeyError: "Excepted key in ['CocoOCRDataset/recog/word_acc', 'CocoOCRDataset/recog/word_acc_ignore_case', 'CocoOCRDataset/recog/word_acc_ignore_case_symbol', 'CocoOCRDataset/recog/char_recall', 'CocoOCRDataset/recog/char_precision', 'CocoOCRDataset/recog/1-N.E.D'], but got key loss is not in dict"
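The failing condition can be imitated without mmengine. The sketch below is a simplified stand-in for the check at the bottom of the traceback, not the library's actual code: `check_monitor` is a hypothetical helper, the metric values are placeholders, and only the key names are taken from the error message above.

```python
def check_monitor(metrics: dict, monitor: str) -> float:
    """Return the monitored value, or raise KeyError if the key is absent.

    Hypothetical helper that mimics the behaviour seen in the traceback.
    """
    if monitor not in metrics:
        raise KeyError(
            f'Expected key in {list(metrics.keys())}, '
            f'but got key {monitor} is not in dict')
    return metrics[monitor]


# Key names come from the reported error; the values are placeholders.
val_metrics = {
    'CocoOCRDataset/recog/word_acc': 0.0,
    'CocoOCRDataset/recog/char_recall': 0.0,
    'CocoOCRDataset/recog/char_precision': 0.0,
}

# The validation metrics contain no 'loss' entry, so the lookup fails.
try:
    check_monitor(val_metrics, 'loss')
    raised = False
except KeyError:
    raised = True
```

With `monitor='loss'` the lookup fails because the evaluator reports only accuracy-style metrics; a key that is present in the dict passes the check.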

Additional information

No response

MiXaiLL76, May 03 '23 15:05

Hi @MiXaiLL76 , thanks for your feedback. Could you help us refine the error message?

zhouzaida, May 04 '23 03:05

Same problem here: key loss is not in dict.

caj-github, Aug 25 '23 01:08

> Hi @MiXaiLL76 , thanks for your feedback. Could you help us refine the error message?

Maybe at this point (https://github.com/open-mmlab/mmengine/blob/main/mmengine/optim/scheduler/param_scheduler.py#L1505) it would be better to log a warning instead of raising? Metrics other than loss are computed only once per epoch, during validation, while the scheduler itself can be stepped outside of validation.
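A minimal sketch of the warn-instead-of-raise idea, assuming a toy reduce-on-plateau rule where lower values are better. `PlateauSketch` is a hypothetical class written for illustration; it is not mmengine's scheduler and does not reflect how a fix would have to be implemented there.

```python
import warnings


class PlateauSketch:
    """Toy reduce-on-plateau logic (hypothetical, not mmengine's class)."""

    def __init__(self, lr, monitor='loss', factor=0.5, patience=5):
        self.lr = lr
        self.monitor = monitor
        self.factor = factor
        self.patience = patience
        self.best = float('inf')  # lower is better in this sketch
        self.bad_epochs = 0

    def step(self, metrics):
        value = metrics.get(self.monitor)
        if value is None:
            # Warn and skip this step instead of raising KeyError.
            warnings.warn(
                f'Key {self.monitor!r} not found in {list(metrics)}; '
                'skipping scheduler step.')
            return self.lr
        if value < self.best:
            self.best = value
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            # No improvement for more than `patience` steps: decay the LR.
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr
```

When the monitored key is missing, the learning rate is left unchanged and training continues. The trade-off is that a misspelled `monitor` then only shows up as a warning, which is one argument for making this behaviour optional.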

MiXaiLL76 avatar Aug 07 '24 09:08 MiXaiLL76