ReduceLROnPlateau within configure_optimizers behaves abnormally
Bug description
I got the following error:
File "c:\Users\sean\miniconda3\envs\keras+torch+pl\Lib\site-packages\lightning\pytorch\loops\training_epoch_loop.py", line 459, in _update_learning_rates
raise MisconfigurationException(
lightning.fabric.utilities.exceptions.MisconfigurationException: ReduceLROnPlateau conditioned on metric val/loss which is not available. Available metrics are: ['lr-AdamW/pg1', 'lr-AdamW/pg2', 'train/a_pcc', 'train/loss']. Condition can be set using `monitor` key in lr scheduler dict
Here is the configure_optimizers function:
@final
def configure_optimizers(self):
    # Split trainable parameters into two groups (biases / norm layers vs. the rest)
    # with different learning-rate scales.
    decay, no_decay = [], []
    for name, param in self.named_parameters():
        if not param.requires_grad:
            continue
        if "bias" in name or "Norm" in name:
            no_decay.append(param)
        else:
            decay.append(param)
    grouped_params = [
        {"params": decay, "weight_decay": self.weight_decay, "lr": self.lr * 0.3},
        {
            "params": no_decay,
            "weight_decay": self.weight_decay,
            "lr": self.lr * 1.7,
        },
    ]
    optimizer = self.optimizer_class(
        grouped_params, lr=self.lr, weight_decay=self.weight_decay
    )
    scheduler = {
        "scheduler": self.lr_scheduler_class(
            optimizer, **(self.lr_scheduler_args or {})
        ),
        "monitor": "val/loss",
        "interval": "epoch",
        "frequency": 1,
        # "strict": False,
    }
    return {"optimizer": optimizer, "lr_scheduler": scheduler}
The lr_scheduler_class and lr_scheduler_args are passed in as:
lr_scheduler_class: torch.optim.lr_scheduler.ReduceLROnPlateau
lr_scheduler_args:
mode: min
factor: 0.5
patience: 10
threshold: 0.0001
threshold_mode: rel
cooldown: 5
min_lr: 1.e-9
eps: 1.e-08
(configured via YAML and the LightningCLI, which I don't think is the cause of the problem)
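For context, a minimal sketch of how the LightningModule's __init__ accepts these hyperparameters so the YAML above maps onto configure_optimizers (the attribute names mirror the code above; the class name and default values are assumptions, not my actual code):

    from typing import Any, Dict, Optional, Type

    import torch
    from lightning.pytorch import LightningModule


    class MyModule(LightningModule):  # hypothetical name
        def __init__(
            self,
            lr: float = 1e-3,
            weight_decay: float = 1e-2,
            optimizer_class: Type[torch.optim.Optimizer] = torch.optim.AdamW,
            lr_scheduler_class: Type = torch.optim.lr_scheduler.ReduceLROnPlateau,
            lr_scheduler_args: Optional[Dict[str, Any]] = None,
        ) -> None:
            super().__init__()
            # The LightningCLI / jsonargparse fills these from the YAML section above.
            self.lr = lr
            self.weight_decay = weight_decay
            self.optimizer_class = optimizer_class
            self.lr_scheduler_class = lr_scheduler_class
            self.lr_scheduler_args = lr_scheduler_args or {}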
It seems the error is raised at the end of the training epoch, because at that point the progress bar only reports train/loss. The validation epoch has not finished yet, but the scheduler is already being stepped.
I am quite sure that val/loss is available once the validation epoch finishes, because the progress bar displays it correctly.
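For reference, val/loss is logged roughly like this; this is a minimal sketch of a validation_step on the same LightningModule, not my exact code (the batch structure and loss function are assumptions):

    import torch.nn.functional as F

    def validation_step(self, batch, batch_idx):
        x, y = batch                      # assumption: (input, target) batches
        y_hat = self(x)
        loss = F.mse_loss(y_hat, y)       # assumption: a regression-style loss
        # on_epoch=True makes the epoch-aggregated value available as "val/loss"
        # in trainer.callback_metrics, which is what `monitor` keys look up.
        self.log("val/loss", loss, on_step=False, on_epoch=True, prog_bar=True)
        return loss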
What version are you seeing the problem on?
v2.5
Reproduced in studio
No response
How to reproduce the bug
Error messages and logs
Environment
More info
No response
Additionally, I also got:
c:\Users\sean\miniconda3\envs\keras+torch+pl\Lib\site-packages\lightning\pytorch\callbacks\model_checkpoint.py:384: `ModelCheckpoint(monitor='val/loss')` could not find the monitored
key in the returned metrics: ['lr-AdamW/pg1', 'lr-AdamW/pg2', 'train/a_pcc', 'train/loss', 'epoch', 'step']. HINT: Did you call `log('val/loss', value)` in the `LightningModule`?
which may indicate that these hooks are not invoked at the correct time.
I am fairly sure this message appears at the end of the training epoch, because when I set save_on_train_epoch_end: false for the checkpoint callback, I no longer receive it.
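For reference, the checkpoint callback is configured roughly like this (a sketch; only monitor and save_on_train_epoch_end reflect the settings described above, the other values are placeholders):

    from lightning.pytorch.callbacks import ModelCheckpoint

    checkpoint_cb = ModelCheckpoint(
        monitor="val/loss",
        mode="min",
        save_top_k=1,                 # placeholder
        # Leaving this unset produced the warning above; setting it to False
        # (checkpointing after validation instead of after the training epoch)
        # made the warning disappear.
        save_on_train_epoch_end=False,
    )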
Can you share the code from which you call `log('val/loss')`?
This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!