
ReduceLROnPlateau within configure_optimizers behaves abnormally

Open SeanZhang99 opened this issue 7 months ago • 2 comments

Bug description

I got this error:

  File "c:\Users\sean\miniconda3\envs\keras+torch+pl\Lib\site-packages\lightning\pytorch\loops\training_epoch_loop.py", line 459, in _update_learning_rates
    raise MisconfigurationException(
lightning.fabric.utilities.exceptions.MisconfigurationException: ReduceLROnPlateau conditioned on metric val/loss which is not available. Available metrics are: ['lr-AdamW/pg1', 'lr-AdamW/pg2', 'train/a_pcc', 'train/loss']. Condition can be set using `monitor` key in lr scheduler dict

Here is the configure_optimizers function:

    @final
    def configure_optimizers(self):

        decay, no_decay = [], []
        for name, param in self.named_parameters():
            if not param.requires_grad:
                continue
            if "bias" in name or "Norm" in name:
                no_decay.append(param)
            else:
                decay.append(param)

        grouped_params = [
            {"params": decay, "weight_decay": self.weight_decay, "lr": self.lr * 0.3},
            {
                "params": no_decay,
                "weight_decay": 0.0,  # bias/Norm params should not be decayed
                "lr": self.lr * 1.7,
            },
        ]

        optimizer = self.optimizer_class(
            grouped_params, lr=self.lr, weight_decay=self.weight_decay
        )

        # Build the scheduler once and wrap it in Lightning's lr_scheduler dict
        scheduler = {
            "scheduler": self.lr_scheduler_class(
                optimizer, **(self.lr_scheduler_args or {})
            ),
            "monitor": "val/loss",
            "interval": "epoch",
            "frequency": 1,
            # "strict": False,
        }
        return {"optimizer": optimizer, "lr_scheduler": scheduler}

The lr_scheduler_class is passed in as

  lr_scheduler_class: torch.optim.lr_scheduler.ReduceLROnPlateau
  lr_scheduler_args:
    mode: min
    factor: 0.5
    patience: 10
    threshold: 0.0001
    threshold_mode: rel
    cooldown: 5
    min_lr: 1.e-9
    eps: 1.e-08

(I configure this via YAML and the LightningCLI, which I don't think is the cause here.)
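
Assuming the CLI resolves the class path as usual, the YAML above amounts to instantiating the scheduler directly:

    from torch.optim.lr_scheduler import ReduceLROnPlateau

    # What lr_scheduler_class(optimizer, **lr_scheduler_args) expands to
    scheduler = ReduceLROnPlateau(
        optimizer,
        mode="min",
        factor=0.5,
        patience=10,
        threshold=0.0001,
        threshold_mode="rel",
        cooldown=5,
        min_lr=1e-9,
        eps=1e-8,
    )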

It seems the error is raised at the end of the training epoch, since the progress bar only shows train/loss at that point. The validation epoch has not finished yet, but the scheduler is already being called.

I am quite sure that val/loss is available once the validation epoch finishes, because the progress bar displays it correctly.
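
One thing worth checking (an assumption on my part, not confirmed from the logs): if validation runs less often than every training epoch, the scheduler will try to step before val/loss has ever been logged. Aligning the Trainer's validation cadence with the scheduler dict's frequency avoids that, e.g.:

    import lightning as L

    # If validation only runs every N epochs, set "frequency": N in the
    # scheduler dict so it never steps on a metric that was not logged.
    trainer = L.Trainer(check_val_every_n_epoch=1)  # matches frequency=1 above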

What version are you seeing the problem on?

v2.5

Reproduced in studio

No response

How to reproduce the bug


Error messages and logs


Environment

No response

More info

No response

SeanZhang99 avatar May 15 '25 11:05 SeanZhang99

Additionally, I also get this warning:

  c:\Users\sean\miniconda3\envs\keras+torch+pl\Lib\site-packages\lightning\pytorch\callbacks\model_checkpoint.py:384: `ModelCheckpoint(monitor='val/loss')` could not find the monitored key in the returned metrics: ['lr-AdamW/pg1', 'lr-AdamW/pg2', 'train/a_pcc', 'train/loss', 'epoch', 'step']. HINT: Did you call `log('val/loss', value)` in the `LightningModule`?

which may indicate that these hooks are not being invoked at the right time.

I am fairly sure this message appears at the end of the training epoch, because when I set save_on_train_epoch_end: false for the checkpoint callback, the message goes away.
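
For reference, a minimal sketch of that callback configuration in code (other arguments omitted):

    from lightning.pytorch.callbacks import ModelCheckpoint

    # Defer checkpointing until after validation, so "val/loss" already
    # exists when the monitor is evaluated.
    checkpoint = ModelCheckpoint(
        monitor="val/loss",
        mode="min",
        save_on_train_epoch_end=False,
    )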

SeanZhang99 avatar May 15 '25 11:05 SeanZhang99

Can you share the code from which you call `log('val/loss', ...)`?

maoragai avatar May 31 '25 17:05 maoragai

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions - the Lightning Team!

stale[bot] avatar Jul 19 '25 06:07 stale[bot]