pytorch-lightning Stepwise LR-Scheduler not working across epochs

Bug description

Description

I'm training a model for a fixed number of iterations rather than a fixed number of epochs. The same model trains on datasets of different sizes, so the number of iterations per epoch varies. Say I want to train a model for 900 iterations, which corresponds to 90 epochs on one of the datasets, and apply a stepwise LR scheduler at iterations 300 and 600. To my understanding, this is not natively possible in PyTorch Lightning. I know that I can change the LR scheduler interval to "step" and then set the frequency, like so:

'lr_scheduler': {"scheduler": sched, "interval": "step", "frequency": 300}

However, this only counts steps within a single epoch. If I set the frequency larger than the number of iterations per epoch, no scheduler step is ever applied. I would expect scheduler.step() to be called every n steps, where n is the frequency, counted across epoch boundaries.

What version are you seeing the problem on?

v2_0

How to reproduce the bug

import os

import torch
from torch.utils.data import DataLoader, Dataset
from pytorch_lightning import LightningModule, Trainer


class RandomDataset(Dataset):
    def __init__(self, size, length):
        self.len = length
        self.data = torch.randn(length, size)

    def __getitem__(self, index):
        return self.data[index]

    def __len__(self):
        return self.len


class BoringModel(LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def forward(self, x):
        return self.layer(x)

    def training_step(self, batch, batch_idx):
        loss = self(batch).sum()

        for param_group in self.optimizers().optimizer.param_groups:
            lr = param_group['lr']
        self.log('lr', lr, prog_bar=True, on_step=True, on_epoch=False)
        return {"loss": loss}

    def configure_optimizers(self):
        opt = torch.optim.SGD(self.layer.parameters(), lr=0.1)
        scheduler = torch.optim.lr_scheduler.StepLR(opt, 1)
        return {"optimizer": opt, 'lr_scheduler': {"scheduler": scheduler,
                                                   "interval": "step",
                                                   "frequency": 10}}


def run():
    train_data = DataLoader(RandomDataset(32, 32), batch_size=8)

    model = BoringModel()
    trainer = Trainer(
        accelerator='cpu',
        default_root_dir=os.getcwd(),
        num_sanity_val_steps=0,
        max_epochs=-1,
        max_steps=30,
        log_every_n_steps=1
    )
    trainer.fit(model, train_dataloaders=train_data)


if __name__ == "__main__":
    run()

May 02 '23 11:05 maltesilber

same problem

Jun 09 '23 08:06 z13670

Solved it using the LambdaLR scheduler. First define a function that returns the multiplicative factor for your LR schedule (LambdaLR multiplies the optimizer's initial lr by this value, so the base lr must not be included in it):

def step_decay(step_size, gamma):
    def fn(step):
        # Factor applied to the initial lr after `step` scheduler steps.
        return gamma ** (step // step_size)
    return fn

And configure the optimizer so the function gets called on every step:

def configure_optimizers(self):
    lr = 0.5
    optimizer = torch.optim.SGD(self.layer.parameters(), lr=lr)
    scheduler = torch.optim.lr_scheduler.LambdaLR(
        optimizer=optimizer,
        lr_lambda=step_decay(step_size=10, gamma=0.1),
    )
    # 'interval': 'step' makes Lightning call scheduler.step() after every optimizer step.
    return [optimizer], [{'scheduler': scheduler, 'interval': 'step'}]
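
For the schedule from the original question (decay at iterations 300 and 600), the same idea can be written with explicit milestones. This is only a sketch: milestone_factor is a hypothetical helper, not part of PyTorch or Lightning, and it assumes the scheduler is stepped once per optimizer step ('interval': 'step'), so the argument it receives is the global step count. It returns a multiplier, since LambdaLR scales the optimizer's initial lr by the returned value.

```python
from bisect import bisect_right


def milestone_factor(milestones, gamma):
    """Return a multiplier function suitable as lr_lambda (hypothetical helper).

    The multiplier is gamma ** k, where k is the number of milestones
    that the global step count has already reached.
    """
    milestones = sorted(milestones)

    def fn(step):
        # bisect_right counts how many milestones are <= step.
        return gamma ** bisect_right(milestones, step)

    return fn


# Decay by 10x at global steps 300 and 600, as in the original question.
factor = milestone_factor(milestones=[300, 600], gamma=0.1)
```

Passing factor as lr_lambda to torch.optim.lr_scheduler.LambdaLR in place of step_decay(...) should give the 300/600 schedule regardless of how many iterations make up one epoch.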

Jul 25 '23 13:07 maltesilber

I think it is reasonable to ask for the frequency parameter to apply across epoch boundaries. This is an easy change in this line of code: https://github.com/Lightning-AI/lightning/blob/af852ff5908e9a99917eeeff05bb4536dbb1cade/src/lightning/pytorch/loops/training_epoch_loop.py#L363

There, self.batch_idx would have to be changed to self.total_batch_idx, that's all. Anyone from the community is free to contribute this change.
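
To illustrate the difference between the two indices, here is a small pure-Python simulation. It is a sketch, not Lightning's code: it assumes the frequency check has the form (index + 1) % frequency == 0, and scheduler_step_points is a hypothetical helper written for this illustration. It uses the numbers from the reproduction script (4 batches per epoch, frequency 10, 30 steps).

```python
def scheduler_step_points(batches_per_epoch, frequency, max_steps, use_global_index):
    """Return the global step counts at which the assumed frequency check fires."""
    fired = []
    global_idx = 0  # counts batches across all epochs
    while global_idx < max_steps:
        for batch_idx in range(batches_per_epoch):  # resets at every epoch
            if global_idx >= max_steps:
                break
            idx = global_idx if use_global_index else batch_idx
            # Assumed form of the check; the real implementation may differ.
            if (idx + 1) % frequency == 0:
                fired.append(global_idx + 1)
            global_idx += 1
    return fired


# 32 samples with batch size 8 gives 4 batches per epoch.
per_epoch = scheduler_step_points(4, 10, 30, use_global_index=False)
across_epochs = scheduler_step_points(4, 10, 30, use_global_index=True)
print(per_epoch)      # []
print(across_epochs)  # [10, 20, 30]
```

With the per-epoch index, the counter never reaches 10 before it resets, so the scheduler is never stepped; with a global index it fires every 10 steps.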

Nov 25 '23 23:11 awaelchli