Cosine decay schedule changes with different accumulate_grad_batches values
When using the accumulate_grad_batches flag in a PyTorch Lightning Trainer, I noticed that it changes how my scheduler's LR evolves. I am using TIMM's cosine decay scheduler. When I train on 1 GPU I set accumulate_grad_batches = 4, and when I train on 4 GPUs I set accumulate_grad_batches = 1. The LR scheduler (logged in wandb) then showed different LR curves during training: with accumulate_grad_batches = 4, the LR decayed faster.
My understanding is that the schedule should depend only on the epoch number, not on the batch size or the accumulate_grad_batches value. Has anyone else noticed something similar?
I am using PTL 1.5, torch 1.11, and TIMM 0.5.4. I am training on videos (HMDB) using the pytorchvideo library as well.
Thanks!
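
For reference, the two setups I'm comparing look roughly like this (a simplified sketch, not my exact training script; the model and data are omitted):

```python
import pytorch_lightning as pl

# Setup A: 1 GPU, gradients accumulated over 4 batches
trainer_a = pl.Trainer(gpus=1, accumulate_grad_batches=4)

# Setup B: 4 GPUs, no gradient accumulation
trainer_b = pl.Trainer(gpus=4, accumulate_grad_batches=1)
```

Since the effective batch size per optimizer step is the same in both cases, I expected the LR curves to match.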
@exnx the scheduler will only depend on the epoch number if you step it once per epoch; if you use the per-step LR update, it'll change with the number of steps ... and since you're using PTL, it's likely dependent on how PTL is driving the scheduler...
There are some known limitations of the current timm schedulers, including the need for better handling of per-update-step vs per-epoch scheduling, but the issue here is likely just a mismatch between PTL's and timm's expectations.
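
As a rough illustration (all the numbers and the training-loop shape below are made up, not taken from your run), here is how the two stepping conventions behave with timm's CosineLRScheduler, and why the per-update one is sensitive to anything that changes the step count per epoch:

```python
import torch
from timm.scheduler import CosineLRScheduler

# Toy model/optimizer just to have param groups to schedule.
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

epochs = 10
batches_per_epoch = 100
accumulate_grad_batches = 4   # -> 25 optimizer updates per epoch instead of 100

per_update = True  # flip to compare the two conventions

if per_update:
    # t_initial counted in optimizer updates; the schedule advances on every
    # step_update() call, so it moves with the number of updates actually taken.
    scheduler = CosineLRScheduler(
        optimizer,
        t_initial=epochs * batches_per_epoch // accumulate_grad_batches,
        lr_min=1e-5,
        t_in_epochs=False)
else:
    # t_initial counted in epochs; the LR depends only on the epoch value passed
    # to step(), so accumulation / GPU count cannot affect the curve.
    scheduler = CosineLRScheduler(optimizer, t_initial=epochs,
                                  lr_min=1e-5, t_in_epochs=True)

num_updates = 0
for epoch in range(epochs):
    for batch_idx in range(batches_per_epoch):
        # ... forward / backward ...
        if (batch_idx + 1) % accumulate_grad_batches == 0:
            optimizer.step()
            optimizer.zero_grad()
            num_updates += 1
            if per_update:
                scheduler.step_update(num_updates)  # per-update stepping
    if not per_update:
        scheduler.step(epoch + 1)                   # per-epoch stepping
```

If PTL ends up calling the scheduler once per batch rather than once per optimizer update, or if t_initial is computed with a different assumption about updates per epoch than the run actually produces, the curve will advance at a different rate between your two setups, which would match what you're seeing in wandb.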