Add common LR schedulers
Feature description
Right now we support LR schedulers but only provide a constant schedule or the Noam schedule.
Would be nice if we added a couple of common/popular LR schedulers:
- [ ] Step LR
- [x] Cosine LR
- [x] Exponential LR
- [x] Linear LR
- [ ] Sequential LR schedulers (useful for combining things like warmup + cosine or linear decrease schedule)
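For reference, the usual closed forms behind these schedules look roughly like the sketch below (step/epoch indexing conventions vary between implementations, so treat this as illustrative only):

```rust
// Sketch of the standard schedule formulas (indexing conventions may differ
// from any particular implementation).
fn step_lr(initial: f64, gamma: f64, step_size: usize, step: usize) -> f64 {
    // Decay by `gamma` once every `step_size` steps.
    initial * gamma.powi((step / step_size) as i32)
}

fn exponential_lr(initial: f64, gamma: f64, step: usize) -> f64 {
    // Decay by `gamma` every step.
    initial * gamma.powi(step as i32)
}

fn linear_lr(initial: f64, final_lr: f64, step: usize, num_steps: usize) -> f64 {
    // Linear interpolation from `initial` to `final_lr` over `num_steps`.
    let t = (step as f64 / num_steps as f64).min(1.0);
    initial + t * (final_lr - initial)
}

fn cosine_lr(initial: f64, min_lr: f64, step: usize, num_steps: usize) -> f64 {
    // Cosine annealing from `initial` down to `min_lr` over `num_steps`.
    let t = (step as f64 / num_steps as f64).min(1.0);
    min_lr + 0.5 * (initial - min_lr) * (1.0 + (std::f64::consts::PI * t).cos())
}
```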
Hi, I happen to know these schedulers a little bit, so I could take this on.
I could have a go at the step, exponential, and cosine ones (maybe with some variants, like annealing), since the linear one is already merged.
@rubenjr0 Yes, please go ahead.
Hi,
I have a question about the Step LR scheduler, which I'm making a PR for. Say the initial learning rate and gamma used to create the config instance are 0.5 and 0.1 respectively; should the first value LrScheduler::step returns be 0.5 or 0.05?
The reason I'm asking is that the following schedulers never return the initial_lr value set in the config:
Exponential LR
Linear LR
Cosine LR
I read the code, and it seems that the only way to obtain a learning rate is to call the step method. The first time step gets called, it uses the initial value to compute and return the next learning rate (I am new to Burn, so please correct me if I am missing something obvious).
Given the field name initial_lr, I think it's natural to expect that it will be used as the first learning rate value, so I am inclined to choose 0.5 as the answer to my question, but I am hesitant to diverge from what the existing code does.
Any advice would be greatly appreciated.
@towerpark After taking a look at the code it looks like the first step call would return 0.05.
In the tests for ExponentialLR we can see how the first step returns a learning rate smaller than the initial one:
let mut scheduler = ExponentialLrSchedulerConfig::new(INITIAL_LR, GAMMA).init();
let mut previous_lr = INITIAL_LR;
for _ in 0..NUM_ITERS {
    let lr = LrScheduler::<TestBackend>::step(&mut scheduler);
    assert!(
        lr < previous_lr,
        "Learning rate should decrease with each iteration before reaching the final learning rate"
    );
    previous_lr = lr;
}
I agree that having the first step return the initial learning rate is a bit more intuitive. I merely looked at how previously implemented schedulers were working (like the Linear scheduler) to keep things consistent in the Exponential and Cosine schedulers.
Perhaps @laggui or @antimora could chime in? If they think it'd be worth it to change the behavior of the schedulers to return the initial learning rate on the first step, I'd be open to making a new PR 😄
@rubenjr0
I do think following existing code for consistency is important, and that's why I'm here asking :)
Hmmm I think I would agree that the schedule should start with the initial LR. Right now we're essentially skipping the initial step.
We could possibly add a scheduler.lr() function to the scheduler trait which would return the current LR. That way, the initial LR can be retrieved before the first scheduler.step(). And if we keep the .step() as it is (which returns the updated state), then the changes won't break the current usage.
Open to your suggestions as well 🙂
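For illustration, the suggestion could look roughly like this (a simplified sketch, not Burn's exact trait definition):

```rust
// Simplified sketch of the suggestion (not Burn's exact trait definition).
trait LrScheduler {
    /// Return the current LR without advancing the schedule,
    /// so the initial LR is readable before the first `step()`.
    fn lr(&self) -> f64;

    /// Advance the schedule and return the updated LR (unchanged behavior).
    fn step(&mut self) -> f64;
}
```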
We could possibly add a scheduler.lr() function to the scheduler trait which would return the current LR. That way, the initial LR can be retrieved before the first scheduler.step(). And if we keep the .step() as it is (which returns the updated state), then the changes won't break the current usage.
I think that is a graceful way to fix the issue for the Cosine, Exponential, and Linear LRs, but the problem is that step() of the Noam LR currently does return the first learning rate value when it gets called for the first time.
It looks like something has to be broken if we want to change the current behavior. Maybe choose the less widely used side to break if we are really going to do this?
I think doing something like:
let curr = self.lr;
self.lr = todo!(); // compute the updated LR here (scheduler-specific)
return curr;
On every scheduler would be a simple fix, although not a very elegant one. Nothing would really break, other than reproducibility, right?
I might be wrong. Will take a look and experiment a bit as soon as I can.
I don't necessarily mind breaking the current usage, as long as it is valid. The suggested solution to update the state but return the previous state in .step() doesn't feel very intuitive.
I think that is a graceful way to fix the issue for the Cosine, Exponential, and Linear LRs, but the problem is that step() of the Noam LR currently does return the first learning rate value when it gets called for the first time.
You're right, though this could probably be fixed by initializing the step at 1.0 instead of 0.0 (the value isn't valid at zero anyway).
I think the other approach would be to initialize all schedulers such that the first step returns the initial value. That's probably a better approach imo 🤔
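For instance, a minimal sketch of that approach for the exponential scheduler (hypothetical field names, not the actual Burn implementation):

```rust
// Minimal sketch: make the first `step()` return `initial_lr`
// (hypothetical names, not the actual Burn implementation).
struct ExponentialLrScheduler {
    lr: f64,    // the value the *next* `step()` call will return
    gamma: f64, // multiplicative decay factor
}

impl ExponentialLrScheduler {
    fn new(initial_lr: f64, gamma: f64) -> Self {
        // Pre-divide so the first `step()` multiplication yields `initial_lr`.
        Self { lr: initial_lr / gamma, gamma }
    }

    fn step(&mut self) -> f64 {
        self.lr *= self.gamma;
        self.lr
    }
}
```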
I second making all the .step() methods return the initial value on the first call if we're going to make the behavior consistent across all the schedulers, because:
- Pros: It provides a unified way to get all learning rate values, i.e., just call .step(); there is no need to change the training loop code to fetch the initial value via a separate method; and it is not hard to implement.
- Cons: It is a breaking change for the Cosine, Exponential, and Linear schedulers, but users can still reproduce the current learning rate sequence by calling .step() once to discard the first value before passing the scheduler to LearnerBuilder::build() (sketched below).
UPDATE: Clarification.
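For concreteness, the workaround in the cons item would amount to something like this (a sketch, reusing the call style from the test snippet above):

```rust
// Sketch of the workaround: discard the first value to recover the old sequence.
let _ = LrScheduler::<TestBackend>::step(&mut scheduler);
// ... then pass `scheduler` to LearnerBuilder::build() as before.
```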
Thanks for breaking down the pros and cons!
I think this is a good place to start if you want to tackle this change 🙂
I would be happy to make a PR for this change :) I will start working on it in the next few days.
I think I'll try my hand at implementing the sequential LR scheduler.
Is it correct to assume "Sequential LR schedulers" is just changing the scheduler used depending on the epoch?
Is it correct to assume "Sequential LR schedulers" is just changing the scheduler used depending on the epoch?
Yeah! It's just a way to combine different LR schedules based on some milestone steps.
A simple example would be to use a linear LR warmup for steps 0 to N, followed by a cosine LR from N until the end.
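A rough sketch of the idea (hypothetical types; the trait here is a stand-in, not Burn's actual LrScheduler):

```rust
// Rough sketch of a sequential scheduler (hypothetical types; the trait is a
// stand-in for Burn's LrScheduler). It delegates to `warmup` for the first
// `milestone` steps, then to `main` for the rest of training.
trait Schedule {
    fn step(&mut self) -> f64;
}

struct Sequential<W: Schedule, M: Schedule> {
    warmup: W,
    main: M,
    milestone: usize,
    current: usize, // number of steps taken so far
}

impl<W: Schedule, M: Schedule> Schedule for Sequential<W, M> {
    fn step(&mut self) -> f64 {
        self.current += 1;
        if self.current <= self.milestone {
            self.warmup.step()
        } else {
            self.main.step()
        }
    }
}
```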
Gotcha gotcha, thank you for the clarification. One more thing: would we want to reinitialize the learning rate when the scheduler is changed, or just pass the current learning rate to the next scheduler?
I have a question about what should be included when saving a scheduler into a record: should I save all the struct fields, or only the fields that change across steps during training? The latter is what the Noam and constant schedulers do, while the rest of the schedulers go with the former.

E.g., the exponential scheduler has two fields: the current learning rate and gamma. At each step, the learning rate is updated by multiplying by gamma before being returned. So my question is: should gamma, which remains unchanged during training, be saved into a record?
My first thought is that saving all fields feels more intuitive, and it seems that the .into_record() methods generated by deriving the Module trait do so too. However, I cannot find any sources in the Book or issues that back up the idea.
Actually, this leads me to a broader question: How are records supposed to be used?
- (a) The state of the training environment can be reconstructed with records alone. If this is true, then all fields have to be saved.
- (b) Records are supposed to be combined with the same initial settings of the training environment to reconstruct the saved state. If so, saving only fields that change during training is enough.
I know saving all fields works both ways, but it would be great to have a better understanding of the record system. Could you please give me some guidance?
Normally records are supposed to be combined with hyper-parameters, which are supposed to be training configs.
Gotcha gotcha, thank you for the clarification. One more thing: would we want to reinitialize the learning rate when the scheduler is changed, or just pass the current learning rate to the next scheduler?
The next scheduler is typically initialized from the current state (i.e., the final learning rate given by the previous scheduler). A visual example for a linear LR warmup followed by another schedule:

Normally records are supposed to be combined with hyper-parameters, which are supposed to be training configs.
Thank you for addressing my question.
Since I need to touch the .to_record() and .load_record() methods in my PR that changes the first value of those three LR schedulers, I will make their behavior consistent with the other schedulers by saving only the fields that change across steps.
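Concretely, that could look something like this (hypothetical names and record type, just to sketch the idea):

```rust
// Sketch of the idea (hypothetical names and record type): only `lr`, the
// field that changes across steps, is persisted; `gamma` is expected to be
// reconstructed from the training config on reload.
struct ExponentialLrScheduler {
    lr: f64,
    gamma: f64, // comes from the config, so it is not part of the record
}

impl ExponentialLrScheduler {
    fn to_record(&self) -> f64 {
        self.lr
    }

    fn load_record(mut self, record: f64) -> Self {
        self.lr = record;
        self
    }
}
```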