transformers
transformers copied to clipboard
Correct zero division error in inverse sqrt scheduler
What does this PR do?
Fixes #28835
Before submitting
- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [X] Did you read the contributor guideline, Pull Request section?
- [X] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [ ] Did you write any new necessary tests?
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
Hi @DavidAfonsoValente, thanks for opening this PR!
Could you give some more context about the issue this resolves, ideally with a reproducible snippet?
Just looking at the PR, it implies that timescale is 0, which I don't think should ever be the case.
I corrected the description link to the issue, after testing it seems like both the timescale and (current_step + shift) can possibly be zero, leading to the zero division error
Hi @DavidAfonsoValente, thanks for linking to the relevant issue. I still have an outstanding question about how this can occur i.e. when is timescale 0?
cc @muellerzr who can provide some more context over the behaviour
@DavidAfonsoValente reading https://github.com/huggingface/transformers/pull/21495, correct me if I'm wrong but the whole scheduler works off a timescale which is equal to the number of warmup steps. Which, at least in my understanding, means that we can't have a timescale of 0 and thus should raise a NotImplementedError directing the user to ensure they have a number of warmup steps set in the scheduler, no?
cc @Sangh0
Yes, this problem occurs when num_warmup_steps is 0, the check that is made is to ensure that num_warmup_steps is not None so it goes through. However most of the training examples provided with get_scheduler initiate the scheduler with num_warmup_steps = 0. One other possible correction could be defaulting the timescale to 10_000 as it is done in : https://github.com/google-research/big_vision/blob/fd2d3bd2efc9d89ea959f16cd2f58ae8a495cd44/big_vision/configs/proj/clippo/train_clippo.py#L144 https://github.com/google-research/big_vision/blob/6ff6d080d62c1f47e2e4eeb8b6474deb38dfe406/big_vision/configs/proj/scaling_laws/train_vit_g.py#L79 I believe maybe the current implementation came from interpreting the original implementation as having timescale==num_warmup_steps however a more accurate implementation could be one where these both default to 10_000, what do you think?
@muellerzr Thank you! I'm glad this issue has been resolved well.
Should I default timescale to 10_000 instead of the current solution?
@muellerzr is off atm, so we'll have to wait from him to confirm. From what I understand, yes, let's a better default for timescale. For backwards compatibility, I'd suggest default to 10_000 it num_warmup_step is 0 and timescale is not set.
Agree with Amy here!
@DavidAfonsoValente can you rebase from main and push with -f (to not have the git commit history/diff all crazy) so we can double check tests are green? 🤗 Thanks!
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Great, thank you!