transformers icon indicating copy to clipboard operation
transformers copied to clipboard

Correct zero division error in inverse sqrt scheduler

Open DavidAfonsoValente opened this issue 1 year ago • 8 comments

What does this PR do?

Fixes #28835

Before submitting

  • [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [X] Did you read the contributor guideline, Pull Request section?
  • [X] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
  • [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
  • [ ] Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

DavidAfonsoValente avatar Feb 12 '24 17:02 DavidAfonsoValente

Hi @DavidAfonsoValente, thanks for opening this PR!

Could you give some more context about the issue this resolves, ideally with a reproducible snippet?

Just looking at the PR, it implies that timescale is 0, which I don't think should ever be the case.

amyeroberts avatar Feb 12 '24 20:02 amyeroberts

I corrected the description link to the issue, after testing it seems like both the timescale and (current_step + shift) can possibly be zero, leading to the zero division error

DavidAfonsoValente avatar Feb 12 '24 22:02 DavidAfonsoValente

Hi @DavidAfonsoValente, thanks for linking to the relevant issue. I still have an outstanding question about how this can occur i.e. when is timescale 0?

cc @muellerzr who can provide some more context over the behaviour

amyeroberts avatar Feb 13 '24 18:02 amyeroberts

@DavidAfonsoValente reading https://github.com/huggingface/transformers/pull/21495, correct me if I'm wrong but the whole scheduler works off a timescale which is equal to the number of warmup steps. Which, at least in my understanding, means that we can't have a timescale of 0 and thus should raise a NotImplementedError directing the user to ensure they have a number of warmup steps set in the scheduler, no?

muellerzr avatar Feb 13 '24 19:02 muellerzr

cc @Sangh0

muellerzr avatar Feb 13 '24 19:02 muellerzr

Yes, this problem occurs when num_warmup_steps is 0, the check that is made is to ensure that num_warmup_steps is not None so it goes through. However most of the training examples provided with get_scheduler initiate the scheduler with num_warmup_steps = 0. One other possible correction could be defaulting the timescale to 10_000 as it is done in : https://github.com/google-research/big_vision/blob/fd2d3bd2efc9d89ea959f16cd2f58ae8a495cd44/big_vision/configs/proj/clippo/train_clippo.py#L144 https://github.com/google-research/big_vision/blob/6ff6d080d62c1f47e2e4eeb8b6474deb38dfe406/big_vision/configs/proj/scaling_laws/train_vit_g.py#L79 I believe maybe the current implementation came from interpreting the original implementation as having timescale==num_warmup_steps however a more accurate implementation could be one where these both default to 10_000, what do you think?

DavidAfonsoValente avatar Feb 13 '24 21:02 DavidAfonsoValente

@muellerzr Thank you! I'm glad this issue has been resolved well.

Sangh0 avatar Feb 16 '24 08:02 Sangh0

Should I default timescale to 10_000 instead of the current solution?

DavidAfonsoValente avatar Feb 16 '24 21:02 DavidAfonsoValente

@muellerzr is off atm, so we'll have to wait from him to confirm. From what I understand, yes, let's a better default for timescale. For backwards compatibility, I'd suggest default to 10_000 it num_warmup_step is 0 and timescale is not set.

amyeroberts avatar Feb 19 '24 19:02 amyeroberts

Agree with Amy here!

muellerzr avatar Feb 29 '24 20:02 muellerzr

@DavidAfonsoValente can you rebase from main and push with -f (to not have the git commit history/diff all crazy) so we can double check tests are green? 🤗 Thanks!

muellerzr avatar Mar 01 '24 12:03 muellerzr

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Great, thank you!

DavidAfonsoValente avatar Mar 01 '24 17:03 DavidAfonsoValente