transformers
lr_scheduler not updated when auto_find_batch_size set to True and batch_size decays
System Info
- `transformers` version: 4.26.0
- Platform: Linux-4.18.0-372.26.1.el8_6.x86_64-x86_64-with-glibc2.28
- Python version: 3.9.16
- Huggingface_hub version: 0.12.0
- PyTorch version (GPU?): 1.12.1 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
@sgugger
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [X] My own task or dataset (give details below)
Reproduction
Issue:
When auto_find_batch_size=True is set in TrainingArguments and the batch size decays because batches can't fit in memory, the learning rate scheduler is not updated to match the new number of training steps.
In my case, I fine-tune bert-base-cased on a custom dataset using:
from transformers import TrainingArguments

training_args = TrainingArguments(
per_device_train_batch_size=512,
auto_find_batch_size=True,
num_train_epochs=3
)
My batch_size decays three times, and the learning rate decays to zero before the end of the first training epoch.
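To make the symptom concrete, here is a toy sketch of the arithmetic (dataset size and learning rate are made up for illustration, and it assumes each retry halves the batch size, as three decays from 512 suggest). A linear scheduler built for the original batch size's step count hits zero long before the real, longer run finishes its first epoch:

```python
# Toy illustration (made-up numbers, not Trainer code) of why the lr hits
# zero before the first epoch ends when the scheduler is not rebuilt.

def linear_lr(step, max_steps, base_lr=5e-5):
    # Default linear schedule: base_lr at step 0, zero at max_steps.
    return base_lr * max(0.0, (max_steps - step) / max_steps)

dataset_size, epochs = 4096, 3

# Scheduler was built for the initial per_device_train_batch_size of 512:
stale_max_steps = (dataset_size // 512) * epochs    # 24 optimizer steps

# After three halvings (512 -> 256 -> 128 -> 64), training actually runs:
true_max_steps = (dataset_size // 64) * epochs      # 192 optimizer steps

steps_per_epoch = true_max_steps // epochs          # 64 steps in epoch 1
print(linear_lr(steps_per_epoch, stale_max_steps))  # 0.0: lr already zero before epoch 1 ends
print(linear_lr(steps_per_epoch, true_max_steps))   # still positive with a rebuilt scheduler
```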
Trace:
Line 1162 of trainer.py reads if self.lr_scheduler is None:. On the first call it evaluates to True, but when the batch size decays and the method runs again, it evaluates to False, which prevents the lr_scheduler from being recreated on line 1163.
I think we could replace it with:
if self.args.optimizers[1] is None:
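A minimal sketch of the failure mode, using a toy class (the names here are hypothetical, not the real Trainer internals): guarding on `is None` makes scheduler creation a one-shot, whereas guarding on whether the user passed their own scheduler lets each batch-size retry rebuild it with the new step count:

```python
# Toy model of the guard: create_scheduler runs once per training attempt,
# and auto_find_batch_size retries training with a new max_steps.

class ToyTrainer:
    def __init__(self, user_scheduler=None):
        # Analogue of passing Trainer(optimizers=(optimizer, scheduler)).
        self.user_scheduler = user_scheduler
        self.lr_scheduler = None
        self.max_steps = None

    def create_scheduler_buggy(self, num_training_steps):
        # Mirrors `if self.lr_scheduler is None:` -- only the first call acts.
        if self.lr_scheduler is None:
            self.lr_scheduler = ("linear", num_training_steps)
            self.max_steps = num_training_steps

    def create_scheduler_fixed(self, num_training_steps):
        # Mirrors the suggestion: rebuild unless the user supplied a scheduler.
        if self.user_scheduler is None:
            self.lr_scheduler = ("linear", num_training_steps)
            self.max_steps = num_training_steps

t = ToyTrainer()
t.create_scheduler_buggy(24)    # first attempt, batch_size=512
t.create_scheduler_buggy(192)   # retry after the batch size decays
print(t.max_steps)              # 24: the scheduler is stale

t2 = ToyTrainer()
t2.create_scheduler_fixed(24)
t2.create_scheduler_fixed(192)
print(t2.max_steps)             # 192: the scheduler tracks the new horizon
```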
Expected behavior
If max_steps changes because the batch_size decays, the lr_scheduler should be updated to reflect this change.
Using the default linear lr_scheduler, I expect the learning rate to go from its initial value at the beginning of training to zero at the end of training.
cc @muellerzr
@muellerzr I would like to pick up this issue and fix it. I'm looking to write a failing test case for this bug; any pointers?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
It's been a while, @muellerzr. Could you have a look?
@raghavanone @thomas-schillaci could you try building from main and seeing if that fixes the issue? I think https://github.com/huggingface/transformers/pull/24521 fixed this
I'm not sure this fixes the issue, I'm going to comment on the PR directly
Same here. I built it from main.
Just ran into this issue today; I can confirm it still exists.
Hi all, try reinstalling from main, #24758 should have fixed it this time
Hello @muellerzr, it works for me, thank you for the fix!