lr_scheduler not updated when auto_find_batch_size set to True and batch_size decays

System Info

  • transformers version: 4.26.0
  • Platform: Linux-4.18.0-372.26.1.el8_6.x86_64-x86_64-with-glibc2.28
  • Python version: 3.9.16
  • Huggingface_hub version: 0.12.0
  • PyTorch version (GPU?): 1.12.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@sgugger

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Issue:

When auto_find_batch_size=True is set in TrainingArguments and the batch_size decays because batches can't fit in memory, the learning rate scheduler is not updated to reflect the new number of training steps.

In my case, I fine-tune bert-base-cased on a custom dataset using:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",  # output_dir is a required argument
    per_device_train_batch_size=512,
    auto_find_batch_size=True,
    num_train_epochs=3,
)

My batch_size decays three times, and the learning rate decays to zero before the end of the first training epoch.
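
For concreteness, assuming the batch-size finder halves the batch size on each retry (the default behavior of accelerate's find_executable_batch_size): 512 → 256 → 128 → 64 means roughly 8x as many optimizer steps per epoch as the scheduler was built for, so a linear schedule computed for 3 epochs at batch size 512 reaches zero roughly 3/8 of the way through the first epoch at batch size 64.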

Trace:

Line 1162 of trainer.py reads if self.lr_scheduler is None:. On the first call this evaluates to True, but when the batch_size decays the method is called again; this time the check evaluates to False, which prevents the lr_scheduler from being recreated on line 1163.

I think we could replace it with: if self.args.optimizers[1] is None:
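
For reference, here is a paraphrased sketch of the guard in question (based on Trainer.create_scheduler around transformers 4.26, not the verbatim source):

# Paraphrased sketch of Trainer.create_scheduler, not the verbatim source
def create_scheduler(self, num_training_steps, optimizer=None):
    if self.lr_scheduler is None:  # line 1162: True only on the first call
        self.lr_scheduler = get_scheduler(
            self.args.lr_scheduler_type,
            optimizer=self.optimizer if optimizer is None else optimizer,
            num_warmup_steps=self.args.get_warmup_steps(num_training_steps),
            num_training_steps=num_training_steps,
        )
    # On the retry triggered by auto_find_batch_size, lr_scheduler is no longer None,
    # so it keeps the num_training_steps computed for the original batch size.
    return self.lr_scheduler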

Expected behavior

If max_steps changes because the batch_size decays, the lr_scheduler should be updated to reflect this change. Using the default linear lr_scheduler, I expect the learning rate to go from its initial value at the beginning of training to zero at the end of training.
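
Purely as an illustration of that expectation (new_max_steps is a placeholder, not an existing variable in Trainer), something along these lines when the inner loop is restarted with a smaller batch size:

# Illustrative only: rebuild the scheduler for the recomputed step count
# instead of reusing the one built for the original batch size.
self.lr_scheduler = None
self.create_scheduler(num_training_steps=new_max_steps, optimizer=self.optimizer)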

thomas-schillaci avatar Feb 08 '23 17:02 thomas-schillaci

cc @muellerzr

sgugger avatar Feb 08 '23 17:02 sgugger

@muellerzr I would like to pick up this issue and fix it. I'm looking to write a failing test case for this bug, any pointers?

raghavanone avatar Feb 27 '23 05:02 raghavanone

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 12 '23 15:05 github-actions[bot]

It's been a while, @muellerzr, could you have a look?

sgugger avatar May 16 '23 13:05 sgugger

@raghavanone @thomas-schillaci could you try building from main and see if that fixes the issue? I think https://github.com/huggingface/transformers/pull/24521 fixed this.

muellerzr avatar Jul 06 '23 15:07 muellerzr

I'm not sure this fixes the issue; I'm going to comment on the PR directly.

thomas-schillaci avatar Jul 07 '23 11:07 thomas-schillaci

Same here. I built it from main.

Kaveh8 avatar Jul 10 '23 12:07 Kaveh8

Just ran into this issue today, can confirm that it exists.

Broyojo avatar Jul 12 '23 03:07 Broyojo

Hi all, try reinstalling from main; #24758 should have fixed it this time.

muellerzr avatar Jul 12 '23 03:07 muellerzr

Hello @muellerzr, it works for me, thank you for the fix!

thomas-schillaci avatar Jul 12 '23 07:07 thomas-schillaci