lr_scheduler not updated when auto_find_batch_size set to True and batch_size decays

System Info

  • transformers version: 4.26.0
  • Platform: Linux-4.18.0-372.26.1.el8_6.x86_64-x86_64-with-glibc2.28
  • Python version: 3.9.16
  • Huggingface_hub version: 0.12.0
  • PyTorch version (GPU?): 1.12.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@sgugger

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [ ] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [X] My own task or dataset (give details below)

Reproduction

Issue:

When auto_find_batch_size=True is set in TrainingArguments and the batch_size decays because batches can't fit in memory, the learning rate scheduler is not updated to reflect the new number of training steps.

In my case, I fine-tune bert-base-cased on a custom dataset using:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output",  # output_dir is a required argument
    per_device_train_batch_size=512,
    auto_find_batch_size=True,
    num_train_epochs=3,
)

My batch_size decays three times, and the learning rate decays to zero before the end of the first training epoch.
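
For concreteness, assuming the batch-size finder halves the batch size on each retry (the default behavior of accelerate's find_executable_batch_size): 512 → 256 → 128 → 64 means roughly 8x as many optimizer steps per epoch as the scheduler was built for, so a linear schedule computed for 3 epochs at batch size 512 reaches zero roughly 3/8 of the way through the first epoch at batch size 64.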

Trace:

Line 1162 of trainer.py reads if self.lr_scheduler is None:. On the first call this evaluates to True, but when the batch_size decays the method is called again; this time the check evaluates to False, which prevents the lr_scheduler from being recreated on line 1163.

I think we could replace it with: if self.args.optimizers[1] is None:
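
For reference, here is a paraphrased sketch of the guard in question (based on Trainer.create_scheduler around transformers 4.26, not the verbatim source):

# Paraphrased sketch of Trainer.create_scheduler, not the verbatim source
def create_scheduler(self, num_training_steps, optimizer=None):
    if self.lr_scheduler is None:  # line 1162: True only on the first call
        self.lr_scheduler = get_scheduler(
            self.args.lr_scheduler_type,
            optimizer=self.optimizer if optimizer is None else optimizer,
            num_warmup_steps=self.args.get_warmup_steps(num_training_steps),
            num_training_steps=num_training_steps,
        )
    # On the retry triggered by auto_find_batch_size, lr_scheduler is no longer None,
    # so it keeps the num_training_steps computed for the original batch size.
    return self.lr_scheduler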

Expected behavior

If max_steps changes because the batch_size decays, the lr_scheduler should be updated to reflect this change. Using the default linear lr_scheduler, I expect the learning rate to go from its initial value at the beginning of training to zero at the end of training.
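
Purely as an illustration of that expectation (new_max_steps is a placeholder, not an existing variable in Trainer), something along these lines when the inner loop is restarted with a smaller batch size:

# Illustrative only: rebuild the scheduler for the recomputed step count
# instead of reusing the one built for the original batch size.
self.lr_scheduler = None
self.create_scheduler(num_training_steps=new_max_steps, optimizer=self.optimizer)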

thomas-schillaci avatar Feb 08 '23 17:02 thomas-schillaci

cc @muellerzr

sgugger avatar Feb 08 '23 17:02 sgugger

@muellerzr I would like to pick up this issue and fix it. I'm looking to write a failing test case for this bug, any pointers?

raghavanone avatar Feb 27 '23 05:02 raghavanone

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar May 12 '23 15:05 github-actions[bot]

It's been a while, @muellerzr, could you have a look?

sgugger avatar May 16 '23 13:05 sgugger

@raghavanone @thomas-schillaci could you try building from main and see if that fixes the issue? I think https://github.com/huggingface/transformers/pull/24521 fixed this.

muellerzr avatar Jul 06 '23 15:07 muellerzr

I'm not sure this fixes the issue; I'm going to comment on the PR directly.

thomas-schillaci avatar Jul 07 '23 11:07 thomas-schillaci

Same here. I built it from main.

Kaveh8 avatar Jul 10 '23 12:07 Kaveh8

Just ran into this issue today, can confirm that it exists.

Broyojo avatar Jul 12 '23 03:07 Broyojo

Hi all, try reinstalling from main; #24758 should have fixed it this time.

muellerzr avatar Jul 12 '23 03:07 muellerzr

Hello @muellerzr, it works for me, thank you for the fix!

thomas-schillaci avatar Jul 12 '23 07:07 thomas-schillaci