
Overestimated number of training epochs in Trainer

fenchri opened this issue on Mar 11, 2023

System Info

  • transformers version: 4.26.0
  • Platform: Linux-5.4.0-136-generic-x86_64-with-glibc2.17
  • Python version: 3.8.16
  • Huggingface_hub version: 0.12.1
  • PyTorch version (GPU?): 1.12.0+cu113 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: Yes
  • Using distributed or parallel set-up in script?: Yes

Who can help?

@sgugger

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

Tasks

  • [X] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • [ ] My own task or dataset (give details below)

Reproduction

Under certain circumstances, when max_steps is set and the dataloader length is not divisible by gradient_accumulation_steps, the number of epochs printed at the start of training can be overestimated, even if dataloader_drop_last is set to False.

On an example run with the following inputs, Trainer reported 100 training epochs instead of 87.

python run_clm.py \
    --model_name_or_path gpt2 \
    --dataset_name wikitext \
    --dataset_config_name wikitext-103-raw-v1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --output_dir /tmp/test-clm \
    --max_train_samples=148 \
    --gradient_accumulation_steps=32 \
    --overwrite_output_dir \
    --max_steps=200 \
    --logging_steps=10 \
    --dataloader_drop_last=False

[INFO|trainer.py:1650] 2023-03-11 15:21:27,133 >> ***** Running training *****
[INFO|trainer.py:1651] 2023-03-11 15:21:27,133 >>   Num examples = 148
[INFO|trainer.py:1652] 2023-03-11 15:21:27,133 >>   Num Epochs = 100
[INFO|trainer.py:1653] 2023-03-11 15:21:27,133 >>   Instantaneous batch size per device = 2
[INFO|trainer.py:1654] 2023-03-11 15:21:27,133 >>   Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:1655] 2023-03-11 15:21:27,133 >>   Gradient Accumulation steps = 32
[INFO|trainer.py:1656] 2023-03-11 15:21:27,133 >>   Total optimization steps = 200
[INFO|trainer.py:1657] 2023-03-11 15:21:27,133 >>   Number of trainable parameters = 124439808

I believe this happens because of how num_update_steps_per_epoch is computed here and, consequently, how num_train_epochs is derived from it here.
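
For reference, here is a small standalone sketch of the arithmetic as I understand it (the variable names paraphrase the Trainer code rather than quote it), using the numbers from the run above:

import math

# Numbers from the run above: 148 samples / per-device batch size 2 = 74 batches,
# assuming a single process.
len_dataloader = 74
gradient_accumulation_steps = 32
max_steps = 200

# The integer division floors the update steps per epoch to 2 (the true rate is
# 74 / 32 = 2.3125), so 200 optimization steps get reported as 100 epochs.
num_update_steps_per_epoch = max(len_dataloader // gradient_accumulation_steps, 1)  # 2
num_train_epochs = math.ceil(max_steps / num_update_steps_per_epoch)
print(num_train_epochs)  # 100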

Expected behavior

I would expect the estimated number of epochs to be closer to the actual number of epochs. Perhaps in this case num_train_epochs could be computed as:

update_steps_per_epoch = len_dataloader / args.gradient_accumulation_steps
num_train_epochs = math.ceil(args.max_steps / update_steps_per_epoch)
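
Plugging in the numbers from the run above (again assuming a dataloader length of 74, with plain local variables standing in for the Trainer/args values), this would give 87, which matches how long the run actually takes:

import math

len_dataloader = 74               # 148 samples / per-device batch size 2
gradient_accumulation_steps = 32  # stands in for args.gradient_accumulation_steps
max_steps = 200                   # stands in for args.max_steps

# Keep the fractional update steps per epoch instead of flooring them.
update_steps_per_epoch = len_dataloader / gradient_accumulation_steps  # 2.3125
num_train_epochs = math.ceil(max_steps / update_steps_per_epoch)
print(num_train_epochs)  # 87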

Thank you in advance!

fenchri · Mar 11, 2023

Thanks for the report! Would you like to suggest a PR with your fix?

sgugger · Mar 13, 2023

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] · Apr 11, 2023