Overestimated number of training epochs in Trainer
System Info
- `transformers` version: 4.26.0
- Platform: Linux-5.4.0-136-generic-x86_64-with-glibc2.17
- Python version: 3.8.16
- Huggingface_hub version: 0.12.1
- PyTorch version (GPU?): 1.12.0+cu113 (True)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: Yes
- Using distributed or parallel set-up in script?: Yes
Who can help?
@sgugger
Information
- [X] The official example scripts
- [ ] My own modified scripts
Tasks
- [X] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
Under certain circumstances, given `max_steps` and a dataloader whose length is not divisible by `gradient_accumulation_steps`, the number of epochs printed during training can be overestimated, even if `dataloader_drop_last` is set to `False`.
On an example run with the following inputs, Trainer calculated 100 training epochs instead of 87.
python run_clm.py \
--model_name_or_path gpt2 \
--dataset_name wikitext \
--dataset_config_name wikitext-103-raw-v1 \
--per_device_train_batch_size 2 \
--per_device_eval_batch_size 1 \
--do_train \
--output_dir /tmp/test-clm \
--max_train_samples=148 \
--gradient_accumulation_steps=32 \
--overwrite_output_dir \
--max_steps=200 \
--logging_steps=10 \
--dataloader_drop_last=False
[INFO|trainer.py:1650] 2023-03-11 15:21:27,133 >> ***** Running training *****
[INFO|trainer.py:1651] 2023-03-11 15:21:27,133 >> Num examples = 148
[INFO|trainer.py:1652] 2023-03-11 15:21:27,133 >> Num Epochs = 100
[INFO|trainer.py:1653] 2023-03-11 15:21:27,133 >> Instantaneous batch size per device = 2
[INFO|trainer.py:1654] 2023-03-11 15:21:27,133 >> Total train batch size (w. parallel, distributed & accumulation) = 64
[INFO|trainer.py:1655] 2023-03-11 15:21:27,133 >> Gradient Accumulation steps = 32
[INFO|trainer.py:1656] 2023-03-11 15:21:27,133 >> Total optimization steps = 200
[INFO|trainer.py:1657] 2023-03-11 15:21:27,133 >> Number of trainable parameters = 124439808
I believe this happens due to the computation here and, consequently, here: the number of update steps per epoch is rounded down before the epoch count is derived from it.
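For reference, a minimal sketch of the arithmetic (not the actual Trainer code) with the numbers from this run, assuming the per-epoch update steps are floored before the epoch count is computed:

```python
import math

# Numbers from the run above (single GPU, per-device batch size 2, 148 samples).
len_dataloader = 148 // 2             # 74 batches per epoch
gradient_accumulation_steps = 32
max_steps = 200

# Assumed current computation: integer division floors 74 / 32 down to 2,
# so the epoch estimate becomes ceil(200 / 2) = 100.
num_update_steps_per_epoch = max(len_dataloader // gradient_accumulation_steps, 1)
num_train_epochs = math.ceil(max_steps / num_update_steps_per_epoch)
print(num_train_epochs)  # 100

# In reality an epoch yields 74 / 32 = 2.3125 optimizer updates,
# so 200 steps only require about 87 passes over the data.
```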
Expected behavior
Expected the estimated number of epochs to be closer to the actual number of epochs. Perhaps `num_train_epochs` could instead be computed as:
update_steps_per_epoch = len_dataloader / args.gradient_accumulation_steps
num_train_epochs = math.ceil(args.max_steps / update_steps_per_epoch)
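With the numbers above (a 74-batch dataloader and `gradient_accumulation_steps=32`), this gives `math.ceil(200 / 2.3125) = 87`, matching the actual number of passes over the data.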
Thank you in advance!
Thanks for the report! Would you like to suggest a PR with your fix?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.