[trainer] bug in resume and gas>1
https://github.com/huggingface/transformers/pull/22098 fixed the issue with GAS>1 at the epoch boundary.
The same bug will still happen at the resume boundary, since total_batched_samples is currently reset to 0 when training restarts from a checkpoint.
So we need to save total_batched_samples and restore it from the saved value on resume.
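For illustration, here is a minimal sketch (plain PyTorch, not the actual Trainer code; `save_checkpoint`/`load_checkpoint` are hypothetical helpers) of what persisting and restoring the counter could look like, so that the accumulation boundary (roughly `total_batched_samples % gradient_accumulation_steps == 0`) lines up with where training left off:

```python
import torch

def save_checkpoint(path, model, optimizer, total_batched_samples):
    # Store the counter next to the usual checkpoint contents.
    torch.save(
        {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "total_batched_samples": total_batched_samples,
        },
        path,
    )

def load_checkpoint(path, model, optimizer):
    state = torch.load(path)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    # Restore the counter instead of resetting it to 0, so the next optimizer
    # step happens after the intended number of remaining micro-batches.
    return state["total_batched_samples"]
```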
Actually, I thought more about it this morning. The gradients accumulated before the save will be lost, so even if we save the total_batched_samples variable, we won't be able to resume training with the same gradients (they will be 0 instead of whatever was accumulated before the checkpoint).
So I think leaving the situation as is is okay: there is a tiny bit of training lost, but it shouldn't impact convergence. And we should document somewhere that we do not guarantee that resuming from a checkpoint will yield the exact same model when using save_strategy="epoch" in conjunction with gradient accumulation.
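To make this concrete, here is a toy illustration (a plain PyTorch loop, not the Trainer itself) of why the partially accumulated gradients cannot be recovered: a checkpoint captures the model and optimizer state dicts, but not the `.grad` buffers, so whatever was accumulated since the last optimizer step simply is not there after a resume.

```python
import torch

model = torch.nn.Linear(8, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
gas = 4  # gradient accumulation steps

ckpt = None
for step in range(6):
    loss = model(torch.randn(2, 8)).sum()
    (loss / gas).backward()          # gradients accumulate in param.grad

    if (step + 1) % gas == 0:        # optimizer step only every `gas` micro-batches
        optimizer.step()
        optimizer.zero_grad()

    if step == 5:                    # checkpoint lands mid-accumulation (2 micro-batches in)
        ckpt = {"model": model.state_dict(), "optimizer": optimizer.state_dict()}

# "Resume": a fresh model reloaded from the checkpoint has no .grad buffers,
# so the 2 micro-batches accumulated after the last optimizer step are gone.
resumed = torch.nn.Linear(8, 1)
resumed.load_state_dict(ckpt["model"])
print(resumed.weight.grad)  # None
```

At most `gradient_accumulation_steps - 1` micro-batches are lost this way, which is the "tiny bit of training lost" mentioned above.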
Oh, I wrongly assumed that the gradients were saved. Yes, then it makes sense. There will be no miscalculation then, just a very minor loss of intermediate results. I think it's all good.