[trainer] fix bug in grad accum with multiple epochs
Please see https://github.com/huggingface/transformers/issues/22082 for the analysis printout of the problem.
But basically we have a bug in the grad accum machinery when `steps_in_epoch % gradient_accumulation_steps != 0`:
we always check `(step + 1) % gradient_accumulation_steps != 0` with a `step` counter that resets at every epoch, so when we hit the epoch boundary the leftover micro-batches carry over into the next epoch and we end up accumulating more than `gradient_accumulation_steps` batches into a single optimizer step.
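To make the failure mode concrete, here is a minimal standalone sketch (not the actual `Trainer` code; the counters and loop are simplified) of what the per-epoch `step` check does when `steps_in_epoch` is not a multiple of `gradient_accumulation_steps`:

```python
# Simplified sketch of the buggy bookkeeping: `step` resets to 0 at every
# epoch, so the micro-batches left over at the end of an epoch bleed into
# the next epoch's accumulation window.
num_epochs = 2
steps_in_epoch = 5                    # not divisible by gradient_accumulation_steps
gradient_accumulation_steps = 4

accumulated = 0
for epoch in range(num_epochs):
    for step in range(steps_in_epoch):
        accumulated += 1              # stands in for one backward() worth of gradients
        if (step + 1) % gradient_accumulation_steps == 0:
            print(f"epoch {epoch}, step {step}: optimizer.step() after {accumulated} micro-batches")
            accumulated = 0
# Output:
# epoch 0, step 3: optimizer.step() after 4 micro-batches
# epoch 1, step 3: optimizer.step() after 5 micro-batches   <-- more than gradient_accumulation_steps
```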
I proposed a fix that uses a total step counter instead - please feel free to suggest a different approach.
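As a rough illustration of that direction (the variable name below is just for the sketch, not necessarily what the PR ends up using), driving the modulo check with a counter that never resets across epochs keeps every optimizer step at exactly `gradient_accumulation_steps` micro-batches:

```python
# Same toy loop as above, but the modulo check is driven by a running
# counter that is never reset at epoch boundaries.
num_epochs = 2
steps_in_epoch = 5
gradient_accumulation_steps = 4

total_batched_samples = 0             # grows monotonically across all epochs
accumulated = 0
for epoch in range(num_epochs):
    for step in range(steps_in_epoch):
        total_batched_samples += 1
        accumulated += 1
        if total_batched_samples % gradient_accumulation_steps == 0:
            print(f"epoch {epoch}, step {step}: optimizer.step() after {accumulated} micro-batches")
            accumulated = 0
# Output:
# epoch 0, step 3: optimizer.step() after 4 micro-batches
# epoch 1, step 2: optimizer.step() after 4 micro-batches
```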
I left the debug prints in so you can validate the situation yourself; I will remove them once we're happy.
Fixes: https://github.com/huggingface/transformers/issues/22082