Gradient accumulation yields worse results than the equivalent batch size
I expected a training configuration with per_device_train_batch_size=1 and gradient_accumulation_steps=32 to yield the same (or a similar) result as per_device_train_batch_size=32 and gradient_accumulation_steps=1, but that's not the case: the former is much worse. I ran several experiments with SmolLM-135M and Llama 3.2 1B, always using the same seed, and the results are consistent with this observation.
Maybe I misunderstand something here?
My training code is in this Colab notebook. I ran this notebook to draw the learning curves above, restarting the notebook between each training to avoid OOM. Note that I have the same observations with Qwen2.
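For reference, the two configurations being compared boil down to this (a minimal sketch, not the exact notebook code; output directories are placeholders and model/dataset arguments are omitted):

```python
from trl import SFTConfig

# Reference run: a true batch of 32 sequences per optimizer step.
full_batch = SFTConfig(
    output_dir="smollm-full-batch",
    per_device_train_batch_size=32,
    gradient_accumulation_steps=1,
    seed=42,
)

# Accumulated run: 32 micro-batches of 1 sequence, i.e. the same 32 sequences
# per optimizer step, yet the loss curves come out noticeably worse.
accumulated = SFTConfig(
    output_dir="smollm-grad-accum",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    seed=42,
)
```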
Hey, this is expected behaviour. FSDP-1 only allows accumulation in 16-bit precision. This is not the case for FSDP-2 which allows accumulation in both 16-bit and 32-bit.
documentation for FSDP-1:
documentation for FSDP-2:
Interesting, I didn't know this. But I don't think it matters; I would be surprised if TRL used FSDP's reduce-scatter for single-GPU training.
Hi, thanks for reporting this. Can you share your system info and the code you use for training?
Sure, it's all in the notebook I linked to in my first post. I ran this notebook on Colab with the A100.
Someone tried it in fp32 and it didn't help, so that doesn't seem to be the reason:
https://x.com/bnjmn_marie/status/1842464802636980564
Have you tried full/mixed precision AdamW optimiser?
Yes:
This configuration uses fp32 and adamw_torch.
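In terms of the configuration knobs, that corresponds roughly to this (a sketch; the exact values in the notebook may differ):

```python
from trl import SFTConfig

config = SFTConfig(
    output_dir="smollm-fp32-accum",   # hypothetical output dir
    per_device_train_batch_size=1,
    gradient_accumulation_steps=32,
    optim="adamw_torch",              # full-precision PyTorch AdamW
    fp16=False,                       # keep compute and accumulation in fp32
    bf16=False,
    seed=42,
)
```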
Hi, are there any updates? Thanks!
I'm writing up a report about this - I think I managed to fix it :) (Yes it is in fact a subtle bug!) - will tweet and post about it in like 8 - 10 hours!
We have fixed the issue guys!
Tweet: https://twitter.com/UnslothAI/status/1846231235749990699 Blogpost: https://unsloth.ai/blog/gradient
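In short (a sketch of what the blog post describes, with $K$ accumulation steps, $n_k$ unmasked tokens in micro-batch $k$, and $\ell_{k,i}$ the per-token cross-entropy): the naive accumulated loss averages the per-micro-batch means, while the full-batch loss weights every token equally,

$$
L_{\text{naive}} = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{n_k}\sum_{i=1}^{n_k}\ell_{k,i}
\qquad\text{vs.}\qquad
L_{\text{full}} = \frac{\sum_{k=1}^{K}\sum_{i=1}^{n_k}\ell_{k,i}}{\sum_{k=1}^{K} n_k}.
$$

The two only match when all $n_k$ are equal, which rarely happens with padded, variable-length sequences.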
nice! feel like fixing it in TRL too?
The Hugging Face team is already on it! :)
(Somewhat; we're currently trying to reverse-engineer a few of the ways you did it. You'd be much faster at it, I imagine, if you want to beat us to it ;) This is more than TRL, though; it's ground-up transformers/Trainer work, I think.)
:) Wrote a detailed tweet about it: https://x.com/danielhanchen/status/1846235913443262891 Also Reddit post: https://www.reddit.com/r/LocalLLaMA/comments/1g4ego7/llm_training_bug_fixes_gradient_accumulation_was/ Blog post: https://unsloth.ai/blog/gradient Also @shimmyshimmer is my brother!! :)
Just as a fair warning, this will be neither an immediate nor a quick fix, since essentially it means every single model's calculation is off when using output.loss, and every single model will need a custom variation of CrossEntropy (and other valid loss functions) if you do not calculate the loss by hand.
We are working on figuring out the best solution.
@danielhanchen, from the blog: "The 2nd theory was there is in fact a bug in the loss calculation, which we find to be the case." Is this bug specific to the CrossEntropy loss calculation in HF trl? Would it not be an issue if someone is using, say, torch.nn.CrossEntropyLoss?
@muellerzr, I believe this only makes sense for padding-based batches. For packing there are no pad tokens in the batch, so the average cross entropy is consistent.
@nahidalam Unfortunately this is not an HF-native issue. The way gradient accumulation was originally done in many packages, even those that use PyTorch directly, accidentally missed accounting for ignored tokens. Using CE loss directly does not solve the issue, since mean reduction does not work and sum will cause the loss to be scaled incorrectly.
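Concretely, here is a minimal sketch (made-up token counts and plain PyTorch, not the actual Trainer code) of why a per-micro-batch mean does not reproduce the full-batch loss:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab = 16

# Four micro-batches with different numbers of non-ignored tokens, as happens
# when variable-length sequences are padded and pad labels are set to -100.
token_counts = [3, 7, 2, 20]
logits = [torch.randn(n, vocab) for n in token_counts]
labels = [torch.randint(0, vocab, (n,)) for n in token_counts]

# Naive gradient accumulation: mean loss per micro-batch, then mean over steps.
naive = torch.stack(
    [F.cross_entropy(lg, lb, reduction="mean") for lg, lb in zip(logits, labels)]
).mean()

# Equivalent large batch: a single mean over all tokens.
full = F.cross_entropy(torch.cat(logits), torch.cat(labels), reduction="mean")

print(f"naive={naive.item():.4f}  full={full.item():.4f}")  # differ whenever counts are unequal
```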
@huseinzol05 Packing is also affected, albeit less so; and since some people also train on completions only, it will still make the loss incorrect.
@muellerzr If you guys need any help on anything, ping me!
Kudos @danielhanchen on the fix! Neat write-up as well! Back to the OP: I think the issue isn't with the trl library but with the transformers library, because of how SFTTrainer extends Trainer, how the loss is calculated in Trainer's compute_loss, and how it is naively scaled by the number of steps here. I don't have a ton of context, but I imagine the more principled solution would be to fix it within Trainer.compute_loss, versus, say, having SFTTrainer override the compute_loss method. Happy to assist with the transformers fix if anyone from HF would like to take me up on it 😄
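Roughly what I have in mind (a hedged sketch only, not the actual patch; num_items_in_batch is a hypothetical argument holding the total number of unmasked tokens across the whole accumulation window):

```python
import torch.nn.functional as F

def causal_lm_loss(logits, labels, num_items_in_batch):
    # Standard causal LM shift: position t predicts token t+1.
    shift_logits = logits[..., :-1, :].contiguous()
    shift_labels = labels[..., 1:].contiguous()
    # Sum the per-token losses instead of taking a per-micro-batch mean...
    loss = F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,
        reduction="sum",
    )
    # ...and normalize once by the token count of the full accumulated batch,
    # so the Trainer must not additionally divide by gradient_accumulation_steps.
    return loss / num_items_in_batch
```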
Does DDP have the same issue? @danielhanchen
Yes, DDP does. We've already documented this and a fix is being put in. (I also have an article talking about this in more detail; tl;dr you can choose a slower option of gathering all of the inputs/counts, which adds a communication step that generally isn't recommended, so it's False by default.)
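In code, that extra gather is roughly this (a sketch that assumes torch.distributed is already initialized; the helper name is illustrative):

```python
import torch.distributed as dist

def global_num_items(labels, ignore_index=-100):
    # Count the non-ignored tokens on this rank...
    count = (labels != ignore_index).sum()
    # ...then sum the counts across all DDP ranks. This is the extra
    # communication step, which is why it is off (False) by default.
    dist.all_reduce(count, op=dist.ReduceOp.SUM)
    return count
```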
Should this be closed since it's fixed in transformers?
cc @qgallouedec @lewtun
Right @burtenshaw. Closed by https://github.com/huggingface/transformers/pull/34198
Time to hit that "Close Issue" button @qgallouedec @burtenshaw! :) I thought the issue was open because of that!
Oops
For a language modeling task, will this be a problem even if all samples in a batch have exactly the same sequence length?