
ZeRO Gradient Accumulation Dtype.

Open · jomayeri opened this pull request 2 years ago • 6 comments

This PR adds a configurable gradient accumulation data type to ZeRO stages 1 and 2. It introduces two gradient attributes on the parameter object:

  1. param.grad_reduc: A pointer to the gradient tensor that should be used in reduction. All reduction operations act on the tensor this attribute references.
  2. param.grad_accum: A pointer or a standalone tensor holding the accumulated gradients. All accumulation operations act on the tensor this attribute references.

Two stages 1 & 2. Two modes:

  1. model_dtype == gradient_accumulation_dtype
  2. model_dtype != gradient_accumulation_dtype

Stage 1, Mode 1:

  • param.grad_reduc = param.grad
  • param.grad_accum = param.grad

Stage 1, Mode 2:

  • param.grad_reduc = param.grad_accum
  • param.grad_accum = standalone tensor

Stage 2, Mode 1:

  • param.grad_reduc = param.grad
  • param.grad_accum = param.grad

Stage 2, Mode 2:

  • param.grad_reduc = param.grad
  • param.grad_accum = standalone tensor
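
Putting the table together, here is a minimal sketch of the scheme in Python. This is illustrative only: the helper names (`setup_grad_attrs`, `accumulate`) are made up for this example and are not DeepSpeed's actual functions.

```python
import torch

def setup_grad_attrs(param, stage, model_dtype, grad_accum_dtype):
    # Illustrative only -- mirrors the table above, not DeepSpeed's real code.
    if model_dtype == grad_accum_dtype:
        # Mode 1 (both stages): both attributes alias the autograd gradient.
        param.grad_accum = param.grad
        param.grad_reduc = param.grad
    else:
        # Mode 2: accumulate into a standalone tensor in the accumulation dtype.
        param.grad_accum = torch.zeros_like(param, dtype=grad_accum_dtype)
        # Stage 1 reduces the accumulated tensor; stage 2 reduces the raw grad.
        param.grad_reduc = param.grad_accum if stage == 1 else param.grad

def accumulate(param):
    # All accumulation goes through grad_accum.
    if param.grad_accum is not param.grad:
        param.grad_accum.add_(param.grad.to(param.grad_accum.dtype))
    # else: autograd has already accumulated into param.grad in place.
```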

jomayeri avatar Feb 17 '23 02:02 jomayeri

@stas00, FYI

tjruwase avatar Feb 21 '23 13:02 tjruwase

Thank you very much for working on this important feature, Joe!

stas00 avatar Feb 21 '23 18:02 stas00

This is much easier to follow, @jomayeri - thank you for the renames!

stas00 avatar Feb 23 '23 02:02 stas00

@jomayeri, if I may share a useful gh trick with you - you might want to add:

Fixes: https://github.com/microsoft/DeepSpeed/issues/2352

to the OP, and GitHub will automatically link the issue and close it when this PR is merged. Others who land on the issue page will also know which PR to try out.

stas00 avatar Feb 24 '23 06:02 stas00

Can this just be merged into master so we can test it ourselves?

danyang-rainbow avatar Mar 07 '23 06:03 danyang-rainbow

So, who can help move this PR along? QAQ

danyang-rainbow avatar Jun 02 '23 09:06 danyang-rainbow

@tjruwase It is my understanding that when training with BF16 enabled this PR means gradients are now accumulated in FP32, but master weights are still stored as BF16. Is this true? Has this been shown to not hurt model performance? If not, is there a way to store the master weights as FP32 as is done in DeepSpeed's FP16 mixed precision training?

zaptrem avatar Jul 29 '24 23:07 zaptrem

> @tjruwase It is my understanding that when training with BF16 enabled this PR means gradients are now accumulated in FP32, but master weights are still stored as BF16. Is this true? Has this been shown to not hurt model performance? If not, is there a way to store the master weights as FP32 as is done in DeepSpeed's FP16 mixed precision training?

This PR does not affect the master weight precision, which remains fp32 in the optimizer. It simply provides an option to accumulate gradients in fp32, even though fwd/bwd runs in bf16 and therefore produces bf16 gradients.
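
For example, the combination looks roughly like this as a DeepSpeed config dict (a sketch; the `data_types.grad_accum_dtype` key reflects my understanding of the option this PR adds, so double-check the config docs for your DeepSpeed version):

```python
import deepspeed  # assumes a model defined elsewhere

# Sketch: bf16 fwd/bwd with fp32 gradient accumulation; master weights stay fp32
# in the optimizer regardless. Verify key names against your DeepSpeed version.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},                   # bf16 fwd/bwd -> bf16 gradients
    "data_types": {"grad_accum_dtype": "fp32"},  # accumulate those grads in fp32
    "zero_optimization": {"stage": 2},
}

# engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)
```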

tjruwase avatar Jul 29 '24 23:07 tjruwase

> @tjruwase It is my understanding that when training with BF16 enabled this PR means gradients are now accumulated in FP32, but master weights are still stored as BF16. Is this true? Has this been shown to not hurt model performance? If not, is there a way to store the master weights as FP32 as is done in DeepSpeed's FP16 mixed precision training?
>
> This PR does not affect the master weight precision, which remains fp32 in the optimizer. It simply provides an option to accumulate gradients in fp32, even though fwd/bwd runs in bf16 and therefore produces bf16 gradients.

Thanks. To be clear, are you saying master weights are stored in BF16 or FP32 when BF16 is enabled? I tried to switch my FP16 run (which afaik saves weights as FP32?) to BF16 and got an error regarding missing BF16 checkpoint files. When I manually initialized in BF16 by loading the params converted to FP32, the loss jumped a little and quickly started coming back down again.

zaptrem avatar Jul 30 '24 00:07 zaptrem