
ZeRO Gradient Accumulation Dtype.

Open · jomayeri opened this pull request 2 years ago • 6 comments

This PR adds a configurable gradient accumulation data type to ZeRO stages 1 and 2. It introduces two gradient attributes on the parameter object:

  1. param.grad_reduc: A pointer to the gradient tensor that should be used in reduction. All reduction operations act on the tensor this attribute references.
  2. param.grad_accum: A pointer or a standalone tensor holding the accumulated gradients. All accumulation operations act on the tensor this attribute references.

Two stages 1 & 2. Two modes:

  1. model_dtype == gradient_accumulation_dtype
  2. model_dtype != gradient_accumulation_dtype

Stage 1, Mode 1:

  • param.grad_reduc = param.grad
  • param.grad_accum = param.grad

Stage 1, Mode 2:

  • param.grad_reduc = param.grad_accum
  • param.grad_accum = standalone tensor

Stage 2, Mode 1:

  • param.grad_reduc = param.grad
  • param.grad_accum = param.grad

Stage 2, Mode 2:

  • param.grad_reduc = param.grad
  • param.grad_accum = standalone tensor
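
Putting the table together, here is a minimal sketch of the scheme in Python. This is illustrative only: the helper names (`setup_grad_attrs`, `accumulate`) are made up for this example and are not DeepSpeed's actual functions.

```python
import torch

def setup_grad_attrs(param, stage, model_dtype, grad_accum_dtype):
    # Illustrative only -- mirrors the table above, not DeepSpeed's real code.
    if model_dtype == grad_accum_dtype:
        # Mode 1 (both stages): both attributes alias the autograd gradient.
        param.grad_accum = param.grad
        param.grad_reduc = param.grad
    else:
        # Mode 2: accumulate into a standalone tensor in the accumulation dtype.
        param.grad_accum = torch.zeros_like(param, dtype=grad_accum_dtype)
        # Stage 1 reduces the accumulated tensor; stage 2 reduces the raw grad.
        param.grad_reduc = param.grad_accum if stage == 1 else param.grad

def accumulate(param):
    # All accumulation goes through grad_accum.
    if param.grad_accum is not param.grad:
        param.grad_accum.add_(param.grad.to(param.grad_accum.dtype))
    # else: autograd has already accumulated into param.grad in place.
```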

jomayeri avatar Feb 17 '23 02:02 jomayeri

@stas00, FYI

tjruwase avatar Feb 21 '23 13:02 tjruwase

Thank you very much for working on this important feature, Joe!

stas00 avatar Feb 21 '23 18:02 stas00

This is much easier to follow, @jomayeri - thank you for the renames!

stas00 avatar Feb 23 '23 02:02 stas00

@jomayeri, if I may share a useful gh trick with you - you might want to add:

Fixes: https://github.com/microsoft/DeepSpeed/issues/2352

to the OP, and GitHub will automatically link the issue and close it when this PR is merged. Others who land on the issue page will also know which PR to try out.

stas00 avatar Feb 24 '23 06:02 stas00

Can this just be merged into master so we can test it ourselves?

danyang-rainbow avatar Mar 07 '23 06:03 danyang-rainbow

So, who can help move this PR along? QAQ

danyang-rainbow avatar Jun 02 '23 09:06 danyang-rainbow

@tjruwase It is my understanding that when training with BF16 enabled this PR means gradients are now accumulated in FP32, but master weights are still stored as BF16. Is this true? Has this been shown to not hurt model performance? If not, is there a way to store the master weights as FP32 as is done in DeepSpeed's FP16 mixed precision training?

zaptrem avatar Jul 29 '24 23:07 zaptrem

> @tjruwase It is my understanding that when training with BF16 enabled this PR means gradients are now accumulated in FP32, but master weights are still stored as BF16. Is this true? Has this been shown to not hurt model performance? If not, is there a way to store the master weights as FP32 as is done in DeepSpeed's FP16 mixed precision training?

This PR does not affect the master weight precision, which remains fp32 in the optimizer. It simply provides an option to accumulate gradients in fp32, even though fwd/bwd runs in bf16 and therefore produces bf16 gradients.
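
For example, the combination looks roughly like this as a DeepSpeed config dict (a sketch; the `data_types.grad_accum_dtype` key reflects my understanding of the option this PR adds, so double-check the config docs for your DeepSpeed version):

```python
import deepspeed  # assumes a model defined elsewhere

# Sketch: bf16 fwd/bwd with fp32 gradient accumulation; master weights stay fp32
# in the optimizer regardless. Verify key names against your DeepSpeed version.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},                   # bf16 fwd/bwd -> bf16 gradients
    "data_types": {"grad_accum_dtype": "fp32"},  # accumulate those grads in fp32
    "zero_optimization": {"stage": 2},
}

# engine, optimizer, _, _ = deepspeed.initialize(model=model, config=ds_config)
```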

tjruwase avatar Jul 29 '24 23:07 tjruwase

> @tjruwase It is my understanding that when training with BF16 enabled this PR means gradients are now accumulated in FP32, but master weights are still stored as BF16. Is this true? Has this been shown to not hurt model performance? If not, is there a way to store the master weights as FP32 as is done in DeepSpeed's FP16 mixed precision training?
>
> This PR does not affect the master weight precision, which remains fp32 in the optimizer. It simply provides an option to accumulate gradients in fp32, even though fwd/bwd runs in bf16 and therefore produces bf16 gradients.

Thanks. To be clear, are you saying master weights are stored in BF16 or FP32 when BF16 is enabled? I tried to switch my FP16 run (which afaik saves weights as FP32?) to BF16 and got an error regarding missing BF16 checkpoint files. When I manually initialized in BF16 by loading the params converted to FP32, the loss jumped a little and quickly started coming back down again.

zaptrem avatar Jul 30 '24 00:07 zaptrem