ZeRO Gradient Accumulation Dtype.
This PR adds a configurable gradient accumulation data type to ZeRO stages 1 and 2. Two gradient attributes are added to the parameter object (a short usage sketch follows this list):

- `param.grad_reduc`: used only as a pointer to whichever gradient tensor should be used in reduction. The tensor object this attribute points to is used in all reduction operations.
- `param.grad_accum`: a pointer to, or a standalone tensor holding, the accumulated gradients. The tensor object this attribute points to is used in all accumulation operations.
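For illustration only, the way these two attributes are meant to be consumed could look like the PyTorch sketch below. This is not the actual DeepSpeed code; the helper name, the upcast, and the use of `all_reduce` as the collective are assumptions.

```python
import torch
import torch.distributed as dist

def accumulate_and_reduce(param, process_group=None):
    """Illustrative sketch: all accumulation goes through param.grad_accum,
    all reduction goes through param.grad_reduc."""
    # Accumulation: add the freshly computed gradient into the accumulation
    # tensor, upcasting when the accumulation dtype differs from the model dtype.
    # When grad_accum simply aliases .grad, autograd already accumulates in place.
    if param.grad_accum is not param.grad:
        param.grad_accum.add_(param.grad.to(param.grad_accum.dtype))

    # Reduction: whichever tensor param.grad_reduc points to is the one
    # handed to the collective (ZeRO's real reduction path differs).
    if dist.is_initialized():
        dist.all_reduce(param.grad_reduc, group=process_group)
```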
There are two stages (1 & 2) and two modes; a sketch of the resulting wiring follows the table below.

- Mode 1: `model_dtype == gradient_accumulation_dtype`
- Mode 2: `model_dtype != gradient_accumulation_dtype`

Stage 1, Mode 1:

- `param.grad_reduc = param.grad`
- `param.grad_accum = param.grad`

Stage 1, Mode 2:

- `param.grad_reduc = param.grad_accum`
- `param.grad_accum` = standalone tensor

Stage 2, Mode 1:

- `param.grad_reduc = param.grad`
- `param.grad_accum = param.grad`

Stage 2, Mode 2:

- `param.grad_reduc = param.grad`
- `param.grad_accum` = standalone tensor
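A minimal sketch of how that table could translate into attribute wiring, assuming a hypothetical setup helper and that the standalone tensor is allocated in the accumulation dtype (this is not the PR's actual implementation):

```python
import torch

def setup_grad_attributes(param, stage, model_dtype, gradient_accumulation_dtype):
    """Hypothetical helper: wire grad_reduc/grad_accum per the stage/mode table above.
    Assumes param.grad has already been materialized in model_dtype."""
    if model_dtype == gradient_accumulation_dtype:
        # Mode 1 (both stages): both attributes alias the regular .grad tensor.
        param.grad_accum = param.grad
        param.grad_reduc = param.grad
    else:
        # Mode 2: accumulate into a standalone tensor in the accumulation dtype.
        param.grad_accum = torch.zeros_like(param, dtype=gradient_accumulation_dtype)
        if stage == 1:
            # Stage 1 reduces the accumulated (e.g. fp32) gradients.
            param.grad_reduc = param.grad_accum
        else:
            # Stage 2 reduces the model-dtype gradients directly.
            param.grad_reduc = param.grad
```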
@stas00, FYI
Thank you very much for working on this important feature, Joe!
This is much easier to follow, @jomayeri - thank you for the renames!
@jomayeri, if I may share a useful gh trick with you - you might want to add:
Fixes: https://github.com/microsoft/DeepSpeed/issues/2352
to the OP and it'll automatically link to close the issue on the merge of this one. And the others will also know which PR to try out when they land on the Issue page.
Can we just merge this into master and test it ourselves? Who can help move this PR along?
@tjruwase It is my understanding that when training with BF16 enabled this PR means gradients are now accumulated in FP32, but master weights are still stored as BF16. Is this true? Has this been shown to not hurt model performance? If not, is there a way to store the master weights as FP32 as is done in DeepSpeed's FP16 mixed precision training?
This PR does not affect the master weight precision, which remains fp32 in the optimizer. It provides an option to do gradient accumulation in fp32, even though fwd/bwd runs in bf16 and therefore produces bf16 gradients.
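For anyone wanting to try this once it lands, enabling the option could look roughly like the sketch below. The `data_types.grad_accum_dtype` key name is my assumption about what this PR exposes; please check the merged docs for the final name.

```python
import torch
import deepspeed

model = torch.nn.Linear(8, 8)  # stand-in model for the example

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "zero_optimization": {"stage": 2},
    "bf16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    # Assumed key name for the option added in this PR: accumulate gradients
    # in fp32 even though forward/backward runs in bf16.
    "data_types": {"grad_accum_dtype": "fp32"},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```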
Thanks. To be clear, are you saying master weights are stored in BF16 or FP32 when BF16 is enabled? I tried to switch my FP16 run (which afaik saves weights as FP32?) to BF16 and got an error regarding missing BF16 checkpoint files. When I manually initialized in BF16 by loading the params converted to FP32, the loss jumped a little and quickly started coming back down again.