
[REQUEST] BF16 mixed precision => grad accum in fp32


Is your feature request related to a problem? Please describe.

We have proven with the BLOOM training that BF16 is far superior to FP16 for mixed precision training, using Megatron-DeepSpeed.

But Megatron-DeepSpeed is very complex; it would be much easier for most users to do bf16 mixed precision training with standalone ZeRO.

But for this to work, ZeRO needs to support gradient accumulation in fp32, as the recently added BF16Optimizer does.
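To make the motivation concrete, here is a quick standalone demonstration (plain PyTorch, not DeepSpeed code) of what goes wrong when gradients are accumulated directly in bf16:

```python
# bf16 has only 8 mantissa bits, so small per-step gradient contributions
# are rounded away when added into a bf16 buffer; an fp32 buffer keeps them.
import torch

steps, small_grad = 1000, 1e-4

acc_bf16 = torch.tensor(1.0, dtype=torch.bfloat16)
acc_fp32 = torch.tensor(1.0, dtype=torch.float32)

for _ in range(steps):
    acc_bf16 += torch.tensor(small_grad, dtype=torch.bfloat16)
    acc_fp32 += small_grad

print(acc_bf16.item())  # still 1.0 -- every update was rounded away
print(acc_fp32.item())  # ~1.1, as expected
```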

So this is the feature request: backport the BF16Optimizer's fp32 gradient accumulation to ZeRO-1, 2, and 3.

Once this is done, I can safely point those who are interested in an easier solution to ZeRO.
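For reference, the requested behavior is roughly the following pattern (a minimal sketch, not the actual BF16Optimizer or ZeRO implementation): gradients are computed in bf16, but accumulated into persistent fp32 buffers between optimizer steps.

```python
# Minimal sketch of fp32 gradient accumulation for a bf16 model
# (illustrative only; not the BF16Optimizer implementation).
import torch

model = torch.nn.Linear(16, 4).to(torch.bfloat16)
fp32_grads = [torch.zeros(p.shape, dtype=torch.float32) for p in model.parameters()]

accum_steps = 8
for _ in range(accum_steps):
    x = torch.randn(2, 16, dtype=torch.bfloat16)
    loss = model(x).float().sum()
    loss.backward()
    for p, g32 in zip(model.parameters(), fp32_grads):
        g32 += p.grad.float()  # upcast to fp32 before accumulating
        p.grad = None          # drop the bf16 grad; the fp32 buffer holds the sum

# the optimizer step would then consume fp32_grads, e.g. applied to fp32
# master copies of the weights, before zeroing the buffers
```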

@tjruwase, @jeffra

stas00 avatar Sep 23 '22 20:09 stas00

@tjruwase, would it be possible to implement this? We are ready to start using ZeRO-3/bf16 for the multi-modal training.

Thank you very much!

stas00 avatar Oct 10 '22 16:10 stas00

@stas00, is it better to close this or #2768? They are the same thing, right?

tjruwase avatar Jan 31 '23 18:01 tjruwase

Hi Tunji - you're the owner, so it's up to you to decide. The new one is a duplicate of this one, and typically the earliest one stays.

And I don't agree with the other request that it should be hardcoded to fp32; it should be a user choice. That said, fp32 is most likely the sensible default for bf16 mixed precision training.

stas00 avatar Jan 31 '23 18:01 stas00

Makes sense, will close the newer one and reference this appropriately.

Yes, the accumulation type will be configurable. We hope to have a WIP pushed later this week. It would be great to get your usual feedback as we iterate on a solution.

tjruwase avatar Jan 31 '23 18:01 tjruwase

Fantastic news, Tunji. Thank you.

And, yes, we would be happy to experiment with your WIP PR.

stas00 avatar Jan 31 '23 18:01 stas00

Amazing, thanks!

michaelroyzen avatar Jan 31 '23 21:01 michaelroyzen

Hi, I've run into the same problem. @tjruwase, have you found a solution?

bestbzw avatar Feb 11 '23 06:02 bestbzw

Great, looking forward to seeing this new release!

danyang-rainbow avatar Feb 11 '23 15:02 danyang-rainbow

Any update on this @tjruwase?

michaelroyzen avatar Feb 24 '23 06:02 michaelroyzen

Please see: https://github.com/microsoft/DeepSpeed/pull/2847

stas00 avatar Feb 24 '23 06:02 stas00

Closing as completed by #2847.

tjruwase avatar Aug 10 '23 17:08 tjruwase
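For anyone landing here later, a sketch of what the resulting configuration looks like. This assumes the "data_types"/"grad_accum_dtype" config key introduced around PR #2847; verify the exact key name against the current DeepSpeed config documentation.

```python
# Hedged sketch: a DeepSpeed config dict enabling bf16 training with
# fp32 gradient accumulation. The "data_types" section is an assumption
# based on PR #2847 -- check the current DeepSpeed docs before relying on it.
ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "gradient_accumulation_steps": 16,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 3},
    "data_types": {"grad_accum_dtype": "fp32"},
}

# Typical usage (model and its parameters assumed to be defined elsewhere):
# engine, optimizer, _, _ = deepspeed.initialize(
#     model=model, model_parameters=model.parameters(), config=ds_config)
```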