
fix the bug of save bf16 optimizer state in the bf16+zero1+pp mode

Open L-hongbin opened this issue 2 years ago • 6 comments

Fix the bug "AttributeError: 'BF16_Optimizer' object has no attribute 'bit16_groups'", which is raised when saving the bf16 optimizer state while training a model with bf16 + zero1 + pp.
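To make the failure mode concrete, here is a minimal sketch that does not use DeepSpeed at all. `ToyBF16Optimizer` and `low_precision_groups` are hypothetical names invented for illustration, and the `bf16_groups` attribute is an assumption about how a bf16 wrapper might store its parameter groups; this is not the actual patch from this PR.

```python
class ToyBF16Optimizer:
    """Stand-in for a bf16 optimizer wrapper; NOT DeepSpeed's BF16_Optimizer."""

    def __init__(self, groups):
        # Assumption: the wrapper keeps its low-precision parameter groups
        # under `bf16_groups` and defines no `bit16_groups` attribute.
        self.bf16_groups = groups


def low_precision_groups(optimizer):
    """Hypothetical helper: try `bit16_groups`, then fall back to `bf16_groups`."""
    for name in ("bit16_groups", "bf16_groups"):
        groups = getattr(optimizer, name, None)
        if groups is not None:
            return groups
    raise AttributeError(
        f"{type(optimizer).__name__} exposes no low-precision parameter groups"
    )


opt = ToyBF16Optimizer(groups=[["w0", "w1"], ["b0"]])

# An unguarded save path that assumes every optimizer has `bit16_groups`
# fails with the same kind of AttributeError as in the report.
try:
    opt.bit16_groups
except AttributeError as err:
    print(f"unguarded access fails: {err}")

# The guarded lookup works for the bf16-style wrapper.
print(low_precision_groups(opt))
```

The point of the sketch is only that code shared between optimizer wrappers has to account for wrappers that name their parameter groups differently.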

L-hongbin avatar Jun 16 '23 02:06 L-hongbin

I have also encountered this problem. Both of the configurations below select 'BF16_Optimizer' in the optimizer selection logic. Despite the different config settings, they seem to follow the same execution path with bf16 + zero1 + pp. However, with the first configuration self.zero_optimization() is set to 1, whereas with the second it is set to 0. Attempting to save optimizer states with the first configuration triggers this issue.

Configuration 1:

"zero_optimization": { "stage": 1 }, "bf16": {"enabled": true}, "fp16": {"enabled": false}, "wall_clock_breakdown": false, "zero_allow_untested_optimizer": true, "data_types": { "grad_accum_dtype": "fp32" }

Configuration 2:

"zero_optimization": { "stage": 0 }, "bf16": {"enabled": true}, "fp16": {"enabled": false}, "wall_clock_breakdown": false

inkcherry avatar Aug 23 '23 07:08 inkcherry

@inkcherry, your analysis is correct. And yes, configuration 2 is misleading. But we use bf16_optimizer in this case for historical reasons from the bloom176b training and because we have not written a non-zero optimizer wrapper that works for bf16 training.
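The analysis confirmed here can be sketched in a few lines of plain Python. The two dictionaries mirror the configuration fragments quoted in this thread. `summarize` and its `uses_bf16_optimizer` rule are a hypothetical simplification of the behaviour described in the thread, not DeepSpeed's real selection code.

```python
config_1 = {
    "zero_optimization": {"stage": 1},
    "bf16": {"enabled": True},
    "fp16": {"enabled": False},
    "wall_clock_breakdown": False,
    "zero_allow_untested_optimizer": True,
    "data_types": {"grad_accum_dtype": "fp32"},
}

config_2 = {
    "zero_optimization": {"stage": 0},
    "bf16": {"enabled": True},
    "fp16": {"enabled": False},
    "wall_clock_breakdown": False,
}


def summarize(config):
    """Hypothetical summary of which code paths a config would enable."""
    stage = config.get("zero_optimization", {}).get("stage", 0)
    bf16 = config.get("bf16", {}).get("enabled", False)
    return {
        "zero_stage": stage,
        # ZeRO-specific code paths are taken only when the stage is non-zero.
        "zero_enabled": stage > 0,
        # Simplifying assumption taken from the thread: with bf16 enabled,
        # both stage 0 and stage 1 end up on the bf16 optimizer wrapper.
        "uses_bf16_optimizer": bf16 and stage <= 1,
    }


for name, cfg in (("configuration 1", config_1), ("configuration 2", config_2)):
    print(name, summarize(cfg))
```

Under these assumptions both configurations report `uses_bf16_optimizer` as true, but only configuration 1 reports `zero_enabled` as true. That difference is why only configuration 1 reaches the ZeRO-specific checkpoint code that expects `bit16_groups`.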

tjruwase avatar Aug 24 '23 01:08 tjruwase

I have the same error. If I modify DeepSpeed with this code now, will it work correctly with BF16?

zte-tcb avatar Sep 15 '23 09:09 zte-tcb

> I have the same error. If I modify DeepSpeed with this code now, will it work correctly with BF16?

Yes, please try the suggested fix.

tjruwase avatar Sep 15 '23 12:09 tjruwase

@L-hongbin, are you still working on this PR?

tjruwase avatar Sep 15 '23 12:09 tjruwase

@inkcherry, are you still interested in this PR? It seems @L-hongbin is no longer interested, so I would like to close it. Thanks!

tjruwase avatar Aug 09 '24 10:08 tjruwase