DeepSpeed: Fix the bug when saving the bf16 optimizer state in bf16+zero1+pp mode

Open L-hongbin opened this issue 2 years ago • 6 comments

Fixes the error "AttributeError: 'BF16_Optimizer' object has no attribute 'bit16_groups'", which is raised when training a model with bf16 + zero1 + pp and saving the bf16 optimizer state.
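The class of failure can be reproduced without DeepSpeed. The sketch below uses a hypothetical stand-in class (FakeBF16Optimizer, with an attribute name chosen only for illustration) to show how save code that assumes a bit16_groups attribute fails on an optimizer that does not define one, plus one defensive check a caller could run first. It is an illustration under those assumptions, not DeepSpeed's code and not the change proposed in this PR.

```python
class FakeBF16Optimizer:
    """Hypothetical stand-in; NOT DeepSpeed's BF16_Optimizer."""

    def __init__(self):
        # Attribute name chosen for illustration only (an assumption).
        self.bf16_groups = [["w0", "w1"]]


def save_assuming_bit16_groups(optimizer):
    # Mimics save code written for an optimizer that exposes `bit16_groups`.
    return list(optimizer.bit16_groups)


def has_bit16_groups(optimizer):
    # Defensive check a caller could run before taking the path above.
    return hasattr(optimizer, "bit16_groups")


opt = FakeBF16Optimizer()
try:
    save_assuming_bit16_groups(opt)
    error_message = None
except AttributeError as exc:
    # Same kind of error as the one reported in this PR.
    error_message = str(exc)

print(error_message)
print(has_bit16_groups(opt))
```

The point of the guard is only to show where the mismatch sits: the save path expects an attribute that the selected optimizer class does not provide.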

Jun 16 '23 02:06 L-hongbin

I have also encountered this problem. With either of the two configurations below, the optimizer selection logic picks the BF16 optimizer. Despite the different config settings, both seem to follow the same execution path as bf16 + zero1 + pp. However, with the first configuration, self.zero_optimization() is set to 1, whereas with the second it is set to 0. If you try to save optimizer states using the first configuration, you will hit this issue.

"zero_optimization": { "stage": 1 }, "bf16": {"enabled": true}, "fp16": {"enabled": false}, "wall_clock_breakdown": false, "zero_allow_untested_optimizer": true, "data_types": { "grad_accum_dtype": "fp32" }

"zero_optimization": { "stage": 0 }, "bf16": {"enabled": true}, "fp16": {"enabled": false}, "wall_clock_breakdown": false
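To make the difference between the two configurations easier to see, here is a small sketch that loads both fragments as JSON and lists the top-level keys whose values differ. The variable names and the helper differing_keys are hypothetical and exist only for this illustration; the sketch does not call DeepSpeed.

```python
import json

# The two fragments from this comment, wrapped in braces so they parse as JSON.
config_1 = json.loads("""
{
  "zero_optimization": { "stage": 1 },
  "bf16": { "enabled": true },
  "fp16": { "enabled": false },
  "wall_clock_breakdown": false,
  "zero_allow_untested_optimizer": true,
  "data_types": { "grad_accum_dtype": "fp32" }
}
""")

config_2 = json.loads("""
{
  "zero_optimization": { "stage": 0 },
  "bf16": { "enabled": true },
  "fp16": { "enabled": false },
  "wall_clock_breakdown": false
}
""")


def differing_keys(a, b):
    """Return the sorted top-level keys that are missing from one side or differ in value."""
    missing = object()  # sentinel so an absent key never equals a present one
    return sorted(k for k in set(a) | set(b) if a.get(k, missing) != b.get(k, missing))


diff = differing_keys(config_1, config_2)
print(diff)
print(config_1["zero_optimization"]["stage"], config_2["zero_optimization"]["stage"])
```

Both fragments enable bf16 and disable fp16; they differ in the ZeRO stage and in the two extra keys present only in the first one.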

Aug 23 '23 07:08 inkcherry

@inkcherry, your analysis is correct. And yes, configuration 2 is misleading. But we use bf16_optimizer in this case for historical reasons from the bloom176b training and because we have not written a non-zero optimizer wrapper that works for bf16 training.

Aug 24 '23 01:08 tjruwase

I have the same error. If I apply this code change to DeepSpeed now, will it work correctly with BF16?

Sep 15 '23 09:09 zte-tcb

> I have the same error. If I apply this code change to DeepSpeed now, will it work correctly with BF16?

Yes, please try the suggested fix.

Sep 15 '23 12:09 tjruwase

@L-hongbin, are you still working on this PR?

Sep 15 '23 12:09 tjruwase

@inkcherry, are you still interested in this PR? It seems @L-hongbin is no longer interested, so I would like to close it. Thanks!

Aug 09 '24 10:08 tjruwase
