DeepSpeed
Fix the bug of saving the bf16 optimizer state in bf16+zero1+pp mode
Fix the bug " AttributeError: 'BF16_Optimizer' object has no attribute 'bit16_groups' " when using bf16 + zero1 + pp to train model and saving the bf16 optimizer state
I have also encountered this problem. With both configurations below, the optimizer selection logic chooses `BF16_Optimizer`. Despite the differing config settings, they appear to follow the same execution path with bf16 + zero1 + pp; however, with the first configuration `self.zero_optimization()` returns 1, whereas with the second it returns 0 (see the sketch after the two configs below). Attempting to save optimizer states with the first configuration triggers this issue.
"zero_optimization": { "stage": 1 }, "bf16": {"enabled": true}, "fp16": {"enabled": false}, "wall_clock_breakdown": false, "zero_allow_untested_optimizer": true, "data_types": { "grad_accum_dtype": "fp32" }
"zero_optimization": { "stage": 0 }, "bf16": {"enabled": true}, "fp16": {"enabled": false}, "wall_clock_breakdown": false
@inkcherry, your analysis is correct. And yes, configuration 2 is misleading. We use bf16_optimizer in this case for historical reasons dating back to the bloom176b training, and because we have not yet written a non-ZeRO optimizer wrapper that works for bf16 training.
I have the same error. If I apply this code change to DeepSpeed now, will it work correctly with BF16?
Yes, please try the suggested fix.
@L-hongbin, are you still working on this PR?
@inkcherry, are you still interested in this PR? It seems @L-hongbin is no longer interested, so I would like to close it. Thanks!