
[ENHANCEMENT] Support ZeRO-2 distributed optimizer

hwdef opened this issue 1 year ago · 4 comments

Is your feature request related to a problem? Please describe.

As far as I know, the current distributed optimizer in Megatron-LM implements ZeRO-1, but ZeRO-1 does not save enough GPU memory. When I train a model, adding the --use-distributed-optimizer flag only drops per-GPU memory usage from 49 GB to 47 GB. I ran the same test on Megatron-DeepSpeed with ZeRO-2: even after doubling the micro-batch size (MBS), GPU memory usage was only 36 GB. I believe this larger reduction comes from ZeRO-2.
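For reference, this is roughly how the distributed optimizer (ZeRO-1-style sharding of optimizer state) is enabled in Megatron-LM. Only the --use-distributed-optimizer flag is the one under discussion; the model size, batch sizes, and data paths below are illustrative placeholders, not the exact configuration behind the numbers above:

```bash
# Minimal sketch of a Megatron-LM GPT pretraining launch with the
# distributed optimizer enabled. Every argument except
# --use-distributed-optimizer is a placeholder for a typical run.
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --num-layers 24 --hidden-size 1024 --num-attention-heads 16 \
    --seq-length 2048 --max-position-embeddings 2048 \
    --micro-batch-size 1 --global-batch-size 8 \
    --lr 1.5e-4 --train-iters 100000 \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file gpt2-vocab.json --merge-file gpt2-merges.txt \
    --data-path my-corpus_text_document \
    --use-distributed-optimizer
```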

Describe the solution you'd like

I hope Megatron-LM can also implement ZeRO-2.

Describe alternatives you've considered

https://github.com/microsoft/Megatron-DeepSpeed
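For comparison, a minimal sketch of how ZeRO-2 is typically enabled in Megatron-DeepSpeed via the standard DeepSpeed config mechanism; the config file name and batch size are placeholders. ZeRO-2 shards gradients across data-parallel ranks in addition to ZeRO-1's optimizer-state sharding, which is consistent with the larger memory savings described above:

```bash
# Write a DeepSpeed config with ZeRO stage 2 (gradient plus
# optimizer-state sharding), then pass it to the Megatron-DeepSpeed
# entry point via the standard --deepspeed flags.
cat > ds_config.json <<'EOF'
{
  "train_micro_batch_size_per_gpu": 2,
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "reduce_scatter": true
  },
  "bf16": { "enabled": true }
}
EOF

deepspeed pretrain_gpt.py \
    --deepspeed \
    --deepspeed_config ds_config.json
    # ...plus the usual Megatron model/data arguments
```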

hwdef · Nov 14 '23

Hi @lmcafee-nvidia, could you please help with this?

hwdef · Nov 14 '23

Marking as stale. No activity in 60 days.

github-actions[bot] · Jan 13 '24

Any progress?

MoFHeka · Feb 20 '24

Marking as stale. No activity in 60 days.

github-actions[bot] · Apr 21 '24

@shanmugamr1992 Can you please let us know how to enable ZeRO 1/2/3? Thanks.

polisettyvarma · Sep 18 '24

@shanmugamr1992 Please let me know how I can enable the ZeRO 1/2/3 feature. Raised #1156.

polisettyvarma · Sep 24 '24

We do not currently support ZeRO-2/3, but it is possible that we will support this in the future.

lmcafee-nvidia · Sep 24 '24