
[ENHANCEMENT] Support ZeRO-2 distributed optimizer

hwdef opened this issue 1 year ago · 4 comments

Is your feature request related to a problem? Please describe.

As far as I know, the current distributed optimizer in Megatron-LM implements ZeRO-1, but ZeRO-1 does not save enough GPU memory. When I train a model, adding the --use-distributed-optimizer flag only drops per-GPU memory usage from 49 GB to 47 GB. I ran the same test on Megatron-DeepSpeed with ZeRO-2: even after doubling the micro-batch size (MBS), GPU memory usage was only 36 GB. I believe this larger reduction comes from ZeRO-2.
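For reference, this is roughly how the distributed optimizer (ZeRO-1-style sharding of optimizer state) is enabled in Megatron-LM. Only the --use-distributed-optimizer flag is the one under discussion; the model size, batch sizes, and data paths below are illustrative placeholders, not the exact configuration behind the numbers above:

```bash
# Minimal sketch of a Megatron-LM GPT pretraining launch with the
# distributed optimizer enabled. Every argument except
# --use-distributed-optimizer is a placeholder for a typical run.
torchrun --nproc_per_node=8 pretrain_gpt.py \
    --num-layers 24 --hidden-size 1024 --num-attention-heads 16 \
    --seq-length 2048 --max-position-embeddings 2048 \
    --micro-batch-size 1 --global-batch-size 8 \
    --lr 1.5e-4 --train-iters 100000 \
    --tokenizer-type GPT2BPETokenizer \
    --vocab-file gpt2-vocab.json --merge-file gpt2-merges.txt \
    --data-path my-corpus_text_document \
    --use-distributed-optimizer
```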

Describe the solution you'd like

I hope Megatron-LM can also implement ZeRO-2.

Describe alternatives you've considered

https://github.com/microsoft/Megatron-DeepSpeed
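For comparison, a minimal sketch of how ZeRO-2 is typically enabled in Megatron-DeepSpeed via the standard DeepSpeed config mechanism; the config file name and batch size are placeholders. ZeRO-2 shards gradients across data-parallel ranks in addition to ZeRO-1's optimizer-state sharding, which is consistent with the larger memory savings described above:

```bash
# Write a DeepSpeed config with ZeRO stage 2 (gradient plus
# optimizer-state sharding), then pass it to the Megatron-DeepSpeed
# entry point via the standard --deepspeed flags.
cat > ds_config.json <<'EOF'
{
  "train_micro_batch_size_per_gpu": 2,
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "reduce_scatter": true
  },
  "bf16": { "enabled": true }
}
EOF

deepspeed pretrain_gpt.py \
    --deepspeed \
    --deepspeed_config ds_config.json
    # ...plus the usual Megatron model/data arguments
```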

hwdef · Nov 14 '23

Hi @lmcafee-nvidia, could you please help with this?

hwdef · Nov 14 '23

Marking as stale. No activity in 60 days.

github-actions[bot] · Jan 13 '24

Any progress?

MoFHeka · Feb 20 '24

Marking as stale. No activity in 60 days.

github-actions[bot] · Apr 21 '24

@shanmugamr1992 Can you please let us know how to enable ZeRO 1/2/3? Thanks.

polisettyvarma · Sep 18 '24

@shanmugamr1992 Please let me know how I can enable the ZeRO 1/2/3 feature. Raised #1156.

polisettyvarma · Sep 24 '24

We do not currently support ZeRO-2/3, but it is possible that we will support this in the future.

lmcafee-nvidia · Sep 24 '24