[BUG] deepspeed offload states is not compatible with ZeRO2

Open hijkzzz opened this issue 8 months ago • 2 comments

In OpenRLHF, we want to offload and reload the Adam optimizer states while also using auto tensor parallelism for RL training. However, I found that ZeRO-2 does not support offloading/reloading optimizer states. Could you please add support for this? @loadams

hijkzzz • Apr 25 '25 08:04

Tagging @tjruwase for FYI as well

loadams • May 20 '25 16:05

Hi @hijkzzz, since some parts of the implementation will be shared with ZeRO-3, could you help us add support for ZeRO-2 as well? The entry points are already defined in the engine (offload, reload), but the actual logic currently exists only in the ZeRO-3 optimizer. What we need is to remove the assertion that checks the ZeRO stage and to add offload_states and reload_states to the ZeRO-1/ZeRO-2 optimizer.

You can refer to the implementations of offload_states and reload_states in the ZeRO-3 optimizer. Offloading and reloading the Adam states in ZeRO-1/ZeRO-2 should work similarly.
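To make the shape of the change concrete, here is a minimal toy sketch in plain Python. It is not DeepSpeed code: ToyTensor, ToyZero2Optimizer, ToyEngine, and the offloaded bookkeeping dict are hypothetical stand-ins, and only the method names offload_states and reload_states come from this thread. The idea it illustrates is that the engine delegates to whichever optimizer implements the two methods, rather than asserting on the ZeRO stage.

```python
class ToyTensor:
    """Hypothetical stand-in for a tensor; it only tracks where the data lives."""

    def __init__(self, name, device="cuda"):
        self.name = name
        self.device = device

    def to(self, device):
        # A real tensor would copy its data here; the toy only relabels itself.
        self.device = device
        return self


class ToyZero2Optimizer:
    """Hypothetical ZeRO-1/2-style optimizer holding the two Adam moment buffers."""

    def __init__(self):
        self.state = {
            "exp_avg": ToyTensor("exp_avg"),
            "exp_avg_sq": ToyTensor("exp_avg_sq"),
        }
        self.offloaded = {}  # state name -> device it was moved away from

    def offload_states(self, device="cpu"):
        for name, tensor in self.state.items():
            if tensor.device != device:
                self.offloaded[name] = tensor.device
                tensor.to(device)

    def reload_states(self):
        # Move every offloaded buffer back to the device it came from.
        for name, original_device in self.offloaded.items():
            self.state[name].to(original_device)
        self.offloaded.clear()


class ToyEngine:
    """Hypothetical engine: checks for the capability instead of the ZeRO stage."""

    def __init__(self, optimizer):
        self.optimizer = optimizer

    def offload_states(self, device="cpu"):
        if not hasattr(self.optimizer, "offload_states"):
            raise NotImplementedError("optimizer does not implement offload_states")
        self.optimizer.offload_states(device)

    def reload_states(self):
        if not hasattr(self.optimizer, "reload_states"):
            raise NotImplementedError("optimizer does not implement reload_states")
        self.optimizer.reload_states()


engine = ToyEngine(ToyZero2Optimizer())
engine.offload_states()
devices_after_offload = [t.device for t in engine.optimizer.state.values()]
engine.reload_states()
devices_after_reload = [t.device for t in engine.optimizer.state.values()]
```

In a real PR the buffers would be the partitioned optimizer states held by the ZeRO-1/ZeRO-2 optimizer, and the move would have to deal with details the toy ignores, such as pinned memory and non-blocking copies.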

If you can draft a PR, we'd be happy to review it!

tohtana • May 20 '25 19:05