[BUG] DeepSpeed offload_states is not compatible with ZeRO-2
In OpenRLHF, we want to offload/reload the Adam optimizer states while also using automatic tensor parallelism for RL training. However, I found that ZeRO-2 is not compatible with offloading/reloading optimizer states. Could you please add support for this? @loadams
Tagging @tjruwase for visibility as well.
Hi @hijkzzz,
Since some parts will be shared with ZeRO-3, could you help us add support for ZeRO-2? The entry points are already defined in the engine: offload_states, reload_states. The actual logic currently exists only in the ZeRO-3 optimizer. What we need is to remove the assertion that checks the ZeRO stage and to add offload_states and reload_states to the ZeRO-1/ZeRO-2 optimizer.
You can refer to the offload_states and reload_states implementations in the ZeRO-3 optimizer. Offloading/reloading the Adam states will be similar.
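For illustration, here is a minimal PyTorch sketch of the core pattern: moving the Adam moment buffers off the accelerator and back. The helper names are hypothetical; the real ZeRO-1/ZeRO-2 implementation would also need to handle the partitioned fp32 master parameters, pinned host buffers, and non-blocking copies, as the ZeRO-3 version does.

```python
import torch

def offload_adam_states(optimizer, device="cpu"):
    # Hypothetical helper: move the per-parameter Adam moment buffers
    # (exp_avg, exp_avg_sq) to the offload device to free accelerator memory.
    for state in optimizer.state.values():
        for key in ("exp_avg", "exp_avg_sq"):
            if key in state:
                state[key] = state[key].to(device)

def reload_adam_states(optimizer, device):
    # Hypothetical helper: move the moment buffers back before the
    # next optimizer.step().
    for state in optimizer.state.values():
        for key in ("exp_avg", "exp_avg_sq"):
            if key in state:
                state[key] = state[key].to(device)
```

In the actual ZeRO-1/ZeRO-2 optimizer the tensors to move would be its partitioned optimizer-state and fp32-parameter buffers rather than the raw `optimizer.state` entries, but the offload/reload round trip is the same shape.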
If you can draft a PR, we'd be happy to review it!