DeepSpeed
[REQUEST] ZeRO stage 3 support for mixture-of-experts (MoE) layer
Hello everyone,
I've always wanted to run large models on minimal GPUs, as I only have a few at my disposal. That is why I was impressed that ZeRO-3 can run large models by offloading parameters from GPU memory to CPU memory.
However, I recently discovered that ZeRO-3 does not support MoE models, which came as a surprise to me. MoE is a well-known and effective way to increase model capacity, so I think it would make sense for ZeRO-3 to support it.
I'm wondering if it's true that ZeRO-3 does not support MoE model inference?
File "/root/deepspeed/ds/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1291, in _configure_optimizer
self.optimizer = self._configure_zero_optimizer(basic_optimizer)
File "/root/deepspeed/ds/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1575, in _configure_zero_optimizer
assert not self.has_moe_layers, "MoE not supported with Stage 3"
AssertionError: MoE not supported with Stage 3
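Roughly, the setup that hits this assertion looks like the sketch below; the MoE layer sizes, optimizer, and config values are illustrative placeholders rather than my actual code:

import torch
import deepspeed
from deepspeed.moe.layer import MoE

# Toy block whose feed-forward sublayer is a DeepSpeed MoE layer (sizes are placeholders).
class ToyMoEBlock(torch.nn.Module):
    def __init__(self, hidden=1024, num_experts=8):
        super().__init__()
        expert = torch.nn.Sequential(
            torch.nn.Linear(hidden, 4 * hidden),
            torch.nn.GELU(),
            torch.nn.Linear(4 * hidden, hidden),
        )
        self.moe = MoE(hidden_size=hidden, expert=expert, num_experts=num_experts, k=1)

    def forward(self, x):
        out, _, _ = self.moe(x)  # MoE returns (output, aux_loss, expert_counts)
        return out

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,                           # this is what trips the assertion
        "offload_param": {"device": "cpu"},
    },
}

model = ToyMoEBlock()
# deepspeed.initialize builds the engine; _configure_zero_optimizer then raises
# "MoE not supported with Stage 3" because the model contains MoE layers.
engine, _, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)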
What I find strange is that the Hugging Face DeepSpeed integration appears to support MoE layers with ZeRO-3...
Hi @ranggihwang -- thank you for your interest in DeepSpeed and ZeRO-3.
We do have some rationale for why we only support ZeRO-Stage-2 and MoE together. I am happy to hear more about the model scale you are experimenting with, but based on our experiments we were already able to train extremely large models and as such did not need to go to ZeRO-3.
Also, thanks for pointing out HF support. Can you please point me to that and how you are using that?
I am the lead developer/support-person on the MoE line and I am happy to help you get the largest MoE model trained with DeepSpeed.
Hi, @awan-10
I just want to run inference on two models, Google/switch-transformer-XXL and Google/switch-transformer-c, whose combined size is more than 500 GB.
Furthermore, even for smaller models, I think we need ZeRO-3 for MoE: with ZeRO-3, for example, we could run a 100 GB MoE model on a single 32 GB V100 GPU. Since not everyone has multiple GPUs, and it would also help minimize TCO, supporting ZeRO-3 for MoE would be great.
By the way, what is the rationale for supporting only ZeRO-2 for MoE?
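To make the memory math concrete, the kind of ZeRO-3 parameter-offload config I have in mind looks roughly like the sketch below; the bucket sizes and thresholds are placeholder values, not tuned settings:

# Sketch of a ZeRO-3 config where the full (100+ GB) parameter set lives in CPU RAM
# and only the working set of the currently executing layers sits on the 32 GB GPU.
# All numeric values below are illustrative placeholders.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
        "stage3_max_live_parameters": 1e8,         # cap on parameters resident on GPU
        "stage3_prefetch_bucket_size": 5e7,        # overlap CPU->GPU parameter fetches
        "stage3_param_persistence_threshold": 1e5  # keep only tiny params persistently on GPU
    },
}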
I have the same question. I'm trying to use ZeRO stage 3 offload to NVMe to run a large MoE model. If MoE is indeed not supported with stage 3, that means we can't use ZeRO offload on MoE models at all, right?
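For concreteness, offloading parameters to NVMe is a stage-3 feature, so the assertion above rules it out for MoE models. The config would look roughly like this sketch, where nvme_path and the buffer/aio numbers are placeholders:

# Sketch: same ZeRO-3 setup but with parameters offloaded to NVMe instead of CPU RAM.
# The path and buffer/aio sizes are placeholders, not recommended values.
ds_config_nvme = {
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",   # placeholder mount point for a fast local SSD
            "pin_memory": True,
            "buffer_count": 5,
            "buffer_size": 1e8,
        },
    },
    "aio": {
        "block_size": 1048576,
        "queue_depth": 8,
        "single_submit": False,
        "overlap_events": True,
    },
}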
Can you tell me the maximum MoE model size with ZeRO-2? When I trained an MoE model with 14B total parameters, I ran out of memory. Or am I missing something about memory optimization?
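For reference, these are the ZeRO-2 memory-optimization knobs I've been looking at; the offload and bucket values in the sketch are placeholders, not my exact settings:

# Sketch of ZeRO-2 memory optimizations: optimizer-state offload to CPU plus
# gradient bucketing. All numbers are illustrative placeholders.
ds_config_stage2 = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "contiguous_gradients": True,
        "overlap_comm": True,
        "reduce_bucket_size": 5e7,
        "allgather_bucket_size": 5e7,
    },
    # Only takes effect if the model routes activations through deepspeed.checkpointing.
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": True,
    },
}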