deepspeed_engine_wrapped Initialization Failure in ZeRO-3 Training
Hello! First, thank you for developing and maintaining this exceptional project. I'm encountering a distributed training issue related to DeepSpeed ZeRO-3 and would appreciate your guidance.
AttributeError: 'NoneType' object has no attribute 'backward'
File "/opt/conda/envs/tg_llava/lib/python3.10/site-packages/accelerate/accelerator.py", line 2238, in backward
self.deepspeed_engine_wrapped.backward(loss, **kwargs)
Could you please share a validated ZeRO-3 configuration template?
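For readers hitting the same traceback: as far as I can tell, `Accelerator.backward` delegates to `self.deepspeed_engine_wrapped` when DeepSpeed is the active backend, and that attribute stays `None` until `accelerator.prepare(...)` has wrapped a model. The sketch below reproduces only that mechanism. `TinyAccelerator`, `FakeEngine`, and `try_backward` are hypothetical stand-ins written for illustration, not accelerate's actual code.

```python
class FakeEngine:
    """Hypothetical stand-in for a DeepSpeed engine wrapper."""

    def __init__(self):
        self.calls = []

    def backward(self, loss, **kwargs):
        # Record the call instead of running a real backward pass.
        self.calls.append(loss)


class TinyAccelerator:
    """Hypothetical, simplified stand-in; NOT accelerate's implementation."""

    def __init__(self):
        # Unset until prepare() wraps a model.
        self.deepspeed_engine_wrapped = None

    def prepare(self, model):
        self.deepspeed_engine_wrapped = FakeEngine()
        return model

    def backward(self, loss, **kwargs):
        # Same shape as the failing line in the traceback.
        self.deepspeed_engine_wrapped.backward(loss, **kwargs)


def try_backward(acc, loss):
    """Return "ok" on success, or the AttributeError message on failure."""
    try:
        acc.backward(loss)
        return "ok"
    except AttributeError as exc:
        return str(exc)


acc = TinyAccelerator()
before = try_backward(acc, 1.0)   # engine not created yet
acc.prepare(model=object())
after = try_backward(acc, 1.0)    # engine exists now
print(before)
print(after)
```

If this matches the real cause, the thing to check is whether the model actually went through `accelerator.prepare(...)` while the DeepSpeed backend was active.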
Hi, thanks for your interest!
We don’t use DeepSpeed features during the RL procedure, as they may conflict with vLLM-based rollouts. However, we provide a ZeRO-3 config for TF-EVAL inference purposes here.
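For reference, a generic ZeRO-3 config can be sketched as below. This is not the project's linked config and has not been validated against this repository; it only uses standard DeepSpeed config keys. The `"auto"` values are an assumption that the config is consumed through the Hugging Face Trainer or accelerate integration, which fills them in; plain DeepSpeed needs concrete numbers. The function name and the output file name `zero3_sketch.json` are hypothetical.

```python
import json


def build_zero3_config():
    """Return a generic ZeRO-3 config dict (illustrative, not project-validated)."""
    return {
        "bf16": {"enabled": "auto"},
        "zero_optimization": {
            "stage": 3,  # partition optimizer state, gradients, and parameters
            "overlap_comm": True,
            "contiguous_gradients": True,
            # Gather full 16-bit weights when saving, so checkpoints are loadable.
            "stage3_gather_16bit_weights_on_model_save": True,
        },
        # "auto" is resolved by the HF Trainer / accelerate integration (assumption).
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "train_batch_size": "auto",
    }


config = build_zero3_config()
config_json = json.dumps(config, indent=2)

# To save it for a launcher, write config_json to a file, for example:
#   with open("zero3_sketch.json", "w") as f:   # hypothetical file name
#       f.write(config_json)
print(config_json)
```

Given the note above about conflicts with vLLM-based rollouts, a config like this is only a starting point for the inference or cold-start side, not for the RL stage.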
Thank you for your detailed previous response regarding the OpenThinkIMG framework. To better reproduce the experiments in our environment, could you kindly share the hardware configurations and approximate runtimes for both the Cold-start Period and the V-ToolRL Period? I look forward to your response.
I encountered the same issue. Without DeepSpeed, I run into OOM errors. Have you found a solution?