
deepspeed_engine_wrapped Initialization Failure in ZeRO-3 Training

Open zandfj opened this issue 7 months ago • 3 comments

Hello! First, thank you for developing and maintaining this exceptional project. I'm encountering a distributed training issue with DeepSpeed ZeRO-3 and would appreciate your guidance.

AttributeError: 'NoneType' object has no attribute 'backward'
File "/opt/conda/envs/tg_llava/lib/python3.10/site-packages/accelerate/accelerator.py", line 2238, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)

Could you please share a validated ZeRO-3 configuration template?

zandfj avatar Jun 03 '25 07:06 zandfj
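
For readers hitting the same traceback: it shows `accelerator.backward` reaching its DeepSpeed branch while `deepspeed_engine_wrapped` is still `None`. One plausible cause, which is an assumption and not confirmed in this thread, is that the run was launched with a DeepSpeed config but the model was never passed through `accelerator.prepare(...)`, the step where accelerate builds the engine wrapper. The sketch below is a hypothetical helper (`deepspeed_backward_ready` is not part of accelerate or OpenThinkIMG) that detects this state before calling `backward`; it uses stand-in objects so it runs without accelerate installed.

```python
from types import SimpleNamespace


def deepspeed_backward_ready(accelerator) -> bool:
    """Return False when backward() would hit the 'NoneType' error above.

    Hypothetical helper for illustration; it only reads two attributes.
    """
    dist_type = getattr(accelerator, "distributed_type", None)
    # accelerate exposes an enum here; compare by value so no import is needed.
    uses_deepspeed = getattr(dist_type, "value", dist_type) == "DEEPSPEED"
    engine = getattr(accelerator, "deepspeed_engine_wrapped", None)
    return not (uses_deepspeed and engine is None)


# Stand-ins for a real Accelerator object (assumption: same attribute names).
broken = SimpleNamespace(distributed_type="DEEPSPEED", deepspeed_engine_wrapped=None)
prepared = SimpleNamespace(distributed_type="DEEPSPEED", deepspeed_engine_wrapped=object())
no_deepspeed = SimpleNamespace(distributed_type="MULTI_GPU", deepspeed_engine_wrapped=None)

for name, acc in [("broken", broken), ("prepared", prepared), ("no_deepspeed", no_deepspeed)]:
    print(name, deepspeed_backward_ready(acc))
```

If the check fails in a real run, the thing to inspect is whether the training script calls `accelerator.prepare` on the model before the first `accelerator.backward(loss)`.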

Hi, thanks for your interest!

We don’t use DeepSpeed during the RL procedure, as it may conflict with vLLM-based rollouts. However, we provide a ZeRO-3 config for TF-EVAL inference purposes here.

ssmisya avatar Jun 05 '25 12:06 ssmisya
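
As a rough illustration of what a ZeRO-3 config usually contains, here is a minimal sketch built from standard DeepSpeed config keys. It is not the project's config, it has not been validated against OpenThinkIMG, and the CPU offload settings and the file name `zero3_sketch.json` are assumptions chosen for the example. The `"auto"` values rely on the Hugging Face Trainer/accelerate integration to fill them in.

```python
import json

# Illustrative ZeRO-3 settings; every value here is an assumption to tune.
zero3_config = {
    "bf16": {"enabled": "auto"},
    "zero_optimization": {
        "stage": 3,  # partition optimizer state, gradients, and parameters
        "overlap_comm": True,
        "contiguous_gradients": True,
        # CPU offload trades speed for GPU memory; drop these if memory allows.
        "offload_optimizer": {"device": "cpu", "pin_memory": True},
        "offload_param": {"device": "cpu", "pin_memory": True},
        "stage3_gather_16bit_weights_on_model_save": True,
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}


def write_config(path: str = "zero3_sketch.json") -> str:
    """Write the sketch to a JSON file that a launcher could point at."""
    with open(path, "w") as f:
        json.dump(zero3_config, f, indent=2)
    return path
```

Given the maintainers' note about vLLM-based rollouts, a config like this may still conflict with the RL stage, so treat it as a starting point for experiments only.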

Thank you for your detailed previous response regarding the OpenThinkIMG framework. To better reproduce the experiments in our environment, could you kindly share the hardware configurations and approximate runtimes for both the Cold-start Period and the V-ToolRL Period? I look forward to your response.

zandfj avatar Jun 11 '25 11:06 zandfj

I encountered the same issue. Without DeepSpeed, I run into OOM errors. Have you found a solution?

GaoXiaoshan avatar Nov 25 '25 02:11 GaoXiaoshan