DeepSpeedExamples
Errors when running the Step-3 RLHF fine-tuning script with enable_hybrid_engine
I was using the script step3_rlhf_finetuning/training_scripts/single_node/run_6.7b.sh with 7B Llama models as the actor and the critic and the enable_hybrid_engine argument set, and I got the error below:
/root/miniconda3/envs/coati/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py:398 in step

    395
    396     def step(self, lr_kwargs=None):
    397         super().step(lr_kwargs=lr_kwargs)
  ❱ 398         if(self._inference_containers[0].module.attention.attn_qkvw is not None and \
    399             self._inference_containers[0].q_k_v is not None):
    400             for inference_container in self._inference_containers:
    401                 inference_container.reset_qkv()

IndexError: list index out of range
How can I solve this issue? Thx : )
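For reference, this is roughly how I launch it. A minimal sketch only: the checkpoint paths are placeholders for my local Llama-7B models, and the flag names follow the ones already used in run_6.7b.sh.

```bash
# Sketch of my modified run_6.7b.sh (paths are placeholders for local Llama-7B checkpoints)
ACTOR_MODEL_PATH=/path/to/llama-7b
CRITIC_MODEL_PATH=/path/to/llama-7b-critic
OUTPUT=./output_step3_llama

deepspeed main.py \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --actor_zero_stage 3 \
   --critic_zero_stage 3 \
   --enable_hybrid_engine \
   --output_dir $OUTPUT
```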
Same error when using GLM as the actor.
Same error using LLaMA as the actor with ZeRO stage 0.
Hi, Llama is not supported yet. Please stay tuned :)
Looking forward!
With ZeRO stage 0, removing the enable_hybrid_engine option works.
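In script terms that just means dropping the flag from the launch command, e.g. something like the sketch below (placeholder paths; flag names as in run_6.7b.sh):

```bash
# Workaround sketch: ZeRO stage 0 for actor/critic and no --enable_hybrid_engine
deepspeed main.py \
   --actor_model_name_or_path /path/to/llama-7b \
   --critic_model_name_or_path /path/to/llama-7b-critic \
   --actor_zero_stage 0 \
   --critic_zero_stage 0 \
   --output_dir ./output_step3_llama
```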
Same error using LLaMA as the actor with ZeRO stage 3.
I just tried this workaround (removing enable_hybrid_engine), but I find the training process very slow!