
Errors encountered when running Stage-3 scripts with enable_hybrid_engine

DwarfWarriors opened this issue 1 year ago • 5 comments

I was using the script step3_rlhf_finetuning/training_scripts/single_node/run_6.7b.sh and ran into some errors. I used 7B LLaMA models as the actor and the critic, respectively, and set the enable_hybrid_engine argument. I got the error below:

```
/root/miniconda3/envs/coati/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py:398 in step

  395
  396     def step(self, lr_kwargs=None):
  397         super().step(lr_kwargs=lr_kwargs)
❱ 398         if(self._inference_containers[0].module.attention.attn_qkvw is not None and \
  399             self._inference_containers[0].q_k_v is not None):
  400             for inference_container in self._inference_containers:
  401                 inference_container.reset_qkv()

IndexError: list index out of range
```
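From a quick read of hybrid_engine.py, the check at line 398 indexes self._inference_containers[0] without guarding against an empty list, so if the hybrid engine found no transformer layers it knew how to wrap, indexing [0] raises. A minimal sketch of what I think is happening (the list name is copied from the traceback; the length-guarded variant at the end is only illustrative, not a proposed fix):

```python
# Minimal sketch of the failure mode: if the hybrid engine builds no
# inference containers for the model (e.g. an architecture it does not
# recognize), the list stays empty and indexing [0] raises IndexError.
_inference_containers = []  # empty: no supported layers were wrapped

try:
    # What the check at hybrid_engine.py:398 effectively does first:
    ok = _inference_containers[0] is not None
except IndexError as e:
    print(e)  # "list index out of range", matching the traceback above

# An illustrative length-guarded variant of the same check:
if len(_inference_containers) > 0 and _inference_containers[0] is not None:
    pass  # only reached when at least one container exists
```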

How can I solve this issue? Thx : )

DwarfWarriors avatar Apr 20 '23 09:04 DwarfWarriors

Same error when using GLM as the actor.

YaguangGong avatar Apr 20 '23 11:04 YaguangGong

Same error using LLaMA as the actor when zero stage = 0.

l294265421 avatar Apr 23 '23 12:04 l294265421

Hi, Llama is not supported yet. Please stay tuned :)

yaozhewei avatar Apr 24 '23 04:04 yaozhewei

Looking forward to it!

aimetrics avatar Apr 24 '23 06:04 aimetrics

> Same error using LLaMA as the actor when zero stage = 0.

When zero stage = 0, removing the enable_hybrid_engine option works.
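For context on what the flag toggles: as far as I can tell, --enable_hybrid_engine maps to the hybrid_engine section of the DeepSpeed config that the training script passes to deepspeed.initialize(). A rough sketch, with illustrative values rather than the exact contents of run_6.7b.sh:

```python
import deepspeed  # assumes a DeepSpeed version with hybrid engine support (>= 0.9)

# Illustrative config sketch: disabling "hybrid_engine" here should have the
# same effect as dropping --enable_hybrid_engine from the launch script,
# falling back to the regular (slower) generation path and sidestepping the
# unsupported-architecture code in hybrid_engine.py.
ds_config = {
    "train_batch_size": 8,               # illustrative value
    "zero_optimization": {"stage": 0},
    "hybrid_engine": {
        "enabled": False,                # i.e. hybrid engine off
    },
}

# model/optimizer come from the surrounding training script:
# engine, *_ = deepspeed.initialize(model=model, config=ds_config)
```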

l294265421 avatar Apr 24 '23 06:04 l294265421

Looking forward to it!

qinqinqaq avatar Apr 27 '23 05:04 qinqinqaq

Looking forward to it!

mcc311 avatar Apr 30 '23 05:04 mcc311

Same error using LLaMA as the actor when zero stage = 3.

alphanlp avatar May 03 '23 14:05 alphanlp

> > Same error using LLaMA as the actor when zero stage = 0.
>
> When zero stage = 0, removing the enable_hybrid_engine option works.

I just tried this, but I found the training process is very slow!

alphanlp avatar May 03 '23 14:05 alphanlp