DeepSpeedExamples

gpt ppo training error

Open lljjgg opened this issue 1 year ago • 6 comments

I am training the PPO model with GPT-2. I changed the model_name_or_path option from an OPT model to gpt2. Steps 1 and 2 completed, but step 3 fails with the following error:

Traceback (most recent call last)

/data/luojiangang/423_Deep/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py:522 in <module>

    519
    520
    521 if __name__ == "__main__":
  ❱ 522     main()
    523

/data/luojiangang/423_Deep/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py:431 in main

    428                 prompts = prompts[:, length - args.max_prompt_seq_len:]
    429                 raise ValueError("Prompt length is too long")
    430
  ❱ 431             out = trainer.generate_experience(prompts)
    432             exp_dataset = exp_mini_dataset.add(out)
    433
    434             if exp_dataset is not None:

/data/luojiangang/423_Deep/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py:97 in generate_experience

     94
     95     def generate_experience(self, prompts):
     96         self.eval()
  ❱  97         seq = self._generate_sequence(prompts)
     98         self.train()
     99
    100         pad_token_id = self.tokenizer.pad_token_id

/data/luojiangang/423_Deep/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py:91 in _generate_sequence

     88                 continue
     89             else:
     90                 out_seq.append(seq[i:i + 1])
  ❱  91         out_seq = torch.cat(out_seq, dim=0)  # concate output in the batch dim
     92
     93         return out_seq
     94

RuntimeError: torch.cat(): expected a non-empty list of Tensors

The following was also printed:

torch.Size([4, 50264])
torch.Size([4, 50264])
!!!! kernel execution error. (m: 2048, n: 4, k: 2048, error: 14)
!!!! kernel execution error. (m: 8192, n: 4, k: 2048, error: 13)
!!!! kernel execution error. (m: 2048, n: 4, k: 2048, error: 13)
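The last frame shows a loop that either skips a generated sequence (`continue`) or appends it to `out_seq`, followed by `torch.cat(out_seq, dim=0)`. The error message means `out_seq` was empty, so every sequence in the batch must have been skipped. The skip condition itself is not visible in the traceback. The sketch below assumes it drops sequences whose generated answer is too short, and it uses plain Python lists instead of tensors so that it runs without PyTorch. The function name `filter_generated` and the parameter `min_answer_len` are hypothetical and are not part of DeepSpeed-Chat.

```python
def filter_generated(seqs, prompt_len, pad_token_id, min_answer_len=2):
    """Keep sequences whose generated part has enough non-pad tokens.

    Returns None when nothing survives, so the caller can skip the batch
    instead of concatenating an empty list.
    Assumption: the real skip condition is not shown in the traceback;
    "answer too short" is only a guess used for illustration.
    """
    kept = []
    for seq in seqs:
        answer = seq[prompt_len:]  # tokens produced after the prompt
        valid_len = sum(1 for tok in answer if tok != pad_token_id)
        if valid_len < min_answer_len:
            continue  # same pattern as the `continue` branch in the traceback
        kept.append(seq)
    return kept if kept else None


# One usable sequence and one that is all padding after the prompt.
batch = [[1, 2, 3, 4, 5], [1, 2, 0, 0, 0]]
print(filter_generated(batch, prompt_len=2, pad_token_id=0))

# Every sequence is filtered out: this is the case that reaches torch.cat
# with an empty list in the traceback above.
print(filter_generated([[1, 2, 0, 0, 0]], prompt_len=2, pad_token_id=0))
```

If this assumption holds, the open question is why GPT-2 produces no usable answer tokens for any prompt in the batch, which may be related to how its pad and end-of-text tokens are configured.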

Do you know what causes this? Could you provide the training steps for gpt2? Looking forward to your reply.

lljjgg, Apr 26 '23 08:04