I ran the test program with `python train.py --actor-model facebook/opt-13b --reward-model facebook/opt-350m --num-gpus 8` and it ran normally. However, after I changed `max_answer_seq_len` to 1024 and `max_prompt_seq_len` to 1024 in `run_1.3b.sh`, the program reported the error below.
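For reference, the change was along these lines (a sketch of the edited launcher arguments only; the flag names follow the `--max_answer_seq_len` / `--max_prompt_seq_len` options of the step3 `main.py` argument parser, and the surrounding arguments are left unchanged):

```bash
# Excerpt of the modified run_1.3b.sh: both sequence-length limits
# raised from their defaults to 1024 (other launcher arguments omitted).
deepspeed main.py \
    --max_answer_seq_len 1024 \
    --max_prompt_seq_len 1024
```

Here is the full error output: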
```
_Time to load utils op: 0.00037217140197753906 seconds
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ /data/nfs/luojiangang/DeepSpeed/DeepSpeedExamples-master/applications/DeepSpeed-Chat/training/st │
│ ep3_rlhf_finetuning/main.py:516 in │
│ │
│ 513 │
│ 514 │
│ 515 if __name__ == "__main__": │
│ ❱ 516 │ main() │
│ 517 │
│ │
│ /data/nfs/luojiangang/DeepSpeed/DeepSpeedExamples-master/applications/DeepSpeed-Chat/training/st │
│ ep3_rlhf_finetuning/main.py:425 in main │
│ │
│ 422 │ │ │ │ prompts = prompts[:, length - args.max_prompt_seq_len:] │
│ 423 │ │ │ │ raise ValueError("Prompt length is too long") │
│ 424 │ │ │ │
│ ❱ 425 │ │ │ out = trainer.generate_experience(prompts) │
│ 426 │ │ │ exp_dataset = exp_mini_dataset.add(out) │
│ 427 │ │ │ │
│ 428 │ │ │ if exp_dataset is not None: │
│ │
│ /data/nfs/luojiangang/DeepSpeed/DeepSpeedExamples-master/applications/DeepSpeed-Chat/training/st │
│ ep3_rlhf_finetuning/ppo_trainer.py:97 in generate_experience │
│ │
│ 94 │ │
│ 95 │ def generate_experience(self, prompts): │
│ 96 │ │ self.eval() │
│ ❱ 97 │ │ seq = self._generate_sequence(prompts) │
│ 98 │ │ self.train() │
│ 99 │ │ │
│ 100 │ │ pad_token_id = self.tokenizer.pad_token_id │
│ │
│ /data/nfs/luojiangang/DeepSpeed/DeepSpeedExamples-master/applications/DeepSpeed-Chat/training/st │
│ ep3_rlhf_finetuning/ppo_trainer.py:73 in _generate_sequence │
│ │
│ 70 │ │ max_min_length = self.max_answer_seq_len + prompts.shape[1] │
│ 71 │ │ │
│ 72 │ │ with torch.no_grad(): │
│ ❱ 73 │ │ │ seq = self.actor_model.module.generate(prompts, │
│ 74 │ │ │ │ │ │ │ │ │ │ │ │ max_length=max_min_length, │
│ 75 │ │ │ │ │ │ │ │ │ │ │ │ min_length=max_min_length) │
│ 76 │
│ /opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py:258 in generate │
│ │
│ 255 │ │ │ │
│ 256 │ │ │ if len(self.all_lora_params) > 0: │
│ 257 │ │ │ │ if (not self.Z3_enabled): │
│ ❱ 258 │ │ │ │ │ self.unfuse_lora_weight() │
│ 259 │ │ │ │ else: │
│ 260 │ │ │ │ │ self.unfuse_lora_weight_non_pinned() │
│ 261 │ │ │ │ self.is_lora_fused = False │
│ │
│ /opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py:144 in │
│ unfuse_lora_weight │
│ │
│ 141 │ │
│ 142 │ def unfuse_lora_weight(self): │
│ 143 │ │ for layer_id in range(len(self.layer_params)): │
│ ❱ 144 │ │ │ self._unfuse_lora(self.layer_params[layer_id], self.lora_params[layer_id]) │
│ 145 │ │
│ 146 │ def unfuse_lora_weight_non_pinned(self): │
│ 147 │ │ for layer_id in range(len(self.layer_params)): │
│ │
│ /opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py:140 in _unfuse_lora │
│ │
│ 137 │ │ │ │ lora_right_weight, \ │
│ 138 │ │ │ │ lora_left_weight, \ │
│ 139 │ │ │ │ lora_scaling = lora_param │
│ ❱ 140 │ │ │ │ weight.data -= lora_scaling * torch.matmul(lora_left_weight.t(), lora_ri │
│ 141 │ │
│ 142 │ def unfuse_lora_weight(self): │
│ 143 │ │ for layer_id in range(len(self.layer_params)): │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasGemmEx( handle, opa, opb, m, n, k, &falpha, a, CUDA_R_16F, lda, b, CUDA_R_16F, ldb, &fbeta, c, CUDA_R_16F, ldc, CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP)
```
Is this a bug? Or is there another way for the program to support longer `answer_seq` and `prompt_seq` lengths? We look forward to your reply.
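If it helps to localize the failure, the run can be repeated with synchronous CUDA kernel launches so the error is reported at the exact failing call instead of asynchronously; `CUDA_LAUNCH_BLOCKING=1` is a standard CUDA/PyTorch debugging setting, and the command otherwise mirrors the invocation above:

```bash
# Rerun the same command with synchronous CUDA launches; the resulting
# stack trace then points at the kernel that actually failed.
CUDA_LAUNCH_BLOCKING=1 python train.py \
    --actor-model facebook/opt-13b \
    --reward-model facebook/opt-350m \
    --num-gpus 8
```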