
"RuntimeError: The size of tensor a (5120) must match the size of tensor b (20480) at non-singleton dimension 0" in step3

Open · oolongoo opened this issue on Jul 03 '23 · 5 comments

I have successfully run steps 1 and 2 and generated the models, but running step 3 fails with: "RuntimeError: The size of tensor a (5120) must match the size of tensor b (20480) at non-singleton dimension 0"

DeepSpeed 0.10.0, CUDA 11.7, PyTorch 1.13.1

Running on 4 × A10 (24 GB) GPUs.

Run command:

# python train.py --step 3 --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node
bash /mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_node/run_13b.sh /mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/13b /mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m '' '' /mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/output/step3-models/13b

run_13b.sh:

#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

# DeepSpeed Team
ACTOR_MODEL_PATH=$1
CRITIC_MODEL_PATH=$2
ACTOR_ZERO_STAGE=$3
CRITIC_ZERO_STAGE=$4
OUTPUT=$5
if [ "$OUTPUT" == "" ]; then
    OUTPUT=./output
fi
if [ "$ACTOR_ZERO_STAGE" == "" ]; then
    ACTOR_ZERO_STAGE=3
fi
if [ "$CRITIC_ZERO_STAGE" == "" ]; then
    CRITIC_ZERO_STAGE=3
fi
mkdir -p $OUTPUT

Num_Padding_at_Beginning=1 # this is model related

Actor_Lr=5e-4
Critic_Lr=5e-6

deepspeed --master_port 12346 main.py \
   --data_path Dahoas/rm-static \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 2 \
   --per_device_mini_train_batch_size 2 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --inference_tp_size 2 \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --actor_gradient_checkpointing \
   --disable_actor_dropout \
   --actor_lora_dim 128 \
   --actor_lora_module_name decoder.layers. \
   --output_dir $OUTPUT \
    &> $OUTPUT/training.log

Error log:

192.168.1.51: *****************[end] Initialized Reward Model [end] (duration: 10.30s)******************
192.168.1.51: ***** Running training *****
192.168.1.51: Beginning of Epoch 1/1, Total Generation Batches 954
192.168.1.54: Traceback (most recent call last):
192.168.1.54:   File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 521, in <module>
192.168.1.54:     main()
192.168.1.54:   File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 429, in main
192.168.1.54:     out = trainer.generate_experience(batch_prompt['prompt'],
192.168.1.54:   File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 98, in generate_experience
192.168.1.54:     seq = self._generate_sequence(prompts, mask)
192.168.1.54:   File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 73, in _generate_sequence
192.168.1.54:     seq = self.actor_model.module.generate(prompts,
192.168.1.54:   File "/opt/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py", line 207, in generate
192.168.1.51: Traceback (most recent call last):
192.168.1.51:   File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 521, in <module>
192.168.1.54:     self._fuse_lora(self.layer_params[layer_id], self.lora_params[layer_id])
192.168.1.54:   File "/opt/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py", line 137, in _fuse_lora
192.168.1.54:     weight.data += lora_scaling * torch.matmul(lora_left_weight.t(), lora_right_weight.t())
192.168.1.54: RuntimeError: The size of tensor a (5120) must match the size of tensor b (20480) at non-singleton dimension 0
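
For context, the failing call in the traceback is the hybrid engine's LoRA fusion (_fuse_lora in deepspeed/runtime/hybrid_engine.py), and the run script above enables both the hybrid engine with --inference_tp_size 2 and LoRA with --actor_lora_dim 128. For OPT-13B the hidden size is 5120 and the FFN width is 4 × 5120 = 20480, which are exactly the two sizes in the error. Below is a minimal, self-contained sketch that reproduces the same shape mismatch; the weight and LoRA shapes are assumptions chosen only to trigger the message, not the engine's actual partitioning:

import torch

hidden = 5120            # OPT-13B hidden size (tensor a in the error)
ffn = 4 * hidden         # 20480, OPT-13B FFN width (tensor b in the error)
lora_dim = 128           # --actor_lora_dim 128 from run_13b.sh

# Assumed, illustrative shapes: a base weight that only spans the hidden
# dimension, and LoRA factors whose product spans the full FFN width.
weight = torch.zeros(hidden, hidden)
lora_right_weight = torch.zeros(hidden, lora_dim)   # .t() -> (lora_dim, hidden)
lora_left_weight = torch.zeros(lora_dim, ffn)       # .t() -> (ffn, lora_dim)
lora_scaling = 1.0

# Same statement as _fuse_lora in deepspeed/runtime/hybrid_engine.py:
weight.data += lora_scaling * torch.matmul(lora_left_weight.t(), lora_right_weight.t())
# RuntimeError: The size of tensor a (5120) must match the size of tensor b (20480)
# at non-singleton dimension 0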

oolongoo · Jul 03 '23 03:07