"RuntimeError: The size of tensor a (5120) must match the size of tensor b (20480) at non-singleton dimension 0" in step3
I have successfully run step 1 and step 2 and generated the models, but encountered an error when running step 3: "RuntimeError: The size of tensor a (5120) must match the size of tensor b (20480) at non-singleton dimension 0"
Environment: DeepSpeed 0.10.0, CUDA 11.7, PyTorch 1.13.1
Hardware: 4 × A10 (24 GB)
run script:
# python train.py --step 3 --actor-model facebook/opt-13b --reward-model facebook/opt-350m --deployment-type single_node
bash /mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/training_scripts/single_node/run_13b.sh /mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/output/actor-models/13b /mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/output/reward-models/350m '' '' /mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/output/step3-models/13b
run_13b.sh:
#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0
# DeepSpeed Team
ACTOR_MODEL_PATH=$1
CRITIC_MODEL_PATH=$2
ACTOR_ZERO_STAGE=$3
CRITIC_ZERO_STAGE=$4
OUTPUT=$5
if [ "$OUTPUT" == "" ]; then
OUTPUT=./output
fi
if [ "$ACTOR_ZERO_STAGE" == "" ]; then
ACTOR_ZERO_STAGE=3
fi
if [ "$CRITIC_ZERO_STAGE" == "" ]; then
CRITIC_ZERO_STAGE=3
fi
mkdir -p $OUTPUT
Num_Padding_at_Beginning=1 # this is model related
Actor_Lr=5e-4
Critic_Lr=5e-6
deepspeed --master_port 12346 main.py \
    --data_path Dahoas/rm-static \
    --data_split 2,4,4 \
    --actor_model_name_or_path $ACTOR_MODEL_PATH \
    --critic_model_name_or_path $CRITIC_MODEL_PATH \
    --num_padding_at_beginning 1 \
    --per_device_train_batch_size 2 \
    --per_device_mini_train_batch_size 2 \
    --generation_batch_numbers 1 \
    --ppo_epochs 1 \
    --max_answer_seq_len 256 \
    --max_prompt_seq_len 256 \
    --actor_learning_rate ${Actor_Lr} \
    --critic_learning_rate ${Critic_Lr} \
    --num_train_epochs 1 \
    --lr_scheduler_type cosine \
    --gradient_accumulation_steps 1 \
    --num_warmup_steps 100 \
    --deepspeed --seed 1234 \
    --enable_hybrid_engine \
    --inference_tp_size 2 \
    --actor_zero_stage $ACTOR_ZERO_STAGE \
    --critic_zero_stage $CRITIC_ZERO_STAGE \
    --actor_gradient_checkpointing \
    --disable_actor_dropout \
    --actor_lora_dim 128 \
    --actor_lora_module_name decoder.layers. \
    --output_dir $OUTPUT \
    &> $OUTPUT/training.log
error log:
192.168.1.51: *****************[end] Initialized Reward Model [end] (duration: 10.30s)******************
192.168.1.51: ***** Running training *****
192.168.1.51: Beginning of Epoch 1/1, Total Generation Batches 954
192.168.1.54: Traceback (most recent call last):
192.168.1.54: File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 521, in <module>
192.168.1.54: main()
192.168.1.54: File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 429, in main
192.168.1.54: out = trainer.generate_experience(batch_prompt['prompt'],
192.168.1.54: File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 98, in generate_experience
192.168.1.54: seq = self._generate_sequence(prompts, mask)
192.168.1.54: File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 73, in _generate_sequence
192.168.1.54: seq = self.actor_model.module.generate(prompts,
192.168.1.54: File "/opt/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py", line 207, in generate
192.168.1.51: Traceback (most recent call last):
192.168.1.51: File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 521, in <module>
192.168.1.54: self._fuse_lora(self.layer_params[layer_id], self.lora_params[layer_id])
192.168.1.54: File "/opt/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py", line 137, in _fuse_lora
192.168.1.54: weight.data += lora_scaling * torch.matmul(lora_left_weight.t(), lora_right_weight.t())
192.168.1.54: RuntimeError: The size of tensor a (5120) must match the size of tensor b (20480) at non-singleton dimension 0
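For reference, the failing line in hybrid_engine._fuse_lora adds the LoRA update back into the base weight before generation. The sketch below only illustrates the shape arithmetic, assuming OPT-13b-like dimensions (hidden = 5120, FFN = 20480) and the script's --actor_lora_dim 128; which layer or TP shard is actually mismatched on the failing rank is not clear from the log alone.

import torch

# Illustrative shapes only: OPT-13b hidden size 5120, FFN size 20480, lora_dim 128.
hidden, ffn, lora_dim = 5120, 20480, 128
lora_scaling = 1.0

# DeepSpeed-Chat's LinearLayer_LoRA stores the factors roughly as
#   lora_right_weight: [in_features, lora_dim], lora_left_weight: [lora_dim, out_features]
lora_right_weight = torch.randn(hidden, lora_dim)
lora_left_weight = torch.randn(lora_dim, ffn)

# The fuse step from hybrid_engine._fuse_lora:
delta = lora_scaling * torch.matmul(lora_left_weight.t(), lora_right_weight.t())
print(delta.shape)  # torch.Size([20480, 5120]) -- an [out_features, in_features] update

# Hypothetical failing case: the hybrid engine's inference-side copy of the weight has a
# different shape (here [5120, 5120] for illustration), so the in-place add cannot
# broadcast and raises exactly the error seen above.
weight = torch.zeros(hidden, hidden)
try:
    weight += delta
except RuntimeError as e:
    print(e)  # The size of tensor a (5120) must match the size of tensor b (20480)
              # at non-singleton dimension 0

If the LoRA delta only lines up with the full, un-sharded training layout of the weight, that would be consistent with the problem appearing when LoRA fusion and --inference_tp_size 2 are combined.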
@oolongoo -- can you please update to the latest DeepSpeedExamples and DeepSpeed and try again? Some LoRA-related fixes have been merged today (https://github.com/microsoft/DeepSpeed/pull/3563) so please try and let us know.
Got a new error with the newest master:
192.168.1.51: Traceback (most recent call last):
192.168.1.51: File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 521, in <module>
192.168.1.51: main()
192.168.1.51: File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py", line 429, in main
192.168.1.51: out = trainer.generate_experience(prompts,
192.168.1.51: File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 101, in generate_experience
192.168.1.51: seq = self._generate_sequence(prompts, mask)
192.168.1.51: File "/mnt/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step3_rlhf_finetuning/ppo_trainer.py", line 73, in _generate_sequence
192.168.1.51: seq = self.actor_model.module.generate(
192.168.1.51: File "/opt/anaconda3/lib/python3.10/site-packages/deepspeed/runtime/hybrid_engine.py", line 234, in generate
192.168.1.51: generate_ret_vals = self._generate(*inputs, **kwargs)
192.168.1.51: File "/opt/anaconda3/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
192.168.1.51: return func(*args, **kwargs)
192.168.1.51: File "/opt/anaconda3/lib/python3.10/site-packages/transformers/generation/utils.py", line 1527, in generate
192.168.1.51: return self.greedy_search(
192.168.1.51: File "/opt/anaconda3/lib/python3.10/site-packages/transformers/generation/utils.py", line 2349, in greedy_search
192.168.1.51: outputs = self(
192.168.1.51: File "/opt/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
192.168.1.51: result = forward_call(*input, **kwargs)
192.168.1.51: File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 944, in forward
192.168.1.51: outputs = self.model.decoder(
192.168.1.51: File "/opt/anaconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1212, in _call_impl
192.168.1.51: result = forward_call(*input, **kwargs)
192.168.1.51: File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 650, in forward
192.168.1.51: causal_attention_mask = self._prepare_decoder_attention_mask(
192.168.1.51: File "/opt/anaconda3/lib/python3.10/site-packages/transformers/models/opt/modeling_opt.py", line 551, in _prepare_decoder_attention_mask
192.168.1.51: expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
192.168.1.51: RuntimeError: The size of tensor a (4) must match the size of tensor b (16) at non-singleton dimension 0
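For context, this second failure is inside OPT's _prepare_decoder_attention_mask, where the causal mask built from input_ids is added to the padding mask expanded from attention_mask. A minimal sketch of the broadcast failure, assuming the two tensors reach the model with different batch sizes (16 vs 4, which would match the sizes in the traceback):

import torch

# Assumed sizes: prompt length 256 (per --max_prompt_seq_len) and mismatched batch
# dimensions of 16 and 4, as reported in the log.
seq_len = 256
combined_attention_mask = torch.zeros(16, 1, seq_len, seq_len)  # causal mask, from input_ids
expanded_attn_mask = torch.zeros(4, 1, seq_len, seq_len)        # padding mask, from attention_mask

try:
    _ = expanded_attn_mask + combined_attention_mask
except RuntimeError as e:
    print(e)  # The size of tensor a (4) must match the size of tensor b (16)
              # at non-singleton dimension 0

If that reading is right, input_ids and attention_mask end up with different batch sizes on this rank once hybrid-engine tensor parallelism is enabled, which would fit the later observation that the run works without TP.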
Got the same error.
Similar here, except the mismatched sizes are 6144 and 8192.
Same for me. It seems to be a TP-related bug; it works fine when TP is not enabled.