DeepSpeedExamples
[DeepSpeedExamples/applications/DeepSpeed-Chat/] Error happened when running step3_rlhf_finetuning in enable_hybrid_engine mode with togethercomputer/GPT-NeoXT-Chat-Base-20B
Error info:
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 99, in new_inference_container
    _container.create_ds_model_config()
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/containers/base.py", line 79, in create_ds_model_config
    self.set_hidden_heads(*self.policy.get_hidden_heads())
  File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/containers/gptneox.py", line 95, in get_hidden_heads
    return self.client_module.attention.query_key_value.weight.shape[1],
IndexError: tuple index out of range
I printed self.client_module.attention.query_key_value.weight.shape and the result is torch.Size([0]), so indexing shape[1] fails.
Does DeepSpeed-Chat support togethercomputer/GPT-NeoXT-Chat-Base-20B with --enable_hybrid_engine?
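For readers following along, here is a minimal sketch of why a weight whose shape is torch.Size([0]) produces this exact error. torch.Size is a tuple subclass, so a plain tuple stands in for it here and no DeepSpeed or PyTorch install is needed. The sketch assumes that the empty shape comes from ZeRO stage 3 partitioning (the script defaults to stage 3) and that the unpartitioned shape is recorded in a `ds_shape` attribute on the parameter. The helper `full_shape` and the example dimensions are hypothetical, for illustration only; this is not the DeepSpeed fix.

```python
from types import SimpleNamespace


def full_shape(weight):
    """Return the recorded full shape if present, else the local shape.

    Assumption: under ZeRO stage 3 the local parameter is a placeholder of
    shape (0,) and the original shape is kept in `ds_shape`.
    """
    return tuple(getattr(weight, "ds_shape", weight.shape))


# Stand-ins for parameters; the dimensions are illustrative only.
partitioned = SimpleNamespace(shape=(0,), ds_shape=(18432, 6144))
regular = SimpleNamespace(shape=(18432, 6144))

# Indexing the local placeholder shape reproduces the reported failure.
try:
    partitioned.shape[1]
except IndexError as exc:
    error_message = str(exc)

print(error_message)
print(full_shape(partitioned)[1])
print(full_shape(regular)[1])
```

If the assumption holds, any code path that reads weight.shape directly on a partitioned parameter would hit the same IndexError, which matches the traceback in gptneox.py.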
My launch script is:
#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0
ACTOR_MODEL_PATH="togethercomputer/GPT-NeoXT-Chat-Base-20B"
CRITIC_MODEL_PATH="togethercomputer/GPT-NeoXT-Chat-Base-20B"
ACTOR_ZERO_STAGE=$3
CRITIC_ZERO_STAGE=$4
OUTPUT=$5
if [ "$OUTPUT" == "" ]; then
    OUTPUT=/home/notebook/data/personal/deepspeed-llama/RLHF
fi
if [ "$ACTOR_ZERO_STAGE" == "" ]; then
    ACTOR_ZERO_STAGE=3
fi
if [ "$CRITIC_ZERO_STAGE" == "" ]; then
    CRITIC_ZERO_STAGE=3
fi
mkdir -p "$OUTPUT"
Num_Padding_at_Beginning=1 # this is model related
Actor_Lr=9.65e-6
Critic_Lr=5e-6
python -m torch.distributed.launch --nproc_per_node=8 /home/notebook/data/personal/80350607/0472/code/dev/llama/star-acc/StarEngine/nlp/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py \
--data_split 2,4,4 \
--actor_model_name_or_path $ACTOR_MODEL_PATH \
--critic_model_name_or_path $CRITIC_MODEL_PATH \
--num_padding_at_beginning 1 \
--per_device_train_batch_size 4 \
--per_device_mini_train_batch_size 4 \
--generation_batch_numbers 1 \
--inference_tp_size 1 \
--tp_gather_partition_size 1 \
--ppo_epochs 1 \
--max_answer_seq_len 256 \
--max_prompt_seq_len 256 \
--actor_learning_rate ${Actor_Lr} \
--critic_learning_rate ${Critic_Lr} \
--actor_weight_decay 0.1 \
--critic_weight_decay 0.1 \
--num_train_epochs 1 \
--lr_scheduler_type cosine \
--gradient_accumulation_steps 1 \
--num_warmup_steps 100 \
--deepspeed --seed 1234 \
--enable_hybrid_engine \
--actor_zero_stage $ACTOR_ZERO_STAGE \
--critic_zero_stage $CRITIC_ZERO_STAGE \
--output_dir $OUTPUT
I am also encountering this issue. Is there an easy fix? Thanks.
Any updates?