DeepSpeedExamples [DeepSpeedExamples/applications/DeepSpeed-Chat/] Error happened when running step3_rlhf_finetuning in enable_hybrid_engine mode with togethercomputer/GPT-NeoXT-Chat-Base-20B

Open GxjGit opened this issue 2 years ago • 2 comments

Error info:

File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 99, in new_inference_container
    _container.create_ds_model_config()
File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/containers/base.py", line 79, in create_ds_model_config
    self.set_hidden_heads(*self.policy.get_hidden_heads())
File "/opt/conda/lib/python3.8/site-packages/deepspeed/module_inject/containers/gptneox.py", line 95, in get_hidden_heads
    return self.client_module.attention.query_key_value.weight.shape[1],
IndexError: tuple index out of range

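I printed self.client_module.attention.query_key_value.weight.shape, and the result is torch.Size([0]). The weight is therefore an empty one-dimensional tensor at this point, so indexing shape[1] raises the IndexError.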
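For anyone else debugging this, here is a minimal sketch of why the lookup fails and of one possible way to guard it. It does not use torch or DeepSpeed: a plain tuple stands in for the torch.Size([0]) reported above. The sketch assumes (not confirmed in this thread) that the weight is empty because ZeRO stage 3 has partitioned it, and that DeepSpeed records the original shape in a `ds_shape` attribute; please verify both against your DeepSpeed version. The helper `full_shape` and the dimensions (12, 4) are hypothetical and purely illustrative.

```python
from types import SimpleNamespace


def full_shape(param):
    """Return the recorded un-partitioned shape if present, else the visible shape.

    ASSUMPTION: `ds_shape` is the attribute that holds the original shape of a
    ZeRO stage 3 partitioned parameter. This helper is hypothetical, not part
    of DeepSpeed.
    """
    return tuple(getattr(param, "ds_shape", param.shape))


# Stand-in for the partitioned weight: the visible shape mirrors torch.Size([0]).
# The (12, 4) dimensions are made up for illustration.
partitioned = SimpleNamespace(shape=(0,), ds_shape=(12, 4))

# Reproduce the failure: a one-element shape has no index 1.
try:
    partitioned.shape[1]
    error_message = None
except IndexError as exc:
    error_message = str(exc)

# Reading the recorded shape instead avoids the out-of-range index.
hidden_size = full_shape(partitioned)[1]

# A parameter without `ds_shape` falls back to its visible shape.
regular = SimpleNamespace(shape=(3, 5))
regular_shape = full_shape(regular)
```

If the assumption holds, this would only explain the symptom; whether the hybrid engine's GPT-NeoX container is expected to work with ZeRO stage 3 weights is still the open question here.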

Does DeepSpeed-Chat support togethercomputer/GPT-NeoXT-Chat-Base-20B with --enable_hybrid_engine?

My running script is:

#!/bin/bash
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0

ACTOR_MODEL_PATH="togethercomputer/GPT-NeoXT-Chat-Base-20B"
CRITIC_MODEL_PATH="togethercomputer/GPT-NeoXT-Chat-Base-20B"

ACTOR_ZERO_STAGE=$3
CRITIC_ZERO_STAGE=$4
OUTPUT=$5
if [ "$OUTPUT" == "" ]; then
    OUTPUT=/home/notebook/data/personal/deepspeed-llama/RLHF
fi
if [ "$ACTOR_ZERO_STAGE" == "" ]; then
    ACTOR_ZERO_STAGE=3
fi
if [ "$CRITIC_ZERO_STAGE" == "" ]; then
    CRITIC_ZERO_STAGE=3
fi
mkdir -p $OUTPUT

Num_Padding_at_Beginning=1 # this is model related

Actor_Lr=9.65e-6
Critic_Lr=5e-6

python -m torch.distributed.launch --nproc_per_node=8 /home/notebook/data/personal/80350607/0472/code/dev/llama/star-acc/StarEngine/nlp/DeepSpeed-Chat/training/step3_rlhf_finetuning/main.py \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 4 \
   --per_device_mini_train_batch_size 4 \
   --generation_batch_numbers 1 \
   --inference_tp_size 1 \
   --tp_gather_partition_size 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --actor_weight_decay 0.1 \
   --critic_weight_decay 0.1 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --output_dir $OUTPUT \

Apr 27 '23 09:04 GxjGit

I am also encountering this issue. Is there an easy fix? Thanks.

May 05 '23 13:05 leuchine

Any updates on this?

Nov 30 '23 17:11 Ricardokevins
