step1 SFT with LoRA fails
env
gpu: 4*A100 80G
pytorch: 1.13.1
cuda version: 11.7
deepspeed: 0.9.0
transformers: 4.28.0.dev
run script
OUTPUT=$1
ZERO_STAGE=3
if [ "$OUTPUT" == "" ]; then
OUTPUT=./output
fi
if [ "$ZERO_STAGE" == "" ]; then
ZERO_STAGE=3
fi
mkdir -p $OUTPUT
deepspeed main.py \
--data_path path/to/local/data \
--model_name_or_path path/to/codegen-16B-multi \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 1 \
--max_seq_len 2048 \
--learning_rate 1e-4 \
--weight_decay 0.1 \
--num_train_epochs 5 \
--gradient_accumulation_steps 1 \
--lr_scheduler_type cosine \
--num_warmup_steps 0 \
--seed 1234 \
--only_optimize_lora \
--zero_stage $ZERO_STAGE \
--lora_dim 128 \
--lora_module_name decoder.layers. \
--deepspeed \
--output_dir $OUTPUT
error message
Traceback (most recent call last):
  File "/mnt/data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 328, in <module>
    main()
  File "/mnt/data/DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/main.py", line 273, in main
    model, optimizer, _, lr_scheduler = deepspeed.initialize(
  File "/mnt/data/anaconda3/envs/ds-chat/lib/python3.9/site-packages/deepspeed/__init__.py", line 156, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/mnt/data/anaconda3/envs/ds-chat/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 328, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/mnt/data/anaconda3/envs/ds-chat/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1187, in _configure_optimizer
    self.optimizer = self._configure_zero_optimizer(basic_optimizer)
  File "/mnt/data/anaconda3/envs/ds-chat/lib/python3.9/site-packages/deepspeed/runtime/engine.py", line 1465, in _configure_zero_optimizer
    optimizer = DeepSpeedZeroOptimizer_Stage3(
  File "/mnt/data/anaconda3/envs/ds-chat/lib/python3.9/site-packages/deepspeed/runtime/zero/stage3.py", line 133, in __init__
    self.dtype = self.optimizer.param_groups[0]['params'][0].dtype
IndexError: list index out of range
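For context on why this surfaces as an IndexError: with --only_optimize_lora, the script builds the optimizer from the LoRA parameters only, and LoRA injection selects modules by a substring match on their qualified names. A minimal sketch of that selection, assuming the substring-based matching used in DeepSpeed-Chat's LoRA utilities (the function name here is illustrative):

import torch.nn as nn

def find_lora_targets(model: nn.Module, part_module_name: str):
    # Only nn.Linear modules whose qualified name contains
    # part_module_name get converted to LoRA layers.
    return [name for name, module in model.named_modules()
            if isinstance(module, nn.Linear) and part_module_name in name]

If "decoder.layers." matches nothing (CodeGen and GPT-NeoX do not name their blocks that way), no LoRA parameters are ever created, the optimizer's param group is empty, and ZeRO stage 3 fails on param_groups[0]['params'][0] exactly as in the traceback.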
Same error with a different setup:
gpu: 8*A100 40G
pytorch: 2.0.0
cuda version: 11.7
deepspeed: 0.9.0+0b5252b
transformers: 4.28.0.dev
deepspeed main.py \
    --data_path BelleGroup/train_1M_CN \
    --model_name_or_path gpt-neox-20b/ \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --max_seq_len 512 \
    --learning_rate 9.65e-5 \
    --weight_decay 0.1 \
    --num_train_epochs 2 \
    --gradient_accumulation_steps 1 \
    --lr_scheduler_type cosine \
    --num_warmup_steps 0 \
    --seed 1234 \
    --lora_dim 128 \
    --only_optimize_lora \
    --zero_stage 3 \
    --deepspeed \
    --output_dir $OUTPUT_PATH
Same error here.
I got the same error
The script works for OPT models but does not work for other models. I guess it has something to do with the model format.
I found the solution. Basically you have to change --lora_module_name decoder.layers. to the appropriate name for your model, for example, --lora_module_name h. for BLOOM and GPT-Neo.
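A quick way to sanity-check a candidate prefix before launching (a sketch; the checkpoint path is a placeholder, and this loads the full model just to enumerate names):

from transformers import AutoModel

# Placeholder path: substitute your local checkpoint.
model = AutoModel.from_pretrained("path/to/your-model")

prefix = "h."  # candidate --lora_module_name value
hits = [name for name, _ in model.named_modules() if prefix in name]
print(f"{len(hits)} modules match {prefix!r}")

If the count is zero, training will fail with exactly the IndexError shown above.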
Thanks for your suggestion! Do you know what the lora_module_name is for the LLaMA model?
Thank you @puyuanOT :). Yes, the LoRA replacement is based on the model architecture (or the module names).
You can run this snippet:

from transformers import AutoModel

model = AutoModel.from_pretrained("llama-7b-zpn")
for name, module in model.named_modules():
    print(name)

You will see that the names start with layers., so passing --lora_module_name layers. works.
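One caveat, hedged: the prefix you see depends on how the model is loaded. AutoModel yields names like layers.0.self_attn.q_proj, while loading through the causal-LM head nests them one level deeper. If the matching is substring-based as sketched above, layers. should hit in both cases. For example (same checkpoint path as in the snippet above):

from transformers import AutoModelForCausalLM

# The causal-LM wrapper nests LLaMA's blocks under "model.",
# so names print as e.g. model.layers.0.self_attn.q_proj.
model = AutoModelForCausalLM.from_pretrained("llama-7b-zpn")
for name, _ in model.named_modules():
    print(name)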
Yup, that's what I did. It runs fine now. Thanks! :D
nice