LLaMA-Factory: evaluation problem after LoRA training of Qwen1.5-14B-Chat

Reminder

[X] I have read the README and searched the existing issues.

Reproduction

deepspeed --include localhost:0,1 --master_port=9901 ./src/train_bash.py \
    --deepspeed ds_3.json \
    --stage sft \
    --model_name_or_path /opt/llm/models/Qwen15-14B-Chat \
    --do_predict \
    --dataset train_all \
    --template qwen \
    --finetuning_type lora \
    --adapter_name_or_path ./output_qw/checkpoint-200 \
    --output_dir ./output_qw/finall_eval \
    --per_device_eval_batch_size 1 \
    --max_samples 50 \
    --predict_with_generate \
    --cutoff_len 9000 \
    --max_new_tokens 9000 \
    --seed 42 \
    --fp16

This is the DeepSpeed config (ds_3.json):

{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": "auto",
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "zero_optimization": {
    "stage": 3,
    "contiguous_gradients": true,
    "stage3_max_live_parameters": 1e9,
    "stage3_max_reuse_distance": 1e9,
    "stage3_prefetch_bucket_size": 1e9,
    "stage3_param_persistence_threshold": "auto",
    "reduce_bucket_size": 5e8,
    "sub_group_size": 1e9,
    "overlap_comm": true,
    "stage3_gather_16bit_weights_on_model_save": true,
    "offload_optimizer": { "device": "cpu" },
    "offload_param": { "device": "cpu" }
  }
}
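To double-check what a config like this asks DeepSpeed to do, a minimal Python sketch can load the JSON and report the ZeRO stage and the offload targets. The helper name `summarize_zero` is hypothetical (it is not part of DeepSpeed or LLaMA-Factory), and the inline JSON only mirrors the relevant keys of the config above.

```python
import json


def summarize_zero(config: dict) -> dict:
    """Return the ZeRO stage and offload devices from a DeepSpeed config dict.

    Hypothetical helper for illustration only; missing keys fall back to
    stage 0 and "none".
    """
    zero = config.get("zero_optimization", {})
    return {
        "stage": zero.get("stage", 0),
        "offload_param": zero.get("offload_param", {}).get("device", "none"),
        "offload_optimizer": zero.get("offload_optimizer", {}).get("device", "none"),
    }


# Only the keys relevant to this check, copied from the config in the issue.
CONFIG_TEXT = """
{
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {"device": "cpu"},
    "offload_param": {"device": "cpu"}
  }
}
"""

summary = summarize_zero(json.loads(CONFIG_TEXT))
print(summary)
```

With `offload_param` set to `cpu`, parameters are kept in host memory and fetched for each forward pass, so token-by-token generation may be much slower than one would expect from training speed. That is worth keeping in mind when judging whether a run is hung or merely slow.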

Expected behavior

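After finishing LoRA training with DeepSpeed, I want to use do_predict to evaluate the model on my own training set. Launching the evaluation with plain python runs out of GPU memory, so I am trying DeepSpeed with ZeRO-3 to load the model and run the evaluation. The model now loads under DeepSpeed, but the evaluation appears to hang and makes no progress. It has been 40 minutes and there is still no output. Could you help me figure out what is wrong?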
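Before concluding that the run is hung, it may help to estimate how much work the flags in the reproduction command can request. The sketch below is a rough upper bound only: it assumes the 50 samples are split evenly across the two GPUs listed in `--include localhost:0,1` and that every generation runs all the way to `--max_new_tokens`. The function name `worst_case_decode_steps` is hypothetical.

```python
import math


def worst_case_decode_steps(max_samples: int, num_gpus: int,
                            per_device_batch: int, max_new_tokens: int) -> int:
    """Upper bound on sequential decode steps each GPU may have to run.

    Assumes an even split of samples across GPUs and that no sequence
    stops early at an end-of-sequence token.
    """
    batches_per_gpu = math.ceil(max_samples / (num_gpus * per_device_batch))
    return batches_per_gpu * max_new_tokens


# Values taken from the command in the issue.
steps = worst_case_decode_steps(max_samples=50, num_gpus=2,
                                per_device_batch=1, max_new_tokens=9000)
print(steps)  # 225000
```

Each of those steps is a full forward pass, and under ZeRO-3 every pass has to gather the sharded parameters again. Lowering `--max_new_tokens` or `--max_samples` for a first smoke test is one way to tell a very slow run from a hung one.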

System Info

transformers version: 4.39.1
Python version: 3.10.0
PyTorch version (GPU?): 2.1.2+cu118 (True)
deepspeed version: 0.14.0

Others

No response

Apr 01 '24 02:04 Qiang-HU

Try setting --fp16_full_eval True

Apr 01 '24 09:04 hiyouga

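@hiyouga I ran into the same problem when training Qwen1.5-MoE. After training starts there is no output and no error; it just hangs. At first I thought it was a bug where the loss was not being printed, but after running for more than 20 hours there is still no result. I am using PyTorch 2.0.1+cu118, and transformers is the latest 4.40.0 installed from GitHub. I am not sure whether this is a support issue in the transformers library or possibly a problem in the framework.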
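For hangs like this that produce neither output nor errors, one generic way to see where each process is stuck is Python's standard `faulthandler` module, which can dump every thread's stack trace at a fixed interval. This is a minimal sketch, not something LLaMA-Factory provides: the helper names `enable_hang_dump` and `disable_hang_dump` are hypothetical, and placing the call near the top of the training entry script is an assumption.

```python
import faulthandler
import sys


def enable_hang_dump(timeout_s: float = 600.0, file=None) -> None:
    """Dump all thread tracebacks every `timeout_s` seconds until cancelled.

    `file` must be an open file object with a real file descriptor and must
    stay open while the dump is armed; it defaults to sys.stderr.
    """
    if file is None:
        file = sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)


def disable_hang_dump() -> None:
    """Cancel the periodic traceback dump armed by enable_hang_dump()."""
    faulthandler.cancel_dump_traceback_later()


# Example (hypothetical placement): call enable_hang_dump(600) once at the
# start of the entry script, and disable_hang_dump() when the run finishes.
```

If the repeated dumps keep showing the same frame, for example a collective communication call, that suggests a real hang rather than slow progress.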