
StopIteration during SFT

Open AIR-hl opened this issue 9 months ago • 3 comments

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

Training fails to start from the command line on both my custom training set and my custom validation set. (screenshot)

The dataset format is as follows: (screenshot)
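The screenshot of the dataset itself is not preserved; judging from the dataset_info.json entries below (role/content tags, user/assistant roles), a record in the sharegpt format should look roughly like this (the content shown is illustrative only, not taken from the actual file):

  {
    "messages": [
      {"role": "user", "content": "..."},
      {"role": "assistant", "content": "..."}
    ]
  }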

The relevant entries in dataset_info.json are:

  "hh-rlhf-chosen-train": {
    "file_name": "hh-rlhf-chosen-train.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant",
      "system_tag": "system"
    }
  },
  "hh-rlhf-chosen-test": {
    "file_name": "hh-rlhf-chosen-test.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant",
      "system_tag": "system"
    }
  },

SFT的yaml配置文件:

# model
model_name_or_path: ../phi-1.5

# method
stage: sft
do_train: true
do_eval: true
finetuning_type: lora
lora_rank: 16
lora_target: q_proj,v_proj

# dataset
dataset: hh-rlhf-chosen-train
template: phi
cutoff_len: 1024
max_samples: -1
val_size: 0
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/phi-1.5/sft
save_steps: 100
save_total_limit: 10
plot_loss: true
overwrite_output_dir: true

# train
pure_bf16: true
per_device_train_batch_size: 4
flash_attn: fa2
gradient_accumulation_steps: 8
gradient_checkpointing: true
learning_rate: 0.00005
lr_scheduler_type: cosine
num_train_epochs: 1.0
warmup_steps: 100


# eval
per_device_eval_batch_size: 2
evaluation_strategy: steps
eval_steps: 500
eval_accumulation_steps: 2
bf16_full_eval: true

# log
logging_first_step: true
logging_steps: 5
report_to: tensorboard

The command executed:

CUDA_VISIBLE_DEVICES=0 llamafactory-cli train phi-1.5_lora_sft_hh.yaml

Expected behavior

No response

System Info

No response

Others

Training does start normally via the webui.

AIR-hl avatar May 10 '24 12:05 AIR-hl

Is your datasets library up to date, and does the dataset contain at least 50 examples?

hiyouga avatar May 10 '24 12:05 hiyouga

Is your datasets library up to date, and does the dataset contain at least 50 examples?

datasets is the latest version, 2.19.1, and the dataset has 80k+ examples.

AIR-hl avatar May 10 '24 12:05 AIR-hl

max_samples: 10000000

hiyouga avatar May 10 '24 13:05 hiyouga
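Spelling the suggestion out: max_samples appears to be used as a positive cap on the number of examples taken from the dataset, so -1 presumably truncates it to an empty set and the first fetch raises StopIteration. The fix is to replace the -1 in the # dataset section with a cap at least as large as the dataset:

# dataset
max_samples: 10000000   # any value >= the dataset size; -1 is not a "no limit" sentinel here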

max_samples: 10000000

After adding that, the StopIteration no longer occurs, but now I get ValueError: Target modules {'q_proj', 'v_proj'} not found in the base model. Please check the target modules and try again. (screenshot)

It is the same after swapping phi-1.5 for phi-2, and the README does list q_proj, v_proj for Phi. (screenshot)


One more question: after running the training command, the terminal prints the input_ids, inputs, etc. of a single example. Is that normal? (screenshot)

AIR-hl avatar May 10 '24 15:05 AIR-hl

That is normal. You may need to update your Phi model files, or switch to lora_target: all.

hiyouga avatar May 10 '24 16:05 hiyouga

That is normal. You may need to update your Phi model files, or switch to lora_target: all.

Thank you very much! May I ask what you mean by updating the model files? I am currently using the official microsoft/phi-2 and microsoft/phi-1_5. Thanks again!

AIR-hl avatar May 10 '24 16:05 AIR-hl

I recommend going with the lora_target change.

hiyouga avatar May 10 '24 16:05 hiyouga
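Concretely, the # method section of the YAML above would change to something like:

# method
finetuning_type: lora
lora_rank: 16
lora_target: all   # resolve all linear modules automatically instead of naming q_proj,v_proj

With lora_target: all, LLaMA-Factory picks the target module names from the model itself, which sidesteps mismatches between the config and the layer names of a particular model revision.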

I recommend going with the lora_target change.

Thanks! Changing lora_target works, but then I run into ValueError: PhiForCausalLM does not support gradient checkpointing. and ValueError: PhiForCausalLM does not support Flash Attention 2.0 yet.

My environment is fully up to date, so I will try another model for now. Do get some rest!

AIR-hl avatar May 10 '24 16:05 AIR-hl

flash_attn: auto
gradient_checkpointing: false

hiyouga avatar May 11 '24 14:05 hiyouga
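Applied to the # train section of the config above, those two changes would presumably look like:

# train
pure_bf16: true
per_device_train_batch_size: 4
flash_attn: auto              # pick whatever attention backend this Phi revision supports
gradient_accumulation_steps: 8
gradient_checkpointing: false # this PhiForCausalLM revision rejects gradient checkpointing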

flash_attn: auto
gradient_checkpointing: false

A small follow-up request: when I train a LoRA model from the command line, the checkpoints are saved under a custom path. But when I try to load a checkpoint's adapter in the webui Chat tab, I cannot use that custom path, because the UI prepends a fixed path segment. I had to move the folder containing the checkpoints to the expected Gemma/lora path. Could this logic be adjusted to drop the fixed prefix, or simply changed to saves? (screenshots)

AIR-hl avatar May 11 '24 14:05 AIR-hl
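For reference, reconstructed from the comment above, the layout the webui Chat tab ended up requiring was apparently the following (the checkpoint folder names are illustrative):

saves/Gemma/lora/
  checkpoint-100/
  checkpoint-200/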

In that case, I suggest using webchat rather than the webui.

hiyouga avatar May 11 '24 16:05 hiyouga