
StopIteration during SFT

Open AIR-hl opened this issue 9 months ago • 3 comments

Reminder

  • [X] I have read the README and searched the existing issues.

Reproduction

Training fails to start from the command line on both my custom training set and my custom validation set. (screenshot)

The dataset format is as follows: (screenshot)
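The screenshot of the dataset itself is not preserved; judging from the dataset_info.json entries below (role/content tags, user/assistant roles), a record in the sharegpt format should look roughly like this (the content shown is illustrative only, not taken from the actual file):

  {
    "messages": [
      {"role": "user", "content": "..."},
      {"role": "assistant", "content": "..."}
    ]
  }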

The relevant entries in dataset_info.json are:

  "hh-rlhf-chosen-train": {
    "file_name": "hh-rlhf-chosen-train.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant",
      "system_tag": "system"
    }
  },
  "hh-rlhf-chosen-test": {
    "file_name": "hh-rlhf-chosen-test.json",
    "formatting": "sharegpt",
    "columns": {
      "messages": "messages"
    },
    "tags": {
      "role_tag": "role",
      "content_tag": "content",
      "user_tag": "user",
      "assistant_tag": "assistant",
      "system_tag": "system"
    }
  },

SFT的yaml配置文件:

# model
model_name_or_path: ../phi-1.5

# method
stage: sft
do_train: true
do_eval: true
finetuning_type: lora
lora_rank: 16
lora_target: q_proj,v_proj

# dataset
dataset: hh-rlhf-chosen-train
template: phi
cutoff_len: 1024
max_samples: -1
val_size: 0
overwrite_cache: true
preprocessing_num_workers: 16

# output
output_dir: saves/phi-1.5/sft
save_steps: 100
save_total_limit: 10
plot_loss: true
overwrite_output_dir: true

# train
pure_bf16: true
per_device_train_batch_size: 4
flash_attn: fa2
gradient_accumulation_steps: 8
gradient_checkpointing: true
learning_rate: 0.00005
lr_scheduler_type: cosine
num_train_epochs: 1.0
warmup_steps: 100


# eval
per_device_eval_batch_size: 2
evaluation_strategy: steps
eval_steps: 500
eval_accumulation_steps: 2
bf16_full_eval: true

# log
logging_first_step: true
logging_steps: 5
report_to: tensorboard

The command executed:

CUDA_VISIBLE_DEVICES=0 llamafactory-cli train phi-1.5_lora_sft_hh.yaml

Expected behavior

No response

System Info

No response

Others

Training does start normally via the webui.

AIR-hl avatar May 10 '24 12:05 AIR-hl

Is your datasets library up to date, and does the dataset contain at least 50 examples?

hiyouga avatar May 10 '24 12:05 hiyouga

Is your datasets library up to date, and does the dataset contain at least 50 examples?

datasets is the latest version, 2.19.1, and the dataset has 80k+ examples.

AIR-hl avatar May 10 '24 12:05 AIR-hl

max_samples: 10000000

hiyouga avatar May 10 '24 13:05 hiyouga
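Spelling the suggestion out: max_samples appears to be used as a positive cap on the number of examples taken from the dataset, so -1 presumably truncates it to an empty set and the first fetch raises StopIteration. The fix is to replace the -1 in the # dataset section with a cap at least as large as the dataset:

# dataset
max_samples: 10000000   # any value >= the dataset size; -1 is not a "no limit" sentinel here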

max_samples: 10000000

After adding that, the StopIteration no longer occurs, but now I get ValueError: Target modules {'q_proj', 'v_proj'} not found in the base model. Please check the target modules and try again. (screenshot)

It is the same after swapping phi-1.5 for phi-2, and the README does list q_proj, v_proj for Phi. (screenshot)


One more question: after running the training command, the terminal prints the input_ids, inputs, etc. of a single example. Is that normal? (screenshot)

AIR-hl avatar May 10 '24 15:05 AIR-hl

That is normal. You may need to update your Phi model files, or switch to lora_target: all.

hiyouga avatar May 10 '24 16:05 hiyouga

That is normal. You may need to update your Phi model files, or switch to lora_target: all.

Thank you very much! May I ask what you mean by updating the model files? I am currently using the official microsoft/phi-2 and microsoft/phi-1_5. Thanks again!

AIR-hl avatar May 10 '24 16:05 AIR-hl

I recommend going with the lora_target change.

hiyouga avatar May 10 '24 16:05 hiyouga
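Concretely, the # method section of the YAML above would change to something like:

# method
finetuning_type: lora
lora_rank: 16
lora_target: all   # resolve all linear modules automatically instead of naming q_proj,v_proj

With lora_target: all, LLaMA-Factory picks the target module names from the model itself, which sidesteps mismatches between the config and the layer names of a particular model revision.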

I recommend going with the lora_target change.

Thanks! Changing lora_target works, but then I run into ValueError: PhiForCausalLM does not support gradient checkpointing. and ValueError: PhiForCausalLM does not support Flash Attention 2.0 yet.

My environment is fully up to date, so I will try another model for now. Do get some rest!

AIR-hl avatar May 10 '24 16:05 AIR-hl

flash_attn: auto
gradient_checkpointing: false

hiyouga avatar May 11 '24 14:05 hiyouga
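Applied to the # train section of the config above, those two changes would presumably look like:

# train
pure_bf16: true
per_device_train_batch_size: 4
flash_attn: auto              # pick whatever attention backend this Phi revision supports
gradient_accumulation_steps: 8
gradient_checkpointing: false # this PhiForCausalLM revision rejects gradient checkpointing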

flash_attn: auto
gradient_checkpointing: false

A small follow-up request: when I train a LoRA model from the command line, the checkpoints are saved under a custom path. But when I try to load a checkpoint's adapter in the webui Chat tab, I cannot use that custom path, because the UI prepends a fixed path segment. I had to move the folder containing the checkpoints to the expected Gemma/lora path. Could this logic be adjusted to drop the fixed prefix, or simply changed to saves? (screenshots)

AIR-hl avatar May 11 '24 14:05 AIR-hl
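For reference, reconstructed from the comment above, the layout the webui Chat tab ended up requiring was apparently the following (the checkpoint folder names are illustrative):

saves/Gemma/lora/
  checkpoint-100/
  checkpoint-200/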

In that case, I suggest using webchat rather than the webui.

hiyouga avatar May 11 '24 16:05 hiyouga