Implemented LoRA fine-tuning for the baichuan-7B model

Open hiyouga opened this issue 2 years ago • 101 comments

SFT and RLHF pipelines for Alpaca-style instruction datasets are supported: https://github.com/hiyouga/LLaMA-Efficient-Tuning

LoRA fine-tuning runs on a single RTX 3090 GPU, and the QLoRA method is also supported (12 GB of VRAM at minimum).
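
In substance, the QLoRA route loads the base model in 4-bit and attaches LoRA adapters to baichuan's fused W_pack projection. Below is a minimal sketch of that idea using transformers, bitsandbytes and peft directly rather than the repository's own code path; recent peft/bitsandbytes versions are assumed, and the lora_alpha/lora_dropout values are illustrative.

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load baichuan-7B in 4-bit NF4 (QLoRA-style quantization of the frozen base model).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    "baichuan-inc/baichuan-7B",   # or a local model folder
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training and attach LoRA to W_pack,
# the fused QKV projection (the same --lora_target as in the command below).
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["W_pack"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA matrices are trainable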

LoRA weights of the fine-tuned model: https://huggingface.co/hiyouga/baichuan-7b-sft
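
To try the released adapter directly, it can be attached to the base model with peft. A minimal inference sketch, assuming transformers, peft and accelerate are installed; the prompt wrapping is only an approximation of the ziya template discussed later in this thread.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = "baichuan-inc/baichuan-7B"      # or a local model folder
adapter = "hiyouga/baichuan-7b-sft"    # LoRA weights linked above

tokenizer = AutoTokenizer.from_pretrained(base, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base, trust_remote_code=True, torch_dtype=torch.float16, device_map="auto"
)
model = PeftModel.from_pretrained(model, adapter)  # attach the LoRA adapter
model.eval()

# Approximate ziya-style prompt; see the --prompt_template discussion below.
prompt = "<human>:你好,请介绍一下你自己\n<bot>:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))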

Run the following command to perform instruction tuning on the Alpaca dataset:

CUDA_VISIBLE_DEVICES=0 python src/train_sft.py \
    --model_name_or_path <path to the baichuan-7B folder or Hugging Face model ID> \
    --do_train \
    --dataset alpaca_gpt4_zh \
    --finetuning_type lora \
    --lora_rank 8 \
    --lora_target W_pack \
    --output_dir alpaca_baichuan \
    --per_device_train_batch_size 4 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 8 \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --save_steps 100 \
    --eval_steps 100 \
    --learning_rate 5e-5 \
    --max_grad_norm 0.5 \
    --num_train_epochs 3.0 \
    --dev_ratio 0.01 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --plot_loss \
    --fp16

Example screenshot of the training run: [screenshot]

Example conversation after LoRA instruction tuning: [screenshot]

hiyouga avatar Jun 15 '23 08:06 hiyouga

Awesome, that was fast!

Chenzongchao avatar Jun 15 '23 08:06 Chenzongchao

Awesome!

SMR-S avatar Jun 15 '23 08:06 SMR-S

Impressive work!

70557dzqc avatar Jun 15 '23 08:06 70557dzqc

Is there a sample format for the fine-tuning dataset?

GalSang17 avatar Jun 15 '23 08:06 GalSang17

@GalSang17 It ships with the project; see the data folder for the example format.

hiyouga avatar Jun 15 '23 08:06 hiyouga

@GalSang17 It ships with the project; see the data folder for the example format.

Thanks!

GalSang17 avatar Jun 15 '23 08:06 GalSang17

Nice 👍🏻

suncheng-s avatar Jun 15 '23 08:06 suncheng-s

@hiyouga Did you not run into this error?

../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [51,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [52,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [53,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [54,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [55,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [56,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [57,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [58,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
../aten/src/ATen/native/cuda/IndexKernel.cu:92: operator(): block: [124,0,0], thread: [59,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

bytes-lost avatar Jun 15 '23 09:06 bytes-lost

@bytes-lost What is the full error message, and which line of code triggers it?

hiyouga avatar Jun 15 '23 09:06 hiyouga

@hiyouga

[INFO|trainer.py:622] 2023-06-15 17:12:03,926 >> Using cuda_amp half precision backend
[INFO|trainer.py:1779] 2023-06-15 17:12:03,933 >> ***** Running training *****
[INFO|trainer.py:1780] 2023-06-15 17:12:03,934 >>   Num examples = 48,329
[INFO|trainer.py:1781] 2023-06-15 17:12:03,934 >>   Num Epochs = 3
[INFO|trainer.py:1782] 2023-06-15 17:12:03,934 >>   Instantaneous batch size per device = 4
[INFO|trainer.py:1783] 2023-06-15 17:12:03,934 >>   Total train batch size (w. parallel, distributed & accumulation) = 32
[INFO|trainer.py:1784] 2023-06-15 17:12:03,934 >>   Gradient Accumulation steps = 8
[INFO|trainer.py:1785] 2023-06-15 17:12:03,934 >>   Total optimization steps = 4,530
[INFO|trainer.py:1786] 2023-06-15 17:12:03,935 >>   Number of trainable parameters = 4,194,304

0%|          | 0/4530 [00:00<?, ?it/s]
  0%|          | 1/4530 [00:04<5:45:55,  4.58s/it]
  0%|          | 2/4530 [00:07<4:42:43,  3.75s/it]Traceback (most recent call last):
  File "/mnt/data/user/LLaMA-Efficient-Tuning/src/train_sft.py", line 97, in <module>
    main()
  File "/mnt/data/user/LLaMA-Efficient-Tuning/src/train_sft.py", line 69, in main
    train_result = trainer.train()
  File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/transformers/trainer.py", line 1664, in train
    return inner_training_loop(
  File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/transformers/trainer.py", line 1940, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/transformers/trainer.py", line 2735, in training_step
    loss = self.compute_loss(model, inputs)
  File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/transformers/trainer.py", line 2767, in compute_loss
    outputs = model(**inputs)
  File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/peft/peft_model.py", line 678, in forward
    return self.base_model(
  File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/baichuan-7b/modeling_baichuan.py", line 617, in forward
    outputs = self.model(
  File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/baichuan-7b/modeling_baichuan.py", line 501, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 89, in forward
    ctx.fwd_gpu_devices, ctx.fwd_gpu_states = get_device_states(*args)
  File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 50, in get_device_states
    fwd_gpu_states.append(torch.cuda.get_rng_state())
  File "/mnt/data/anaconda3/envs/llama/lib/python3.9/site-packages/torch/cuda/random.py", line 31, in get_rng_state
    return default_generator.get_state()
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

bytes-lost avatar Jun 15 '23 09:06 bytes-lost

@bytes-lost It looks like an index went out of bounds. I manually set pad_token_id to 0 when loading the tokenizer; please check whether you did the same on your side. The input sequences must not contain any token ID greater than or equal to 64000.
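
A quick sanity check along those lines, as an illustrative sketch (the sample sentence is arbitrary; 64000 is the baichuan-7B vocabulary size mentioned above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("baichuan-inc/baichuan-7B", trust_remote_code=True)
if tokenizer.pad_token_id is None:
    tokenizer.pad_token_id = 0  # the fix discussed in this thread

vocab_size = 64000  # baichuan-7B vocabulary size
sample_ids = tokenizer("测试一条样本", padding="max_length", max_length=16)["input_ids"]
assert all(0 <= i < vocab_size for i in sample_ids), "token id out of bounds"
print("pad_token_id =", tokenizer.pad_token_id, "| max id =", max(sample_ids))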

hiyouga avatar Jun 15 '23 09:06 hiyouga

@hiyouga I added a line here in train_sft.py, but I still get the same error:

model, tokenizer = load_pretrained(model_args, finetuning_args, training_args.do_train, stage="sft")
tokenizer.pad_token_id = 0  # set pad_token_id explicitly
dataset = preprocess_data(dataset, tokenizer, data_args, training_args, stage="sft")

bytes-lost avatar Jun 15 '23 09:06 bytes-lost

@bytes-lost It looks like something went wrong in torch's gradient checkpointing, which may be related to your local torch and CUDA environment. I tested it several times on my side without any problem.

hiyouga avatar Jun 15 '23 09:06 hiyouga

@bytes-lost It looks like something went wrong in torch's gradient checkpointing, which may be related to your local torch and CUDA environment. I tested it several times on my side without any problem.

OK, I'll recreate the environment and try again. Does torch==2.0.1 work?

bytes-lost avatar Jun 15 '23 09:06 bytes-lost

On my side the model keeps talking to itself, and even for "Who are you?" it does not give the expected answer, although my fine-tuning command is exactly the same as the one provided above. [screenshot]

gebilaoman avatar Jun 15 '23 11:06 gebilaoman

@gebilaoman When launching with the project's built-in cli_demo, please add the --prompt_template ziya argument.
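
Roughly, the ziya template wraps each turn as "<human>: ... <bot>: ..."; the authoritative definition lives in the repository's template module, so the sketch below is only an approximation for illustration.

# Illustrative approximation of a ziya-style prompt; check the repository's
# template definitions for the exact format used by --prompt_template ziya.
def build_ziya_prompt(query, history=None):
    prompt = ""
    for old_query, old_response in history or []:
        prompt += "<human>:{}\n<bot>:{}\n".format(old_query, old_response)
    prompt += "<human>:{}\n<bot>:".format(query)
    return prompt

print(build_ziya_prompt("你是谁?"))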

hiyouga avatar Jun 15 '23 11:06 hiyouga

That was fast, impressive!

Xin-20 avatar Jun 15 '23 12:06 Xin-20

I have also implemented LoRA fine-tuning for baichuan-7B. The baichuan architecture matches LLaMA, so its SFT procedure is basically the same as for BLOOM/LLaMA.

Project supporting baichuan-7B fine-tuning: https://github.com/shibing624/MedicalGPT

The project also implements the full GPT training pipeline, including continued pre-training, supervised fine-tuning, reward modeling, and reinforcement learning.

Run the following command to perform instruction tuning on the BELLE dataset:

python3 supervised_finetuning.py \
    --model_type auto \
    --model_name_or_path baichuan-inc/baichuan-7B \
    --train_file_dir ./data/finetune \
    --validation_file_dir ./data/finetune \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 1 \
    --do_train \
    --do_eval \
    --use_peft True \
    --max_train_samples 1000 \
    --max_eval_samples 10 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --warmup_ratio 0.05 \
    --weight_decay 0.05 \
    --logging_strategy steps \
    --logging_steps 10 \
    --eval_steps 50 \
    --evaluation_strategy steps \
    --save_steps 500 \
    --save_strategy steps \
    --save_total_limit 3 \
    --gradient_accumulation_steps 1 \
    --preprocessing_num_workers 1 \
    --max_source_length 256 \
    --max_target_length 256 \
    --output_dir outputs-sft-baichuan-v1 \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --target_modules all \
    --lora_rank 8 \
    --lora_alpha 16 \
    --lora_dropout 0.05 \
    --fp16 \
    --torch_dtype float16 \
    --device_map auto \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True

Screenshots of the training run (loss decreases steadily): [screenshots]

Everyone is welcome to test it and verify the results.

shibing624 avatar Jun 15 '23 13:06 shibing624

@bytes-lost It looks like something went wrong in torch's gradient checkpointing, which may be related to your local torch and CUDA environment. I tested it several times on my side without any problem.

OK, I'll recreate the environment and try again. Does torch==2.0.1 work?

I had the same problem; after setting tokenizer.pad_token_id = 0 it works.

XiaofengZHOU avatar Jun 15 '23 13:06 XiaofengZHOU

[screenshot] This error comes up when I run it; do I need to modify config.json inside the model folder?

weicheng59 avatar Jun 16 '23 01:06 weicheng59

[screenshot] This error comes up when I run it; do I need to modify config.json inside the model folder?

That is not the ChatGLM codebase; use the LLaMA one: https://github.com/hiyouga/LLaMA-Efficient-Tuning

suncheng-s avatar Jun 16 '23 01:06 suncheng-s

Can multi-turn dialogue fine-tuning be done? Could you show the exact data format for multi-turn conversations? Thanks.

usun1997 avatar Jun 16 '23 02:06 usun1997

@usun1997 Multi-turn dialogue is supported; see this format reference: https://github.com/hiyouga/LLaMA-Efficient-Tuning/blob/main/data/example_dataset/examples.json

hiyouga avatar Jun 16 '23 02:06 hiyouga

@hiyouga Hi, why do we need to add the --prompt_template ziya argument when launching the project's built-in cli_demo? Why ziya? Shouldn't it be baichuan?

cristianohello avatar Jun 16 '23 02:06 cristianohello

@cristianohello Because I used the ziya template during fine-tuning 😁 @usun1997 Correct.

hiyouga avatar Jun 16 '23 02:06 hiyouga

@hiyouga Hi, thanks for the reply. I am still running into the model continuously asking and answering itself; how can this be resolved?

cristianohello avatar Jun 16 '23 03:06 cristianohello

@cristianohello The current SFT model was not trained on multi-turn dialogue, so issues occasionally show up in multi-turn conversations.

hiyouga avatar Jun 16 '23 03:06 hiyouga

@usun1997 Multi-turn dialogue is supported; see this format reference: https://github.com/hiyouga/LLaMA-Efficient-Tuning/blob/main/data/example_dataset/examples.json

Thanks. Let me lay out my understanding of the format, and please check whether it is right. Suppose my fine-tuning data contains only one conversation topic, and that conversation has three turns.

[ { "instruction": "我的最后一轮对话问题", "input": "", "output": "模型的最后一轮对话回答", "history": [ ["我的第一轮对话问题", "模型的第一轮对话回答"], ["我的第二轮对话问题", "模型的第二轮对话回答"] ] } ]

So if a dict entry in the list contains a history key, that entry is a multi-turn conversation: its top-level instruction, input, and output represent the final turn's question and answer, and history lists the earlier turns in index order. Is that right?
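
Concretely, that reading can be illustrated by flattening one such record into ordered (question, answer) turns; this is only an illustration of the format, not the repository's actual preprocessing code.

# Illustrative only: expand a record with `history` into ordered dialogue turns.
record = {
    "instruction": "my question in the final turn",
    "input": "",
    "output": "the model's answer in the final turn",
    "history": [
        ["my question in the first turn", "the model's answer in the first turn"],
        ["my question in the second turn", "the model's answer in the second turn"],
    ],
}

# Earlier turns come from `history` (in index order); instruction/output form the last turn.
turns = list(record.get("history", [])) + [[record["instruction"], record["output"]]]
for i, (query, response) in enumerate(turns, start=1):
    print("turn {}: Q={!r} -> A={!r}".format(i, query, response))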

usun1997 avatar Jun 16 '23 03:06 usun1997

@hiyouga Here is my case: [screenshot]

When I ask "Who are you?", the model keeps asking and answering itself for many turns before stopping. How can I make it give one answer per question?

cristianohello avatar Jun 16 '23 03:06 cristianohello

@cristianohello Because I used the ziya template during fine-tuning 😁 @usun1997 Correct.

OK, thanks.

usun1997 avatar Jun 16 '23 03:06 usun1997