
[BUG] LoRA finetuning error

Open Luoyang144 opened this issue 1 year ago • 2 comments

Is there an existing issue / discussion for this?

  • [X] I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

  • [X] I have searched FAQ

Current Behavior

I am LoRA-finetuning the model on a dataset I built myself, but at a certain training step the following error occurs:

 File "/xxx/.cache/huggingface/modules/transformers_modules/Qwen-VL-Chat/modeling_qwen.py", line 557, in forward
    assert (bos_pos[0] == eos_pos[0]).all()
RuntimeError: The size of tensor a (11) must match the size of tensor b (10) at non-singleton dimension 0
^M 24%|██▎       | 36/152 [22:57<1:13:57, 38.25s/it]```
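For context, line 557 of modeling_qwen.py pairs every image-start special token in the batch with an image-end token before splicing in the visual features, and the assertion fires when the counts disagree. A minimal, hedged reproduction of that size mismatch (the token ID below is an assumption; Qwen-VL reads it from `config.visual["image_start_id"]` and, as far as the published code shows, derives the end ID as start + 1):

```python
import torch

IMAGE_START_ID = 151857            # assumed value; check config.visual["image_start_id"]
IMAGE_END_ID = IMAGE_START_ID + 1  # the end ID appears to be derived as start + 1

# A batch where truncation cut off the closing token of the second image:
input_ids = torch.tensor([[11, IMAGE_START_ID, 22, IMAGE_END_ID,
                           33, IMAGE_START_ID, 44, 55]])

bos_pos = torch.where(input_ids == IMAGE_START_ID)  # two start positions
eos_pos = torch.where(input_ids == IMAGE_END_ID)    # only one end position
# RuntimeError: The size of tensor a (2) must match the size of tensor b (1)
# at non-singleton dimension 0 -- the same shape mismatch as in the traceback.
assert (bos_pos[0] == eos_pos[0]).all()
```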

Expected Behavior

Training completes and the model is saved.

Steps To Reproduce

The environment was installed according to requirements. The script is as follows:

```bash
#!/bin/bash
export CUDA_DEVICE_MAX_CONNECTIONS=1
DIR=$(pwd)

MODEL="../../LLMs/Qwen/Qwen-VL-Chat" # "Qwen/Qwen-VL-Chat" / "Qwen/Qwen-VL"; set the path if you do not want to load from huggingface directly

# ATTENTION: specify the path to your training data, which should be a json file
# consisting of a list of conversations.
# See the section for finetuning in README for more information.
#DATA="data/debate/v4/filter_v4_3000_a2r3_CoD.json"
#SAVE_DIT="checkpoints/lora/CoD"
DATA=$1
SAVE_DIT=$2

export CUDA_VISIBLE_DEVICES=0

python finetune.py \
  --model_name_or_path $MODEL \
  --data_path $DATA \
  --bf16 True \
  --fix_vit True \
  --output_dir $SAVE_DIT \
  --num_train_epochs 2 \
  --per_device_train_batch_size 4 \
  --per_device_eval_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --evaluation_strategy "no" \
  --save_strategy "steps" \
  --save_steps 100 \
  --save_total_limit 10 \
  --learning_rate 1e-5 \
  --weight_decay 0.1 \
  --adam_beta2 0.95 \
  --warmup_ratio 0.01 \
  --lr_scheduler_type "cosine" \
  --logging_steps 1 \
  --report_to "none" \
  --model_max_length 2048 \
  --lazy_preprocess True \
  --gradient_checkpointing \
  --use_lora
```


Environment

```Markdown
- OS: Ubuntu
- Python: 3.9
- Transformers: 4.32.0
- PyTorch: 2.2.0+cu121
- CUDA: 12.2
```

Anything else?

No response

Luoyang144 commented Feb 06 '24

The image is probably placed after the prompt and gets truncated: the model can only see the special token marking the start of the image, not the one marking its end, so the two counts disagree and the assertion fails.
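If that is the cause, the broken samples can be caught before training. A minimal pre-flight sketch, not the repo's own tooling: the token IDs are assumptions (read them from the loaded model's `config.visual` instead), and `render_conversation` is a hypothetical stand-in for however you flatten one training sample to text:

```python
from transformers import AutoTokenizer

MODEL = "Qwen/Qwen-VL-Chat"
MAX_LEN = 2048            # matches --model_max_length in the script above
IMAGE_START_ID = 151857   # assumed; use model.config.visual["image_start_id"]
IMAGE_END_ID = IMAGE_START_ID + 1

tokenizer = AutoTokenizer.from_pretrained(MODEL, trust_remote_code=True)

def image_tokens_balanced(text: str) -> bool:
    """True if every image-start token keeps its end token after truncation."""
    ids = tokenizer(text, truncation=True, max_length=MAX_LEN)["input_ids"]
    return ids.count(IMAGE_START_ID) == ids.count(IMAGE_END_ID)

# Filter the training JSON before finetune.py ever sees it
# (render_conversation is hypothetical -- use your own sample-to-text step):
# samples = [s for s in samples if image_tokens_balanced(render_conversation(s))]
```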

buaacoder commented Mar 13 '24

I ran into the same problem, but it stopped happening once I reduced the number of images per conversation turn (from 30-odd down to six). Could this be an image-count limit?
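Those numbers are consistent with truncation rather than a hard image limit. A back-of-envelope sketch, assuming the 256 visual tokens per image reported in the Qwen-VL paper plus two delimiter tokens, and the `--model_max_length 2048` from the script above:

```python
TOKENS_PER_IMAGE = 256 + 2   # assumed: 256 visual tokens + <img>/</img> delimiters
MODEL_MAX_LENGTH = 2048      # --model_max_length from the script above

for n_images in (6, 30):
    used = n_images * TOKENS_PER_IMAGE
    verdict = "fits in" if used < MODEL_MAX_LENGTH else "exceeds"
    print(f"{n_images:>2} images -> {used} tokens ({verdict} {MODEL_MAX_LENGTH})")
# 6 images (~1548 tokens) leave room for text; 30 images (~7740 tokens) alone
# already overflow the window, so truncation must cut through an image span.
```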

L1NINE commented Aug 26 '24