
Does the --max_length parameter fail to take effect on custom datasets?

Open kratorado opened this issue 1 year ago • 8 comments

When training with a custom dataset, I found that GPU memory would occasionally blow up (OOM). Only after manually removing the longer samples did training run normally. But I had passed the --max_length parameter, and it does not seem to take effect. Does --max_length not apply to custom datasets? If it indeed doesn't, could this option be added?

kratorado avatar Jul 04 '24 12:07 kratorado

By "custom dataset", do you mean something passed as --dataset {local_path}?

Jintao-Huang avatar Jul 04 '24 13:07 Jintao-Huang

> By "custom dataset", do you mean something passed as --dataset {local_path}?

Yes.

kratorado avatar Jul 04 '24 15:07 kratorado
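For context, a "custom dataset" here is just a local JSONL file passed via --dataset. A minimal sketch of what one line of such a file might look like is below; the field names ("query"/"response") are an assumption, so check the ms-swift custom-dataset documentation for the exact schema your version expects.

```python
# Hypothetical single line of a custom JSONL dataset.
# Field names ("query"/"response") are an assumption, not the confirmed ms-swift schema.
import json

sample = {"query": "Summarize the following report ...", "response": "The report covers ..."}
with open("dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```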

> When training with a custom dataset, I found that GPU memory would occasionally blow up (OOM). Only after manually removing the longer samples did training run normally. But I had passed the --max_length parameter, and it does not seem to take effect. Does --max_length not apply to custom datasets? If it indeed doesn't, could this option be added?

Hi, I have run into this problem as well. How did you solve it?

fly-dragon211 avatar Jul 07 '24 15:07 fly-dragon211

> When training with a custom dataset, I found that GPU memory would occasionally blow up (OOM). Only after manually removing the longer samples did training run normally. But I had passed the --max_length parameter, and it does not seem to take effect. Does --max_length not apply to custom datasets? If it indeed doesn't, could this option be added?
>
> Hi, I have run into this problem as well. How did you solve it?

By filtering manually.

kratorado avatar Jul 08 '24 02:07 kratorado
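For anyone hitting the same OOM, here is a minimal sketch of that manual filtering: measure each sample's tokenized length and drop the ones over a limit. File paths, field names, and the query+response concatenation are assumptions, not the exact preprocessing ms-swift applies.

```python
# Minimal sketch: drop samples whose tokenized length exceeds a limit before training.
# Field names, paths, and the length approximation are assumptions.
import json
from transformers import AutoTokenizer

MAX_LENGTH = 2048
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)

kept = []
with open("dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        # Rough estimate: the real chat template adds extra tokens, so keep some headroom.
        text = sample["query"] + sample["response"]
        if len(tokenizer(text)["input_ids"]) <= MAX_LENGTH:
            kept.append(line)

with open("dataset.filtered.jsonl", "w", encoding="utf-8") as f:
    f.writelines(kept)

print(f"kept {len(kept)} samples")
```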

In my tests here, plain LLMs are fine. Are you using a multimodal LLM?

Jintao-Huang avatar Jul 08 '24 04:07 Jintao-Huang

> In my tests here, plain LLMs are fine. Are you using a multimodal LLM?

glm4-9b-chat

kratorado avatar Jul 08 '24 05:07 kratorado

In that case, samples exceeding max_length should be dropped. During training, the dataset statistics are printed in the command-line output; could you find them and take a look?

Jintao-Huang avatar Jul 08 '24 11:07 Jintao-Huang

I fine-tuned InternVL2-2B on a custom dataset with the following command:

/home/ecuser/swift/swift/cli/sft.py --model_type internvl2-2b --model_id_or_path /home/ecuser/.cache/modelscope/hub/OpenGVLab/InternVL2-2B --dataset dataset.jsonl --max_length 1000000 --report_to wandb --num_train_epochs 5 --use_flash_attn True --lora_target_modules ALL

Running inference with the code provided here, I get: "Token indices sequence length is longer than the specified maximum sequence length for this model (26333 > 8192). Running this sequence through the model will result in indexing errors". It looks like --max_length 1000000 had no effect.

How can I LoRA fine-tune a model while expanding the max length?

rokopi-byte avatar Jul 29 '24 17:07 rokopi-byte
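The "26333 > 8192" warning comes from the model's and tokenizer's own configured limits, not from --max_length. A quick way to inspect those limits with plain transformers (the llm_config attribute lookup is a guarded assumption for multimodal configs):

```python
# Inspect the context window the model config and tokenizer actually declare;
# --max_length only affects training samples, it does not enlarge this limit.
from transformers import AutoConfig, AutoTokenizer

model_id = "OpenGVLab/InternVL2-2B"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Multimodal configs often keep the text limit on a language-model sub-config;
# fall back to the top-level config if that attribute does not exist.
llm_config = getattr(config, "llm_config", config)
print("max_position_embeddings:", getattr(llm_config, "max_position_embeddings", None))
print("tokenizer.model_max_length:", tokenizer.model_max_length)
```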

> I fine-tuned InternVL2-2B on a custom dataset with the following command:
>
> /home/ecuser/swift/swift/cli/sft.py --model_type internvl2-2b --model_id_or_path /home/ecuser/.cache/modelscope/hub/OpenGVLab/InternVL2-2B --dataset dataset.jsonl --max_length 1000000 --report_to wandb --num_train_epochs 5 --use_flash_attn True --lora_target_modules ALL
>
> Running inference with the code provided here, I get: "Token indices sequence length is longer than the specified maximum sequence length for this model (26333 > 8192). Running this sequence through the model will result in indexing errors". It looks like --max_length 1000000 had no effect.
>
> How can I LoRA fine-tune a model while expanding the max length?

Right, this has no effect, because the model only supports at most 8192 tokens. If you want to expand the maximum model length, consider training with --rope_scaling xxx: https://swift.readthedocs.io/en/latest/LLM/Command-line-parameters.html

tastelikefeet avatar Aug 28 '24 07:08 tastelikefeet
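For readers unfamiliar with what --rope_scaling does, the underlying mechanism is the rope_scaling entry in the model config, which stretches the positional encoding so more positions fit in the same trained range. A hedged sketch in plain transformers follows; the type and factor are illustrative values only, and for ms-swift you would use the documented --rope_scaling flag rather than editing configs by hand.

```python
# Illustrative only: "rope_scaling" in the model config is the mechanism behind
# context extension; values here are examples, not a recommendation.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("OpenGVLab/InternVL2-2B", trust_remote_code=True)
llm_config = getattr(config, "llm_config", config)

# e.g. 8192 * 4 = 32768 usable positions after linear (position-interpolation) scaling;
# generation quality on long inputs still depends on fine-tuning at that length.
llm_config.rope_scaling = {"type": "linear", "factor": 4.0}
print(llm_config.rope_scaling)
```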