
Does the --max_length parameter fail to take effect on custom datasets?

Open kratorado opened this issue 1 year ago • 8 comments

When training with a custom dataset, I found that GPU memory would occasionally blow up (OOM). Only after manually removing the longer samples did training run normally. But I had passed the --max_length parameter, and it does not seem to take effect. Does --max_length not apply to custom datasets? If it indeed doesn't, could this option be added?

kratorado avatar Jul 04 '24 12:07 kratorado

By "custom dataset", do you mean something passed as --dataset {local_path}?

Jintao-Huang avatar Jul 04 '24 13:07 Jintao-Huang

> By "custom dataset", do you mean something passed as --dataset {local_path}?

Yes.

kratorado avatar Jul 04 '24 15:07 kratorado
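For context, a "custom dataset" here is just a local JSONL file passed via --dataset. A minimal sketch of what one line of such a file might look like is below; the field names ("query"/"response") are an assumption, so check the ms-swift custom-dataset documentation for the exact schema your version expects.

```python
# Hypothetical single line of a custom JSONL dataset.
# Field names ("query"/"response") are an assumption, not the confirmed ms-swift schema.
import json

sample = {"query": "Summarize the following report ...", "response": "The report covers ..."}
with open("dataset.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```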

> When training with a custom dataset, I found that GPU memory would occasionally blow up (OOM). Only after manually removing the longer samples did training run normally. But I had passed the --max_length parameter, and it does not seem to take effect. Does --max_length not apply to custom datasets? If it indeed doesn't, could this option be added?

Hi, I have run into this problem as well. How did you solve it?

fly-dragon211 avatar Jul 07 '24 15:07 fly-dragon211

> When training with a custom dataset, I found that GPU memory would occasionally blow up (OOM). Only after manually removing the longer samples did training run normally. But I had passed the --max_length parameter, and it does not seem to take effect. Does --max_length not apply to custom datasets? If it indeed doesn't, could this option be added?
>
> Hi, I have run into this problem as well. How did you solve it?

By filtering manually.

kratorado avatar Jul 08 '24 02:07 kratorado
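For anyone hitting the same OOM, here is a minimal sketch of that manual filtering: measure each sample's tokenized length and drop the ones over a limit. File paths, field names, and the query+response concatenation are assumptions, not the exact preprocessing ms-swift applies.

```python
# Minimal sketch: drop samples whose tokenized length exceeds a limit before training.
# Field names, paths, and the length approximation are assumptions.
import json
from transformers import AutoTokenizer

MAX_LENGTH = 2048
tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat", trust_remote_code=True)

kept = []
with open("dataset.jsonl", encoding="utf-8") as f:
    for line in f:
        sample = json.loads(line)
        # Rough estimate: the real chat template adds extra tokens, so keep some headroom.
        text = sample["query"] + sample["response"]
        if len(tokenizer(text)["input_ids"]) <= MAX_LENGTH:
            kept.append(line)

with open("dataset.filtered.jsonl", "w", encoding="utf-8") as f:
    f.writelines(kept)

print(f"kept {len(kept)} samples")
```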

In my tests here, plain LLMs are fine. Are you using a multimodal LLM?

Jintao-Huang avatar Jul 08 '24 04:07 Jintao-Huang

> In my tests here, plain LLMs are fine. Are you using a multimodal LLM?

glm4-9b-chat

kratorado avatar Jul 08 '24 05:07 kratorado

In that case, samples exceeding max_length should be dropped. During training, the dataset statistics are printed in the command-line output; could you find them and take a look?

Jintao-Huang avatar Jul 08 '24 11:07 Jintao-Huang

I fine-tuned InternVL2-2B on a custom dataset with the following command:

/home/ecuser/swift/swift/cli/sft.py --model_type internvl2-2b --model_id_or_path /home/ecuser/.cache/modelscope/hub/OpenGVLab/InternVL2-2B --dataset dataset.jsonl --max_length 1000000 --report_to wandb --num_train_epochs 5 --use_flash_attn True --lora_target_modules ALL

Running inference with the code provided here, I get: "Token indices sequence length is longer than the specified maximum sequence length for this model (26333 > 8192). Running this sequence through the model will result in indexing errors". It looks like --max_length 1000000 had no effect.

How can I LoRA fine-tune a model while expanding the max length?

rokopi-byte avatar Jul 29 '24 17:07 rokopi-byte
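The "26333 > 8192" warning comes from the model's and tokenizer's own configured limits, not from --max_length. A quick way to inspect those limits with plain transformers (the llm_config attribute lookup is a guarded assumption for multimodal configs):

```python
# Inspect the context window the model config and tokenizer actually declare;
# --max_length only affects training samples, it does not enlarge this limit.
from transformers import AutoConfig, AutoTokenizer

model_id = "OpenGVLab/InternVL2-2B"
config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Multimodal configs often keep the text limit on a language-model sub-config;
# fall back to the top-level config if that attribute does not exist.
llm_config = getattr(config, "llm_config", config)
print("max_position_embeddings:", getattr(llm_config, "max_position_embeddings", None))
print("tokenizer.model_max_length:", tokenizer.model_max_length)
```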

> I fine-tuned InternVL2-2B on a custom dataset with the following command:
>
> /home/ecuser/swift/swift/cli/sft.py --model_type internvl2-2b --model_id_or_path /home/ecuser/.cache/modelscope/hub/OpenGVLab/InternVL2-2B --dataset dataset.jsonl --max_length 1000000 --report_to wandb --num_train_epochs 5 --use_flash_attn True --lora_target_modules ALL
>
> Running inference with the code provided here, I get: "Token indices sequence length is longer than the specified maximum sequence length for this model (26333 > 8192). Running this sequence through the model will result in indexing errors". It looks like --max_length 1000000 had no effect.
>
> How can I LoRA fine-tune a model while expanding the max length?

Right, this has no effect, because the model only supports at most 8192 tokens. If you want to expand the maximum model length, consider training with --rope_scaling xxx: https://swift.readthedocs.io/en/latest/LLM/Command-line-parameters.html

tastelikefeet avatar Aug 28 '24 07:08 tastelikefeet
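For readers unfamiliar with what --rope_scaling does, the underlying mechanism is the rope_scaling entry in the model config, which stretches the positional encoding so more positions fit in the same trained range. A hedged sketch in plain transformers follows; the type and factor are illustrative values only, and for ms-swift you would use the documented --rope_scaling flag rather than editing configs by hand.

```python
# Illustrative only: "rope_scaling" in the model config is the mechanism behind
# context extension; values here are examples, not a recommendation.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("OpenGVLab/InternVL2-2B", trust_remote_code=True)
llm_config = getattr(config, "llm_config", config)

# e.g. 8192 * 4 = 32768 usable positions after linear (position-interpolation) scaling;
# generation quality on long inputs still depends on fine-tuning at that length.
llm_config.rope_scaling = {"type": "linear", "factor": 4.0}
print(llm_config.rope_scaling)
```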