[Bug] When fine-tuning the internvl3.5_1B model with use_packed_ds enabled, packing has no effect: num_samples stays fixed and every sample reaches the maximum token length
Checklist
- [x] 1. I have searched related issues but cannot get the expected help.
- [x] 2. The bug has not been fixed in the latest version.
- [x] 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
Describe the bug
I enabled use_packed_ds when training internvl3.5_1B.
Training shell script: /InternVL3_5/internvl_chat_gpt_oss/shell/internvl3_5_qwen3/internvl3_5_1b_sft.sh; corresponding .py file: /InternVL3_5/internvl_chat_gpt_oss/internvl/train/internvl_chat_finetune.py. The training task is single-image, single-turn dialogue.
The relevant shell configuration:
```shell
NPROC_PER_NODE=${NPROC_PER_NODE:-4}
BATCH_SIZE=${BATCH_SIZE:-32}
PER_DEVICE_BATCH_SIZE=${PER_DEVICE_BATCH_SIZE:-1}
GRADIENT_ACC=$((BATCH_SIZE / PER_DEVICE_BATCH_SIZE / NPROC_PER_NODE))
...
--force_image_size 448
--max_dynamic_patch 6
--max_seq_length 16384
...
--gradient_checkpointing True
--group_by_length False
--dynamic_image_size True
--use_thumbnail True
--ps_version 'v2'
--use_custom_flash_attn False
--report_to "tensorboard"
--deepspeed "zero_stage1_config.json"
--use_packed_ds True
--num_images_expected 96
--max_packed_tokens 32768
--max_buffer_size 20
```
During training I found that even with use_packed_ds set to True, the number of samples packed together is always fixed at 2, and every sample's token count equals max_seq_length (which is consistent: max_packed_tokens = 32768 holds exactly two samples of max_seq_length = 16384).
After some digging, the cause is that the multi_modal_get_item function in internvl_chat_finetune.py calls the preprocessing function preprocess_function without passing the use_packed_ds argument, so every sample is padded to the maximum length before packing.
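Why the missing argument matters: in the upstream InternVL preprocess_* functions, the tokenizer padding mode is selected from these flags, roughly as below (paraphrased; the exact code in internvl_chat_gpt_oss may differ):

```python
# Inside the preprocess_* functions (paraphrased sketch). When neither
# group_by_length nor use_packed_ds is True, padding falls back to
# 'max_length', so every sample is padded to tokenizer.model_max_length
# (= max_seq_length).
input_ids = tokenizer(
    conversations,
    return_tensors='pt',
    padding=False if group_by_length or use_packed_ds else 'max_length',
    max_length=tokenizer.model_max_length,
    truncation=True,
).input_ids
```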
Original code:
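A sketch of the call site before the change; the argument list follows the upstream InternVL multi_modal_get_item and may not match this repo's file exactly:

```python
# internvl_chat_finetune.py, multi_modal_get_item -- before the fix (sketch).
# use_packed_ds is not forwarded, so preprocess_function falls back to
# padding='max_length' and every sample comes out at max_seq_length tokens.
ret = preprocess_function(
    self.template_name, [deepcopy(data_item['conversations'])],
    self.tokenizer, [self.num_image_token * num_patches],
    group_by_length=self.group_by_length, ds_name=self.ds_name)
```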
Modified code:
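And with use_packed_ds forwarded (again a sketch, not the literal code from the repo):

```python
# internvl_chat_finetune.py, multi_modal_get_item -- after the fix (sketch).
# With use_packed_ds forwarded, the tokenizer skips padding and the packed
# dataset can concatenate variable-length samples up to max_packed_tokens.
ret = preprocess_function(
    self.template_name, [deepcopy(data_item['conversations'])],
    self.tokenizer, [self.num_image_token * num_patches],
    group_by_length=self.group_by_length,
    use_packed_ds=self.use_packed_ds, ds_name=self.ds_name)
```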
Anyone running into the same problem can use this change as a reference.
Reproduction
...
Environment
The official conda environment.
Error traceback