
About SFT efficiency for InternVL3.5

NeutrinoLiu opened this issue 3 months ago · 8 comments

Hi, I was trying to SFT InternVL3.5-4B on a 128xA100 GPU cluster. Yet with bs=512 and grad_accum=4, each step still takes around 1 minute, which is quite slow for a 4B model (I also freeze the ViT). I'm wondering, is there any reason for such low SFT efficiency?
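For scale, a quick back-of-the-envelope on these numbers (a sketch assuming bs=512 counts actual samples per optimizer step, an assumption the maintainer's reply below revises):

```python
# Back-of-the-envelope throughput implied by the reported numbers.
# Assumption (revised further down the thread): bs=512 is the global
# number of samples per optimizer step.
global_bs = 512       # samples per optimizer step (assumed)
step_time_s = 60      # ~1 minute per step, as reported
gpus = 128            # 128 x A100

print(f"{global_bs / step_time_s:.1f} samples/s overall")         # ~8.5
print(f"{global_bs / step_time_s / gpus:.2f} samples/s per GPU")  # ~0.07
```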

NeutrinoLiu · Sep 05 '25 21:09

I also ran into this problem. Apart from that, I found the memory consumption is also bigger than for InternVL2.5, about double, which is annoying. And flash attention was not supported... @Weiyun1025, are there any suggestions?

BruceYu-Bit · Sep 08 '25 12:09

Thank you for your interest in our work. Our training script for the 4B model sets use_packed_ds=True by default, which packs multiple samples into a single sequence. In that case, per_device_train_batch_size should be set to 1, since each sample during packed training already comprises a series of samples. If you set per_device_train_batch_size to 512, the actual training batch size is much larger than 512.
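To make the arithmetic concrete, here is a minimal sketch (not the InternVL training code; avg_samples_per_pack is a hypothetical figure that depends on max_packed_tokens and your data) of how packing inflates the effective batch size:

```python
# Sketch: effective batch size under packed training.
# With use_packed_ds=True, each "sample" the dataloader yields is already
# a pack of several conversations, so:
per_device_bs = 1          # recommended value when use_packed_ds=True
grad_accum = 4
world_size = 128           # e.g. 128 x A100
avg_samples_per_pack = 8   # hypothetical; depends on max_packed_tokens

effective_bs = per_device_bs * grad_accum * world_size * avg_samples_per_pack
print(effective_bs)  # 4096 actual samples per optimizer step in this sketch
# With per_device_bs=512 the same formula gives ~2M samples per step.
```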

Weiyun1025 · Sep 08 '25 12:09

Thank you for your patience. What about the memory consumption, is there any solution?

BruceYu-Bit · Sep 08 '25 13:09

You can try to set max_packed_tokens and num_images_expected to smaller values. BTW, our codebase supports flash attention; can you share more information about the issue you encountered when using flash attention?
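For intuition, a minimal greedy-packing sketch (not the InternVL implementation): each pack is capped at max_packed_tokens, so lowering it directly caps the packed sequence length for which activations are materialized.

```python
# Hypothetical greedy packer: why a smaller max_packed_tokens lowers peak
# memory. Activation memory per packed sequence grows with its token count,
# which this cap bounds.
def pack_samples(samples, max_packed_tokens):
    packs, current, current_len = [], [], 0
    for tokens in samples:  # `tokens`: list of token ids for one sample
        if current and current_len + len(tokens) > max_packed_tokens:
            packs.append(current)          # close the full pack
            current, current_len = [], 0
        current.extend(tokens)
        current_len += len(tokens)
    if current:
        packs.append(current)
    return packs

# Example: samples of 1500/2000/3000 tokens under a 4096-token cap.
lengths = [list(range(n)) for n in (1500, 2000, 3000)]
print([len(p) for p in pack_samples(lengths, 4096)])  # [3500, 3000]
```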

Weiyun1025 · Sep 08 '25 13:09

Thank you once again. I am fine-tuning InternVL3.5-2B with use_custom_flash_attn=True, and I encounter an error like this one: https://github.com/OpenGVLab/InternVL/issues/1135, i.e. RuntimeError: The size of tensor a (64) must match the size of tensor b (128) at non-singleton dimension 3. Why can't use_custom_flash_attn be set for 3.5?
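For what it's worth, that error pattern is easy to reproduce in isolation (purely hypothetical code, not InternVL's): it is what PyTorch raises when a tensor built for one head dimension, e.g. a rotary-embedding table for head_dim=128, is broadcast against q/k with head_dim=64.

```python
import torch

# Hypothetical repro of the broadcast failure, not InternVL code.
q = torch.randn(1, 8, 16, 64)     # (batch, heads, seq, head_dim=64)
cos = torch.randn(1, 1, 16, 128)  # rotary table built for head_dim=128
try:
    _ = q * cos                   # last dims 64 vs 128 cannot broadcast
except RuntimeError as e:
    print(e)  # "The size of tensor a (64) must match the size of
              #  tensor b (128) at non-singleton dimension 3"
```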

BruceYu-Bit · Sep 09 '25 02:09

BTW, I tried turning off use_packed_ds, and the memory usage was not significantly reduced.

BruceYu-Bit · Sep 09 '25 02:09

For the same SFT data, I compared InternVL3-8B and InternVL3_5-4B; the training durations were 90 and 180 hours respectively. I had already set use_packed_ds=False in both cases to strictly control the batch size at each training step.

The main differences also include --gradient_checkpointing True, --group_by_length False, and --use_custom_flash_attn False for InternVL3_5-4B, as suggested in the provided scripts, as well as --split_annotations False, which I don't think matters.

As the smaller model, why does InternVL3_5-4B need more training time than InternVL3-8B?

FYI, the log of one training step for InternVL3-8B

  5%|▌         | 6399/126812 [4:51:39<99:43:04,  2.98s/it]
dynamic ViT batch size: 9, images per sample: 4.5, dynamic token length: 3408
[2025-10-09 20:35:43,870] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | optimizer_allgather: 165.80 | optimizer_gradients: 4.19 | optimizer_step: 5.94
[2025-10-09 20:35:43,871] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd_microstep: 571.07 | bwd_microstep: 1774.90 | bwd_inner_microstep: 1707.51 | bwd_allreduce_microstep: 67.34 | step_microstep: 187.01
[2025-10-09 20:35:43,872] [INFO] [logging.py:128:log_dist] [Rank 0] time (ms) | fwd: 571.06 | bwd: 1774.91 | bwd_inner: 1707.51 | bwd_allreduce: 67.35 | step: 187.02

InternVL3_5-4B

  5%|▌         | 6399/126812 [8:45:01<161:49:15,  4.84s/it]
[2025-10-19 13:14:20,173] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | optimizer_allgather: 117.42 | optimizer_gradients: 2.63 | optimizer_step: 3.32
[2025-10-19 13:14:20,174] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | fwd_microstep: 1120.06 | bwd_microstep: 3522.59 | bwd_inner_microstep: 3309.21 | bwd_allreduce_microstep: 213.32 | step_microstep: 129.75
[2025-10-19 13:14:20,174] [INFO] [logging.py:107:log_dist] [Rank 0] time (ms) | fwd: 1120.05 | bwd: 3522.60 | bwd_inner: 3309.21 | bwd_allreduce: 213.34 | step: 129.75
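
Summing the logged per-microstep timings (plain arithmetic on the numbers above, nothing model-specific):

```python
# fwd/bwd times (ms) copied from the two DeepSpeed logs above.
internvl3_8b   = {"fwd": 571.06,  "bwd": 1774.91}
internvl3_5_4b = {"fwd": 1120.05, "bwd": 3522.60}

t_8b = sum(internvl3_8b.values())    # ~2346 ms per microstep
t_4b = sum(internvl3_5_4b.values())  # ~4643 ms per microstep
print(f"4B/8B fwd+bwd ratio: {t_4b / t_8b:.2f}x")  # ~1.98x
# Consistent with the progress bars: 2.98 s/it vs 4.84 s/it.
```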

shuoyinn · Oct 19 '25 05:10

Is there any training-efficiency optimization for InternVL3_5 that supports the non-packed training method, like InternVL3?

shuoyinn · Oct 19 '25 05:10