LLaMA-Factory
Training Qwen2.5-VL-3B: OOM errors, but my images genuinely are high-resolution; lowering cutoff_len then raises a shape mismatch
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
stage: sft
do_train: true
model_name_or_path: "Qwen/Qwen2.5-VL-3B-Instruct"
preprocessing_num_workers: 16
finetuning_type: lora
template: qwen2_vl
flash_attn: "auto"
trust_remote_code: true
dataset_dir: "data"
dataset: "qwen_vl_data"
cutoff_len: 24576
max_samples: 100000
packing: false
enable_thinking: true
image_max_pixels: 67108864
image_min_pixels: 1024
video_max_pixels: 65536
video_min_pixels: 256
learning_rate: 5.0e-05
num_train_epochs: 3.0
per_device_train_batch_size: 1
gradient_accumulation_steps: 32
lr_scheduler_type: "cosine"
max_grad_norm: 1.0
warmup_steps: 0
optim: "adamw_torch"
bf16: true
lora_rank: 8
lora_alpha: 16
lora_dropout: 0.0
lora_target: "all"
freeze_vision_tower: false
freeze_multi_modal_projector: false
logging_steps: 5
save_steps: 100
report_to: "none"
output_dir: "saves/Qwen2.5-VL-3B-Instruct/lora/train_2025-09-03-19-15-03"
plot_loss: true
ddp_timeout: 180000000
include_num_input_tokens_seen: true
deepspeed: "examples/deepspeed/ds_z3_offload_config.json"
I'm training on four H100 GPUs, but it still reports OOM errors. Are there any other ways to reduce VRAM usage? Or is there a multi-GPU parallelism setting that can split the tokens into four shards across the four GPUs?
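For reference, here is a minimal sketch of memory-saving overrides, using only keys already present in the config above; the values are illustrative, not tuned, and the token arithmetic assumes Qwen2.5-VL tokenizes images as 14x14 patches merged 2x2 (one visual token per 28x28 pixel block):

```yaml
# Illustrative memory-saving overrides (values are guesses, not tuned):
image_max_pixels: 1048576    # 1024*1024. The original 67108864 (8192*8192) yields
                             # ~67108864 / 784 ≈ 85,600 visual tokens per image,
                             # assuming one token per 28x28 pixel block; that alone
                             # exceeds cutoff_len: 24576.
cutoff_len: 8192             # must stay above the per-sample visual token count;
                             # truncating image placeholder tokens mid-sequence is a
                             # likely cause of the reported shape mismatch.
freeze_vision_tower: true    # drops ViT gradients and optimizer states.
flash_attn: fa2              # FlashAttention-2 if installed; reduces attention memory.
```

With ZeRO-3 offload already sharding parameters and optimizer states across the four GPUs, the remaining pressure is mostly activation memory, which scales with sequence length, so the per-sample token count is the main lever.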
Reproduction
Put your message here.
Others
No response
> Or is there a multi-GPU parallelism setting that can split the tokens into four shards across the four GPUs?
Sequence parallelism will be supported in the next version.
Hi, would lowering max_pixels plus cutoff_len also reduce VRAM usage?
Yes.
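A rough pairing rule when lowering the two together, under the same per-token assumption as above (roughly one visual token per 28x28 pixel block; verify against the actual processor output):

```yaml
# Size cutoff_len around the visual token budget (≈ image_max_pixels / 784):
image_max_pixels: 2359296   # 1536*1536 → ≈ 3,010 visual tokens per image
cutoff_len: 6144            # ≈ 3,010 image tokens plus ~3,000 tokens of text headroom
```

Keeping cutoff_len above the worst-case visual token count avoids truncating image placeholder tokens, which would desynchronize them from the vision features.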
Have you tried packing=true? Setting it seems to have no effect.