LLaMA-Factory
Training Qwen2.5-VL-3B: OOM errors, but my images genuinely are high-resolution; lowering cutoff_len then raises a shape mismatch
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
stage: sft
do_train: true
model_name_or_path: "Qwen/Qwen2.5-VL-3B-Instruct"
preprocessing_num_workers: 16
finetuning_type: lora
template: qwen2_vl
flash_attn: "auto"
trust_remote_code: true
dataset_dir: "data"
dataset: "qwen_vl_data"
cutoff_len: 24576
max_samples: 100000
packing: false
enable_thinking: true
image_max_pixels: 67108864
image_min_pixels: 1024
video_max_pixels: 65536
video_min_pixels: 256
learning_rate: 5.0e-05
num_train_epochs: 3.0
per_device_train_batch_size: 1
gradient_accumulation_steps: 32
lr_scheduler_type: "cosine"
max_grad_norm: 1.0
warmup_steps: 0
optim: "adamw_torch"
bf16: true
lora_rank: 8
lora_alpha: 16
lora_dropout: 0.0
lora_target: "all"
freeze_vision_tower: false
freeze_multi_modal_projector: false
logging_steps: 5
save_steps: 100
report_to: "none"
output_dir: "saves/Qwen2.5-VL-3B-Instruct/lora/train_2025-09-03-19-15-03"
plot_loss: true
ddp_timeout: 180000000
include_num_input_tokens_seen: true
deepspeed: "examples/deepspeed/ds_z3_offload_config.json"
I'm training on four H100 GPUs, but it still reports OOM errors. Are there any other ways to reduce VRAM usage? Or is there a multi-GPU parallelism setting that can split the tokens into four shards across the four GPUs?
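For reference, here is a minimal sketch of memory-saving overrides, using only keys already present in the config above; the values are illustrative, not tuned, and the token arithmetic assumes Qwen2.5-VL tokenizes images as 14x14 patches merged 2x2 (one visual token per 28x28 pixel block):

```yaml
# Illustrative memory-saving overrides (values are guesses, not tuned):
image_max_pixels: 1048576    # 1024*1024. The original 67108864 (8192*8192) yields
                             # ~67108864 / 784 ≈ 85,600 visual tokens per image,
                             # assuming one token per 28x28 pixel block; that alone
                             # exceeds cutoff_len: 24576.
cutoff_len: 8192             # must stay above the per-sample visual token count;
                             # truncating image placeholder tokens mid-sequence is a
                             # likely cause of the reported shape mismatch.
freeze_vision_tower: true    # drops ViT gradients and optimizer states.
flash_attn: fa2              # FlashAttention-2 if installed; reduces attention memory.
```

With ZeRO-3 offload already sharding parameters and optimizer states across the four GPUs, the remaining pressure is mostly activation memory, which scales with sequence length, so the per-sample token count is the main lever.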
Reproduction
Put your message here.
Others
No response
> Or is there a multi-GPU parallelism setting that can split the tokens into four shards across the four GPUs?
Sequence parallelism will be supported in the next version.
Hi, would lowering max_pixels plus cutoff_len also reduce VRAM usage?
Yes.
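A rough pairing rule when lowering the two together, under the same per-token assumption as above (roughly one visual token per 28x28 pixel block; verify against the actual processor output):

```yaml
# Size cutoff_len around the visual token budget (≈ image_max_pixels / 784):
image_max_pixels: 2359296   # 1536*1536 → ≈ 3,010 visual tokens per image
cutoff_len: 6144            # ≈ 3,010 image tokens plus ~3,000 tokens of text headroom
```

Keeping cutoff_len above the worst-case visual token count avoids truncating image placeholder tokens, which would desynchronize them from the vision features.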
Have you tried packing=true? Setting it seems to have no effect.