[BUG] Hanging problem when LoRA sft Qwen2.5Omni with multi-turn video-audio samples with ds-z3 or ds-z3-offload

Open Luffy966 opened this issue 8 months ago • 0 comments

Describe the bug

Training hangs when running LoRA SFT on Qwen2.5-Omni with multi-turn video-audio samples. We've discussed this at length in the LLaMA-Factory community: https://github.com/hiyouga/LLaMA-Factory/issues/7767. Oddly, the hang only occurs when we use ds-z3 or ds-z3-offload.
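For anyone trying to narrow down where each rank is stuck, here is a minimal sketch using Python's standard `faulthandler` module. It is not part of the original setup. The helper name `enable_hang_dump`, the `hang_traces` directory, and the 600-second interval are hypothetical choices for illustration, and it assumes the launcher exports a `RANK` environment variable (falling back to `0` otherwise).

```python
import faulthandler
import os


def enable_hang_dump(timeout_s: float = 600.0, log_dir: str = "hang_traces"):
    """Periodically dump the stack of every thread in this process to a per-rank file.

    Hypothetical helper: call it once near the start of the training script.
    """
    # Assumption: the distributed launcher sets RANK; default to "0" if it is missing.
    rank = os.environ.get("RANK", "0")
    os.makedirs(log_dir, exist_ok=True)
    path = os.path.join(log_dir, f"rank{rank}.log")

    # The file must stay open for as long as the periodic dump is active,
    # so it is returned and the caller should keep a reference to it.
    log_file = open(path, "w")

    # repeat=True writes a traceback every `timeout_s` seconds, whether or not
    # the process is actually hung.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)
    return log_file


if __name__ == "__main__":
    trace_file = enable_hang_dump()
    # ... training would start here ...
    faulthandler.cancel_dump_traceback_later()
    trace_file.close()
```

Because the dump repeats on a fixed interval, a healthy run shows changing stacks between dumps, while a hung rank shows the same frames each time. Comparing the per-rank files can then show whether the ranks are waiting in different places.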

To Reproduce

  1. Environment preparation:
# Install the transformers fork below to train Qwen2.5-Omni
pip install git+https://github.com/Kuangdd01/transformers.git@qwen25omni

# Use this branch when training with multimodal data in the system prompt
pip install git+https://github.com/Luffy-ZY-Wang/LLaMA-Factory.git@dev_my_branch
  2. Data preparation: follow https://github.com/hiyouga/LLaMA-Factory/issues/7767#issuecomment-2823732424
  3. llamafactory config:
### model
model_name_or_path: ./Qwen2.5-Omni-7B
image_max_pixels: 262144
video_max_pixels: 16384
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
deepspeed: ./examples/deepspeed/ds_z3_config.json


### dataset
dataset: mllm_mmsys_multiturn_video_audio_demo_alpaca
template: qwen2_omni
cutoff_len: 8192
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: saves/qwen2_omni-7b-video/lora/sft
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false

### train
use_audio_in_video: true
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
freeze_vision_tower: true
learning_rate: 1.0e-4
num_train_epochs: 25.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
ddp_timeout: 180000000
resume_from_checkpoint: null

System info

  • llamafactory version: 0.9.3.dev0
  • Platform: Linux-5.4.0-162-generic-x86_64-with-glibc2.31
  • Python version: 3.12.9
  • PyTorch version: 2.6.0+cu118 (GPU)
  • Transformers version: 4.50.0.dev0
  • Datasets version: 3.4.1
  • Accelerate version: 1.5.2
  • PEFT version: 0.15.1
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 4090
  • GPU number: 8
  • GPU memory: 23.65GB
  • DeepSpeed version: 0.16.4
  • vLLM version: 0.8.1

Luffy966, Apr 30 '25 03:04