LLaMA-Factory: Qwen-Omni DPO training hangs on mixed-modality data

Reminder

[x] I have read the above rules and searched the existing issues.

System Info

[2025-05-25 15:24:24,879] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 05-25 15:24:26 [importing.py:53] Triton module has been replaced with a placeholder.
INFO 05-25 15:24:26 [__init__.py:239] Automatically detected platform cuda.

llamafactory version: 0.9.3.dev0
Platform: Linux-5.15.0-119-generic-x86_64-with-glibc2.35
Python version: 3.11.0
PyTorch version: 2.6.0+cu124 (GPU)
Transformers version: 4.50.0.dev0
Datasets version: 3.5.0
Accelerate version: 1.6.0
PEFT version: 0.15.1
TRL version: 0.9.6
GPU type: NVIDIA A800-SXM4-80GB
GPU number: 8
GPU memory: 79.32GB
DeepSpeed version: 0.16.7
vLLM version: 0.8.5
Git commit: 75d7c35fdf4bfef6124df184d41e8425802e91e4

Reproduction

**The dataset is defined as follows:**
    "OURDATA_ALL_MIX_DPO": {
        "file_name": "OURDATA_ALL_MIX_DPO.json",
        "formatting": "sharegpt",
        "ranking": true,
        "columns": {
            "messages": "conversations",
            "chosen": "chosen",
            "rejected": "rejected",
            "images": "images",
            "audios": "audios",
            "videos": "videos"
        }
    },
**The training configuration is as follows:**
### model
model_name_or_path: "/data/LLaMA-Factory/saves/ANSWER_POSITION/ourdata_as_cot_others_no_cot_1/checkpoint-1200/7168final" # trained on top of the ourdata SFT checkpoint
image_max_pixels: 262144
video_max_pixels: 16384
video_fps: 1
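video_maxlen: 48 # maximum total number of input video frames; the original code handles a single video, we modified the source so that with multiple videos maxlen is allocated by video duration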
trust_remote_code: true

### method
stage: dpo
do_train: true
finetuning_type: full
freeze_vision_tower: true
freeze_multi_modal_projector: true
freeze_language_model: false
deepspeed: examples/deepspeed/ds_z3_config.json

### dataset
dataset: OURDATA_ALL_MIX_DPO
template: qwen2_omni
cutoff_len: 4096
max_samples: 1000000
overwrite_cache: true
preprocessing_num_workers: 100
dataloader_num_workers: 100

### output
output_dir: saves/ourdata_dpo/OURDATA_ALL_MIX_DPO
logging_steps: 10
save_steps: 200
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1 # 1
gradient_accumulation_steps: 16 # 16
learning_rate: 5.0e-6
num_train_epochs: 1.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
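**Problem**
When running DPO training on Qwen-Omni-7B with the configuration and dataset above, training stays stuck at the first step. The dataset mixes several modalities; specifically, it contains four kinds of samples: Text -> Text, Image+Text -> Text, Video+Text -> Text, and Audio+Text -> Text. Before this run we had already run DPO on Qwen-Omni-7B with each of the four kinds separately, and every run trained successfully. But when we mix the four kinds together for DPO training on Qwen-Omni-7B, training hangs at the first step. We tried:
1. Lowering video_fps to 1 and setting video_maxlen: 48. This had no effect.
2. Setting the environment variable export NCCL_P2P_LEVEL=NVL before training. This also had no effect.

We are not sure whether DPO training on this kind of mixed data is currently unsupported; SFT on the same mixed data works without problems.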
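To confirm which of the four sample kinds a dataset file actually contains, a small script can tally them. The following is a minimal sketch, assuming each sample is a dict that uses the "images", "audios" and "videos" columns from the dataset definition above; the helper names `modality_signature` and `count_signatures` are hypothetical and not part of LLaMA-Factory.

```python
from collections import Counter

# Column names taken from the dataset definition in this issue.
MODALITY_COLUMNS = ("images", "audios", "videos")


def modality_signature(sample):
    """Label a sample by the modalities it carries, e.g. 'Image+Text -> Text'."""
    # "images" -> "Image", "audios" -> "Audio", "videos" -> "Video"
    present = [col[:-1].capitalize() for col in MODALITY_COLUMNS if sample.get(col)]
    return "+".join(present + ["Text"]) + " -> Text"


def count_signatures(samples):
    """Count how many samples fall into each modality combination."""
    return Counter(modality_signature(s) for s in samples)


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real JSON file (illustrative data only).
    demo = [
        {"conversations": [], "images": ["a.jpg"]},
        {"conversations": [], "audios": ["b.wav"]},
        {"conversations": []},
    ]
    for signature, n in sorted(count_signatures(demo).items()):
        print(signature, n)
```

For the real file, the list passed to `count_signatures` would come from `json.load` on the dataset JSON.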

Others

Here are some training samples from the dataset: Audio+Text -> Text: {"conversations": [{"from": "human", "value": "

May 25 '25 07:05 wwfnb

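Hi, could you try ZeRO-2 and see what error it reports? A ZeRO-3 hang is hard to diagnose at a glance. The DPO data collator has no fake_inputs behavior, so under ZeRO-3 the devices end up with different gradient information and training hangs. @wwfnb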
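The reasoning in this comment can be illustrated with a toy model. Under ZeRO-3, parameters are partitioned across ranks and gathered collectively when a module runs, so a rank whose sample skips an encoder that another rank runs can leave the other rank waiting. The sketch below only models which encoders each rank would run in one step and how dummy ("fake") inputs make the ranks agree. All names are hypothetical, it is not LLaMA-Factory's collator, and a real fix would also have to keep the dummy inputs from contributing to the loss.

```python
from typing import Dict, List, Set

# Column names taken from the dataset definition in this issue.
MODALITIES = ("images", "audios", "videos")


def encoders_needed(sample: Dict[str, list]) -> Set[str]:
    """Modalities whose encoder would run for this sample."""
    return {m for m in MODALITIES if sample.get(m)}


def ranks_diverge(per_rank_samples: List[Dict[str, list]]) -> bool:
    """True if the ranks would run different encoder sets in the same step."""
    needed = [encoders_needed(s) for s in per_rank_samples]
    return any(n != needed[0] for n in needed[1:])


def add_fake_inputs(sample: Dict[str, list], placeholder: str = "<fake>") -> Dict[str, list]:
    """Return a copy in which every missing modality gets one dummy input."""
    padded = dict(sample)
    for m in MODALITIES:
        if not padded.get(m):
            padded[m] = [placeholder]
    return padded


if __name__ == "__main__":
    # One step on two ranks: rank 0 gets an image sample, rank 1 an audio sample.
    step = [
        {"images": ["a.jpg"], "audios": [], "videos": []},
        {"images": [], "audios": ["b.wav"], "videos": []},
    ]
    print(ranks_diverge(step))                                # True
    print(ranks_diverge([add_fake_inputs(s) for s in step]))  # False
```

This also matches the observation in the issue that each modality trains fine on its own: with a single-modality dataset every rank runs the same encoders in every step.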

@Kuangdd01 OK, but I probably won't have results until tomorrow; no GPUs are available right now.

May 25 '25 08:05 wwfnb

How can I train Qwen-Omni with other languages?

Regarding issues such as speech tokenization, ... is it necessary to extend the vocab? And how can discrete units be supplemented for other languages?

Can anyone help me with a solution?

May 25 '25 14:05 phanxuanphucnd

If I understand correctly, your question is similar to this.

May 25 '25 17:05 Kuangdd01

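I ran into the same problem: training on mixed audio + text data with DeepSpeed ZeRO-3 hangs right at the start, with GPU utilization at 100%. Training works fine with DeepSpeed ZeRO-2. I have tried many DeepSpeed versions and none of them work with ZeRO-3.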

Jun 11 '25 08:06 wulaoshi

Is this with DPO?

Jun 11 '25 08:06 Kuangdd01

Not DPO; it is full-parameter SFT.

Jun 11 '25 09:06 wulaoshi

Please post your config. I recall testing full SFT with ZeRO-3 without any hang; the known issue is that LoRA with ZeRO-3 hangs.

Jun 11 '25 11:06 Kuangdd01

I did not see any hang earlier with full SFT on mixed data; the hang only appeared later with DPO.

Jun 11 '25 11:06 wwfnb

I'll try to reproduce it later.

Jun 11 '25 12:06 Kuangdd01

Platform: CUDA Version: 12.4, A100 x 2

Key packages in the environment:

accelerate 1.7.0
accelerator 2024.9.13
datasets 3.5.1
deepspeed 0.16.4
fastapi 0.115.8
fastapi-cli 0.0.7
fastjsonschema 2.19.1
fastrlock 0.8.3
flash_attn 2.7.4.post1
gguf 0.10.0
gradio 5.12.0
gradio_client 1.5.4
huggingface-hub 0.30.2
humanfriendly 10.0
Jinja2 3.1.6
lazy_loader 0.4
librosa 0.10.2.post1
liger_kernel 0.5.9
llamafactory 0.9.3.dev0
ninja 1.11.1.1
numba 0.61.0
numpy 1.26.4
nvidia-cublas-cu12 12.4.5.8
nvidia-cuda-cupti-cu12 12.4.127
nvidia-cuda-nvrtc-cu12 12.4.127
nvidia-cuda-runtime-cu12 12.4.127
nvidia-cudnn-cu12 9.1.0.70
nvidia-cufft-cu12 11.2.1.3
nvidia-curand-cu12 10.3.5.147
nvidia-cusolver-cu12 11.6.1.9
nvidia-cusparse-cu12 12.3.1.170
nvidia-cusparselt-cu12 0.6.2
nvidia-ml-py 12.570.86
nvidia-nccl-cu12 2.21.5
nvidia-nvjitlink-cu12 12.4.127
nvidia-nvtx-cu12 12.4.127
torch 2.6.0
torchaudio 2.6.0
torchvision 0.21.0
tornado 6.4
transformers 4.52.4
transformers-stream-generator 0.0.5
triton 3.2.0
trl 0.9.6
vllm 0.8.3
xformers 0.0.29.post2
xgrammar 0.1.17

Launch YAML:
### model
model_name_or_path: /data/models/Qwen/Qwen2.5-Omni-7B
image_max_pixels: 262144
video_max_pixels: 16384
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full
freeze_vision_tower: true
freeze_multi_modal_projector: true
freeze_language_model: false
deepspeed: examples/deepspeed/ds_z3_config.json
enable_liger_kernel: True

### dataset
dataset: test_audio_v2
template: qwen2_omni
cutoff_len: 10240
max_samples: 10000
overwrite_cache: true
preprocessing_num_workers: 128
dataloader_num_workers: 16

### output
output_dir: saves/qwen2_omni-7b/full/sft_v6_test
logging_steps: 1
save_steps: 200     # 350
plot_loss: true
overwrite_output_dir: true
save_only_model: true
report_to: swanlab  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 16
learning_rate: 2.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.0
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
use_cache: false
flash_attn: fa2   # enable FlashAttention-2

Also, to avoid a device error, I added the line kaiser_window = kaiser_window.to(sinc_filter.device) at line 3791 of the model code file in the underlying transformers package. File path: <dependency package path>/transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py

Jun 12 '25 02:06 wulaoshi

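The main differences between our setups are the transformers and deepspeed versions. I just upgraded deepspeed to 0.16.7 and it still hangs; I previously used the same transformers version as you and it hung as well.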

Jun 12 '25 02:06 wulaoshi

Has anyone solved this problem?

Jun 25 '25 09:06 aleien95

I reproduced the issue and found that it hangs in _conv_forward (torch/nn/modules/conv.py). I'm not sure whether you are seeing the same thing; my data is a mix of image and audio DPO data. @wwfnb @aleien95 @wulaoshi

Jun 25 '25 10:06 Kuangdd01

In my case it hangs under def _invoke_run: while True: (torch/distributed/elastic/agent/server/api.py); the process seems to stay in this while loop the whole time.

Jul 24 '25 01:07 wulaoshi
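As far as I can tell, the `_invoke_run` loop in torch/distributed/elastic/agent/server/api.py belongs to the launcher (elastic agent) process, which keeps polling its workers, so a stack captured there may not show where a training worker is blocked. One way to see the worker side is Python's standard `faulthandler` module, which can dump every thread's stack after a timeout. This is a minimal sketch; the helper names are hypothetical, and where to call them inside the training entry point is an assumption left to the reader.

```python
import faulthandler
import sys


def arm_hang_watchdog(timeout_s=600.0, stream=None):
    """Dump all thread stacks every `timeout_s` seconds until disarmed.

    `stream` must be a real file object with a file descriptor;
    it defaults to stderr.
    """
    faulthandler.dump_traceback_later(
        timeout_s,
        repeat=True,
        file=stream if stream is not None else sys.stderr,
    )


def disarm_hang_watchdog():
    """Cancel the pending dump, e.g. once the first training step has finished."""
    faulthandler.cancel_dump_traceback_later()
```

Arming the watchdog in each worker before training starts would make a rank that is stuck in the first step print its Python stack periodically, which can then be compared across ranks.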

Has the DPO hang been solved?

Jul 29 '25 03:07 Qinnns

Same here: SFT hangs. Has this been solved?

Sep 11 '25 17:09 Eureka-Maggie

For a solution, see: https://github.com/modelscope/ms-swift/issues/5938

Sep 25 '25 01:09 wulaoshi