Hanging problem when LoRA fine-tuning Qwen2.5-Omni with multi-turn video-audio samples under DeepSpeed ZeRO-3
Reminder
- [x] I have read the above rules and searched the existing issues.
System Info
- llamafactory version: 0.9.3.dev0
- Platform: Linux-5.4.0-162-generic-x86_64-with-glibc2.31
- Python version: 3.12.9
- PyTorch version: 2.6.0+cu118 (GPU)
- Transformers version: 4.50.0.dev0
- Datasets version: 3.4.1
- Accelerate version: 1.5.2
- PEFT version: 0.15.1
- TRL version: 0.9.6
- GPU type: NVIDIA GeForce RTX 4090
- GPU number: 8
- GPU memory: 23.65GB
- DeepSpeed version: 0.16.4
- vLLM version: 0.8.1
Reproduction
When I run LoRA SFT on Qwen2.5-Omni with special tokens, it always gets stuck here for more than a whole day:
swanlab: 🚀 View run at https://swanlab.cn/@Luffy/llamafactory/runs/...
0%| | 0/3840 [00:00<?, ?it/s]
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|logging.py:329] 2025-04-19 04:32:26,803 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
or it sometimes aborts with a timeout error (not always triggered):
Watchdog caught collective operation timeout: WorkNCCL(SeqNum=17, OpType=_REDUCE_SCATTER_BASE, NumelIn=544997376, NumelOut=272498688, Timeout(ms)=600000) ran for 600006 milliseconds before timing out.
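For reference, a minimal debugging sketch (not something I had enabled in the run above): these are standard PyTorch/NCCL environment variables, not anything LLaMA-Factory-specific, that make the stalled rank and collective show up in the logs instead of only a bare watchdog timeout.

```bash
# Hedged sketch: set these before relaunching the same training command so the
# hanging collective is reported per rank rather than only by the watchdog.
export NCCL_DEBUG=INFO                    # per-rank NCCL communicator logs
export TORCH_DISTRIBUTED_DEBUG=DETAIL     # c10d checks for mismatched collectives across ranks
export TORCH_NCCL_ASYNC_ERROR_HANDLING=1  # turn a stuck collective into a hard error instead of a silent hang
```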
My config.yaml:
### model
model_name_or_path: "/models/Huggingface_download/Qwen2.5-Omni-7B"
trust_remote_code: true
new_special_tokens: "<facial_expression>,</facial_expression>,<body_movement>,</body_movement>,<speech_prompt>,</speech_prompt>,<content>,</content>"
skip_special_tokens: false
# flash_attn: False
torch_empty_cache_steps: 1 # any value that evenly divides global_step; useful when GPU memory is tight
resize_vocab: true
### lower resolution
image_max_pixels: 50176 #50176 #15680 # default 262144; 200704=256*28*28, 31360=40*28*28, 23520=30*28*28, 15680=20*28*28
video_max_pixels: 100352 #100352 #15680 # default 16384 (128*128); 65536=256*256, 39200=50*28*28, 78400=100*28*28, 100352=128*28*28
video_fps: 2.0 # default 2.0
video_maxlen: 38 # max total number of video frames # originally applied per single video; I modified the source to allocate maxlen by video duration # vanno=49, vchat=48
### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 16 # increase when there is more data
lora_alpha: 32 # increase when there is more data
lora_target: all
additional_target: "embed_tokens,lm_head"
lora_dropout: 0.05
deepspeed: llamafactory_train/ds_z3_config.json # runs with ds=0.16.4 and 0.15.0, errors with ds=0.16.5 # ds_z3_config_offload
enable_liger_kernel: True # runs with ds=0.16.4 and 0.15.0, errors with ds=0.16.5
use_unsloth_gc: True # runs with ds=0.16.4 and 0.15.0, errors with ds=0.16.5
### dataset
dataset_dir: data/final_dataset # directory containing dataset_info.json, customized
dataset: "CPED_RE_omni_chat_train" # video_annotation, audio_chat, audio_annotation, text_chat, omni_chat
template: qwen2_omni
cutoff_len: 8192 # 14800 # 2048 # 16384
# max_samples: 10 # for DEBUG
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4
mask_history: True # train only on the current dialogue turn (not multi-turn training)
### output
output_dir: "/models/checkpoints/Qwen2.5-Omni-7B_train_lora_omni_chat_2025-04-19-03-52-19"
logging_steps: 10 # log loss etc. every N steps (reported to swanlab)
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
### qwen_omni train
use_audio_in_video: true
freeze_vision_tower: true
# freeze_multi_modal_projector: true
# freeze_language_model: false
### train
per_device_train_batch_size: 1 # per-device (GPU) train batch size; suits GPUs with limited memory
gradient_accumulation_steps: 8 # 4 for 4 GPUs, 8 for 2 GPUs # gradients from n mini-batches are accumulated before each update, effectively enlarging the batch size
learning_rate: 1.0e-5
num_train_epochs: 4.0 # default 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
# bf16: true
fp16: true
ddp_timeout: 180000000 # DDP (distributed data parallel) timeout, so distributed training is not aborted by timeouts
# resume_from_checkpoint: null # set a checkpoint path to resume training
### eval
val_size: 0.1 # fraction of the dataset used for validation (10%)
per_device_eval_batch_size: 1 # per-device eval batch size, same as training
eval_strategy: steps # evaluate every fixed number of steps
eval_steps: 500 # evaluate every n steps to monitor performance during training; recommended to match save_steps # a final eval also runs automatically after training
# swanlab
use_swanlab: true
swanlab_project: llamafactory # llamafactory for debug
swanlab_run_name: "Qwen2.5-Omni-7B_train_lora_omni_chat_2025-04-19-03-52-19"
# swanlab_workspace: your_workspace
# swanlab_mode: cloud
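For completeness, a sketch of the launch I assume for this config (the YAML file name is a placeholder; `FORCE_TORCHRUN=1 llamafactory-cli train` is the usual LLaMA-Factory multi-GPU entry point, but adjust to however you actually launch):

```bash
# Hedged launch sketch, assuming the config above is saved as
# qwen2_5omni_lora_sft.yaml and only two of the 4090s are used (see note below).
FORCE_TORCHRUN=1 CUDA_VISIBLE_DEVICES=0,1 \
    llamafactory-cli train qwen2_5omni_lora_sft.yaml
```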
Others
No response
SFT with only 2 × 4090.
Hi @Kuangdd01, could you please do me a favor and spare some time to look at this?
I reproduced your experiment with the following config without catching this issue.
Hardware env: 2 × V100
### model
model_name_or_path: ./Qwen2.5-Omni-7B
image_max_pixels: 262144
video_max_pixels: 16384
trust_remote_code: true
### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
lora_target: all
deepspeed: ./examples/deepspeed/ds_z3_config.json
### dataset
dataset: mllm_video_audio_demo
template: qwen2_omni
cutoff_len: 8192
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4
### output
output_dir: saves/qwen2_omni-7b-video/lora/sft
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
### train
use_audio_in_video: true
per_device_train_batch_size: 1
gradient_accumulation_steps: 4
freeze_vision_tower: true
learning_rate: 1.0e-4
num_train_epochs: 25.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
fp16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
Can you provide more info about your setup, such as your dataset info?
Thanks for replying. I can confirm that this issue happens when I add 2 images to my sys_msg, because mllm_video_audio_demo also runs fine for me.
Note that for handling mmdata in sys_msg, I've adopted the changes from #7694.
But Qwen2.5-VL runs fine with those changes. Could you help me check this?
BTW, for some reason I can't share my dataset, but I can reproduce this issue with a slightly modified sample of mllm_video_audio_demo like:
[
{
"messages": [
{
"content": "<video><audio>What is the video describing?",
"role": "system"
},
{
"content": "<video><audio>What is the video describing?",
"role": "user"
},
{
"content": "A girl who is drawing a picture of a guitar and feel nervous.",
"role": "assistant"
}
],
"videos": [
"mllm_demo_data/4.mp4",
"mllm_demo_data/4.mp4"
],
"audios": [
"mllm_demo_data/4.mp3",
"mllm_demo_data/4.mp3"
]
}
]
Save the sample above as mllm_sys_video_audio_demo.json, and remember to add this mllm_sys_video_audio_demo entry to dataset_info.json:
"mllm_sys_video_audio_demo": {
"file_name": "mllm_sys_video_audio_demo.json",
"formatting": "sharegpt",
"columns": {
"messages": "messages",
"videos": "videos",
"audios": "audios"
},
"tags": {
"role_tag": "role",
"content_tag": "content",
"user_tag": "user",
"assistant_tag": "assistant",
"system_tag": "system"
}
}
I double-checked my code changes and the data sample printed after the dataset is loaded, and I can't find any mistakes. Since it only gets stuck in the training stage, I guess it probably has nothing to do with my changes in #7694?
I am unfamiliar with MLLM role-playing, but I guess it is inconsistent with model training because these MLLMs use a fixed textual system prompt (see their chat templates). Can you explain why we can't add this multimodal data to the 'user' role?
For multi-round multi-modal role-playing chat, the system message is the most intuitive place to add the role profile and role image if you want an LLM to act as a role. It also keeps the same format as text LLMs.
For example, the sys_msg contains an image of the role the MLLM needs to act as, as well as an image of the role the MLLM is going to talk to. An example system message looks like:
You need to act as {sys_role_name}, the role information is
{sys_role_profile}
Your role image is:
{sys_role_image}
And I would act as {usr_role_name}. My role image is:
{usr_role_image}
In our dialogue, I would input a video with audio every turn, and ...
Now dialogue begins.
Thanks for your explanation. I will check it later. I just added some prefix images in the user column and restarted this experiment without hitting this issue. I think it should be equivalent to your case.
An alternative way is to put this role instruction in the first user message of a dialogue and have the assistant reply "OK". Then the real chat begins from the second round of the dialogue.
But I'm just wondering how I could modify the code to support mmdata in sys_msg during training.😂
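For concreteness, a rough sketch of what I mean by plan B (the file name, role name, and wording are placeholders reusing the demo assets from above, not a real dataset):

```bash
# Hedged plan-B sketch: the role instruction and its images go into the first
# user turn, the assistant replies "OK", and the real video-audio chat starts
# from the second round.
cat > mllm_planb_demo.json <<'EOF'
[
  {
    "messages": [
      {"role": "user", "content": "<image>You need to act as Alice. My role image is:<image>Now dialogue begins."},
      {"role": "assistant", "content": "OK"},
      {"role": "user", "content": "<video><audio>What is the video describing?"},
      {"role": "assistant", "content": "A girl who is drawing a picture of a guitar and feels nervous."}
    ],
    "images": ["mllm_demo_data/1.jpg", "mllm_demo_data/1.jpg"],
    "videos": ["mllm_demo_data/4.mp4"],
    "audios": ["mllm_demo_data/4.mp3"]
  }
]
EOF
```

It would still need a sharegpt entry in dataset_info.json like the one shown above.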
> Thanks for your explanation. I will check it later. I just added some prefix images in the user column and restarted this experiment without hitting this issue. I think it should be equivalent to your case.
Thank you so much for your consideration!
> An alternative way is to put this role instruction in the first `user` message of a dialogue and have the `assistant` reply "OK". Then the real chat begins from the second round of the dialogue. But I'm just wondering how I could modify the code to support mmdata in sys_msg during training.😂
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|vision_bos|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><
|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE
|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IMAGE|><|IM
AGE|><|IMAGE|><|IMAGE|><|vision_eos|>Who are they? <|vision_bos|><|audio_bos|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|>
<|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDE
O|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|A
UDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|>
<|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDI
O|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|V
IDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|>
<|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDI
O|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|A
UDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|>
<|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|VIDEO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDI
O|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|AUDIO|><|audio_eos|><|vision_eos|>What is the video describing?<|im_end|> <|im_start|>assistant
A girl who is drawing a picture of a guitar and feel nervous.<|im_end|>
Tokenized case above. Can you show your printed case?
Thank you for your experiment. But I still want to know how to support mmdata in sys_msg during training.😂 Is it too hard? If so, I'll consider this plan B:
> An alternative way is to put this role instruction in the first `user` message of a dialogue and have the `assistant` reply "OK". Then the real chat begins from the second round of the dialogue. But I'm just wondering how I could modify the code to support mmdata in sys_msg during training.😂
reproduce it later
Thanks, and I've also reproduced plan B and it runs pretty well.
Now the problem is that I really want Qwen2.5-Omni to support mmdata in sys_msg during training. Could you please give me some hints on how to achieve this, if possible? I could do it on my own if this feature is not planned by LLaMA-Factory.
> BTW, for some reason I can't share my dataset, but I can reproduce this issue with a slightly modified sample of mllm_video_audio_demo like: […] Save the sample above as mllm_sys_video_audio_demo.json, and remember to add this mllm_sys_video_audio_demo entry to dataset_info.json: […]
I rechecked this on my machine and didn't hit the hanging problem with PR #7694 either. TBH, I had forgotten to set CUDA_HOME and the runtime CUDA version didn't match back then, which can also cause the hanging problem.
I'm now converting my videos into vision_only_videos following #7638. I guess the hanging problem with my own dataset is mainly caused by this. I hope to come back with good news.
> I'm now converting my videos into vision_only_videos following #7638. I guess the hanging problem with my own dataset is mainly caused by this. I hope to come back with good news.
Feel free to reopen this issue if any problems remain.
Hi @Kuangdd01, I still hit the hanging problem with multi-turn video-audio samples, regardless of whether the system message contains mmdata, e.g.:
[
{
"system": "What is the video describing?",
"instruction": "What does this girl say?\n<video><audio>",
"output": "She says: 'Hello! Take a look at what am I drawing!'",
"history": [
[
"What does this girl say?\n<video><audio>",
"She says: 'Hello! Take a look at what am I drawing!'"
]
],
"videos": [
"mllm_demo_data/4.mp4",
"mllm_demo_data/4.mp4"
],
"audios": [
"mllm_demo_data/4.mp3",
"mllm_demo_data/4.mp3"
]
},
{
"system": "<image>What is the video describing?<image>",
"instruction": "What does this girl say?\n<video><audio>",
"output": "She says: 'Hello! Take a look at what am I drawing!'",
"history": [
[
"What does this girl say?\n<video><audio>",
"She says: 'Hello! Take a look at what am I drawing!'"
]
],
"images": [
"mllm_demo_data/1.jpg",
"mllm_demo_data/1.jpg"
],
"videos": [
"mllm_demo_data/4.mp4",
"mllm_demo_data/4.mp4"
],
"audios": [
"mllm_demo_data/4.mp3",
"mllm_demo_data/4.mp3"
]
}
]
Remember to add it to dataset_info.json:
"mllm_mmsys_multiturn_video_audio_demo_alpaca": {
"file_name": "mllm_mmsys_multiturn_video_audio_demo_alpaca.json",
"columns": {
"prompt": "instruction",
"response": "output",
"system": "system",
"videos": "videos",
"audios": "audios",
"images": "images",
"history": "history"
}
},
I've confirmed that a single-turn video-audio sample with images in the system message does not hit the hanging problem:
{
"system": "<image>What is the video describing?<image>",
"instruction": "What does this girl say?\n<video><audio>",
"output": "She says: 'Hello! Take a look at what am I drawing!'",
"history": [],
"images": [
"mllm_demo_data/1.jpg",
"mllm_demo_data/1.jpg"
],
"videos": [
"mllm_demo_data/4.mp4"
],
"audios": [
"mllm_demo_data/4.mp3"
]
}
BTW, could you reopen this issue? I'm told I don't have permission to reopen it.
Can you check that your cutoff_len is big enough?
I've set cutoff_len=16384. I also added an extra print in the print_data_example() function and got len(input_ids)=971 for the first sample I gave.
IIRC, cutoff_len takes effect in the data preprocessing stage, doesn't it? If the input is too long, it is truncated before the training stage. I can see a full input sample in the output of print_data_example(), and it then gets stuck in the training stage. So maybe that's not where the problem lies?
I've also tried training Qwen2.5-Omni on multi-turn video samples with use_audio_in_video=false, which ran pretty well.
Can we retry with DeepSpeed disabled? As I observed, peak GPU memory consumption is about 25 GB with bs=4. Dataset: your demo data repeated twice. Configs:
### model
model_name_or_path: ../models/Qwen2.5-Omni-7B
image_max_pixels: 262144
video_max_pixels: 16384
# trust_remote_code: true
### method
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 16
lora_target: all
# deepspeed: ./examples/deepspeed/ds_z3_config.json
### dataset
dataset: test
template: qwen2_omni
cutoff_len: 8192
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4
### output
output_dir: saves/test-qwen/lora/sft
logging_steps: 1
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
### train
use_audio_in_video: true
per_device_train_batch_size: 4
gradient_accumulation_steps: 4
freeze_vision_tower: true
learning_rate: 1.0e-4
num_train_epochs: 25.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
ddp_timeout: 180000000
resume_from_checkpoint: null
> Can we retry with DeepSpeed disabled? As I observed, peak GPU memory consumption is about 25 GB with bs=4. Dataset: your demo data repeated twice. Configs: […]
Not feasible on my 24 GB 4090:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.03 GiB. GPU 0 has a total capacity of 23.65 GiB of which 26.06 MiB is free. Including non-PyTorch memory, this process has 23.62 GiB memory in use. Of the allocated memory 19.87 GiB is allocated by PyTorch, and 3.31 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
Is DeepSpeed the main cause of this issue? Are there other options, like downgrading the DeepSpeed version? Mine is 0.16.4.
I set bs=1 but still get OOM. I'm using a single 4090 GPU, since there is no data or model parallelism with DeepSpeed disabled.
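The only other thing I can think of trying on this card is what the OOM trace itself suggests (just a sketch on my side, not verified on this model):

```bash
# Hedged sketch, taken straight from the OOM message: let the allocator grow
# segments to reduce the ~3 GiB reserved-but-unallocated fragmentation. It
# won't create memory a 24 GB card doesn't have, but may help at the margin.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
```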
Indeed, I have reproduced your case, but only with DeepSpeed ZeRO Stage 3. Full fine-tuning with ZeRO-3 works fine.
> Indeed, I have reproduced your case, but only with DeepSpeed ZeRO Stage 3.
Same as mine!
It still hangs on my own dataset with ds-z3-offload, but runs fine with the mllm_mmsys_multiturn_video_audio_demo_alpaca I gave above.
When I set gradient_accumulation_steps=1, I see these DeepSpeed messages in the first 2 steps (warmup_ratio=0.1 and lr=1e-5), and then the training hangs:
0%| | 0/18 [00:00<?, ?it/s]
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[WARNING|logging.py:329] 2025-04-24 06:50:47,765 >> `use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...
[2025-04-24 06:50:57,820] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
[2025-04-24 06:50:57,822] [INFO] [stage3.py:2017:_loco_err_buf_update] update loco-zero++ error buffer with overflow: True
6%|█████████▋ | 1/18 [00:15<04:17, 15.13s/it]
[2025-04-24 06:51:12,613] [INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, reducing to 32768
[2025-04-24 06:51:12,614] [INFO] [stage3.py:2017:_loco_err_buf_update] update loco-zero++ error buffer with overflow: True
11%|███████████████████▎ | 2/18 [00:29<03:58, 14.92s/it]
Do these indicate exploding or vanishing gradients? But why doesn't it happen when I feed vision-only video samples (no audio) with a higher gradient_accumulation_steps?
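Those OVERFLOW lines come from DeepSpeed's fp16 dynamic loss scaler, so one experiment I could run to take the scaler out of the picture (just an idea, not verified here) is switching from fp16 to bf16, which does not use dynamic loss scaling and which the 4090 supports:

```bash
# Hedged sketch: flip fp16 -> bf16 in the training YAML (placeholder file name
# from earlier) to rule out loss-scale overflow as a factor in the hang.
sed -i 's/^fp16: true/bf16: true/' qwen2_5omni_lora_sft.yaml
```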
I've run py-spy on the hanging training subprocesses (2 GPUs active).
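For reference, this is roughly how I collected the dumps below (the rank PIDs are placeholders, looked up per GPU):

```bash
# Hedged sketch: list the processes attached to each GPU, then dump the Python
# stack of each training rank with py-spy.
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv
py-spy dump --pid <PID_GPU_0>
py-spy dump --pid <PID_GPU_1>
```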
The output of `py-spy dump -p PID_GPU_0`
Thread 7344 (active): "MainThread"
synchronize (torch/cuda/streams.py:224)
fetch_sub_module (deepspeed/runtime/zero/partitioned_param_coordinator.py:332)
decorate_context (torch/utils/_contextlib.py:116)
wrapped_fn (deepspeed/utils/nvtx.py:18)
_fn (torch/_dynamo/eval_frame.py:745)
pre_sub_module_forward_function (deepspeed/runtime/zero/parameter_offload.py:467)
decorate_context (torch/utils/_contextlib.py:116)
_pre_forward_module_hook (deepspeed/runtime/zero/parameter_offload.py:292)
wrapped_fn (deepspeed/utils/nvtx.py:18)
inner (torch/nn/modules/module.py:1785)
_call_impl (torch/nn/modules/module.py:1848)
_wrapped_call_impl (torch/nn/modules/module.py:1739)
forward (peft/tuners/lora/layer.py:727)
inner (torch/nn/modules/module.py:1796)
_call_impl (torch/nn/modules/module.py:1848)
_wrapped_call_impl (torch/nn/modules/module.py:1739)
forward (transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py:1741)
inner (torch/nn/modules/module.py:1796)
_call_impl (torch/nn/modules/module.py:1848)
_wrapped_call_impl (torch/nn/modules/module.py:1739)
forward (transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py:2071)
inner (torch/nn/modules/module.py:1796)
_call_impl (torch/nn/modules/module.py:1848)
_wrapped_call_impl (torch/nn/modules/module.py:1739)
forward (torch/utils/checkpoint.py:264)
apply (torch/autograd/function.py:575)
_wrapped_call_impl (torch/nn/modules/module.py:1739)
compute_loss (transformers/trainer.py:3791)
compute_loss (llamafactory/train/sft/trainer.py:103)
training_step (transformers/trainer.py:3726)
_inner_training_loop (transformers/trainer.py:2564)
train (transformers/trainer.py:2253)
run_sft (llamafactory/train/sft/workflow.py:96)
_training_function (llamafactory/train/tuner.py:72)
run_exp (llamafactory/train/tuner.py:110)
launch (llamafactory/launcher.py:44)
<module> (llamafactory/launcher.py:48)
Thread 7571 (idle): "Thread-3"
wait (threading.py:359)
wait (threading.py:655)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1075)
_bootstrap (threading.py:1032)
Thread 8274 (idle): "Thread-10"
wait (threading.py:359)
wait (threading.py:655)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1075)
_bootstrap (threading.py:1032)
Thread 13762 (idle): "MsgUploader"
new_task (swanlab/data/cloud/start_thread.py:120)
run (threading.py:1012)
_bootstrap_inner (threading.py:1075)
_bootstrap (threading.py:1032)
Thread 13871 (idle): "Thread-16 (_pin_memory_loop)"
select (selectors.py:415)
wait (multiprocessing/connection.py:1136)
_poll (multiprocessing/connection.py:440)
poll (multiprocessing/connection.py:257)
get (multiprocessing/queues.py:113)
do_one_step (torch/utils/data/_utils/pin_memory.py:35)
_pin_memory_loop (torch/utils/data/_utils/pin_memory.py:59)
run (threading.py:1012)
_bootstrap_inner (threading.py:1075)
_bootstrap (threading.py:1032)
Thread 13872 (idle): "QueueFeederThread"
wait (threading.py:355)
_feed (multiprocessing/queues.py:251)
run (threading.py:1012)
_bootstrap_inner (threading.py:1075)
_bootstrap (threading.py:1032)
Thread 13873 (idle): "QueueFeederThread"
wait (threading.py:355)
_feed (multiprocessing/queues.py:251)
run (threading.py:1012)
_bootstrap_inner (threading.py:1075)
_bootstrap (threading.py:1032)
Thread 13874 (idle): "QueueFeederThread"
wait (threading.py:355)
_feed (multiprocessing/queues.py:251)
run (threading.py:1012)
_bootstrap_inner (threading.py:1075)
_bootstrap (threading.py:1032)
Thread 13875 (idle): "QueueFeederThread"
wait (threading.py:355)
_feed (multiprocessing/queues.py:251)
run (threading.py:1012)
_bootstrap_inner (threading.py:1075)
_bootstrap (threading.py:1032)
Thread 14203 (idle)
Thread 14204 (idle)
Thread 15528 (idle): "Thread-24"
wait (threading.py:359)
wait (threading.py:655)
run (threading.py:1431)
_bootstrap_inner (threading.py:1075)
_bootstrap (threading.py:1032)
The output of `py-spy dump -p PID_GPU_1`
Thread 7345 (active): "MainThread"
_all_gather_dtype (deepspeed/runtime/zero/partition_parameters.py:1186)
all_gather_coalesced (deepspeed/runtime/zero/partition_parameters.py:1320)
wrapped_fn (deepspeed/utils/nvtx.py:18)
__all_gather_params_ (deepspeed/runtime/zero/partitioned_param_coordinator.py:500)
__all_gather_params (deepspeed/runtime/zero/partitioned_param_coordinator.py:471)
wrapped_fn (deepspeed/utils/nvtx.py:18)
fetch_sub_module (deepspeed/runtime/zero/partitioned_param_coordinator.py:314)
decorate_context (torch/utils/_contextlib.py:116)
wrapped_fn (deepspeed/utils/nvtx.py:18)
_fn (torch/_dynamo/eval_frame.py:745)
pre_sub_module_forward_function (deepspeed/runtime/zero/parameter_offload.py:467)
decorate_context (torch/utils/_contextlib.py:116)
_pre_forward_module_hook (deepspeed/runtime/zero/parameter_offload.py:292)
wrapped_fn (deepspeed/utils/nvtx.py:18)
inner (torch/nn/modules/module.py:1785)
_call_impl (torch/nn/modules/module.py:1848)
_wrapped_call_impl (torch/nn/modules/module.py:1739)
forward (transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py:1196)
inner (torch/nn/modules/module.py:1796)
_call_impl (torch/nn/modules/module.py:1848)
_wrapped_call_impl (torch/nn/modules/module.py:1739)
forward (transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py:1339)
inner (torch/nn/modules/module.py:1796)
_call_impl (torch/nn/modules/module.py:1848)
_wrapped_call_impl (torch/nn/modules/module.py:1739)
custom_gradient_checkpointing_func (llamafactory/model/model_utils/checkpointing.py:99)
forward (transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py:1548)
inner (torch/nn/modules/module.py:1796)
_call_impl (torch/nn/modules/module.py:1848)
_wrapped_call_impl (torch/nn/modules/module.py:1739)
forward (transformers/models/qwen2_5_omni/modeling_qwen2_5_omni.py:2669)
forward (peft/tuners/tuners_utils.py:193)
inner (torch/nn/modules/module.py:1796)
_call_impl (torch/nn/modules/module.py:1848)
_bootstrap (threading.py:1032)
Thread 8273 (idle): "Thread-4"
wait (threading.py:359)
wait (threading.py:655)
run (tqdm/_monitor.py:60)
_bootstrap_inner (threading.py:1075)
_bootstrap (threading.py:1032)
Thread 13757 (idle): "Thread-5 (_pin_memory_loop)"
select (selectors.py:415)
wait (multiprocessing/connection.py:1136)
_poll (multiprocessing/connection.py:440)
poll (multiprocessing/connection.py:257)
get (multiprocessing/queues.py:113)
do_one_step (torch/utils/data/_utils/pin_memory.py:35)
_pin_memory_loop (torch/utils/data/_utils/pin_memory.py:59)
run (threading.py:1012)
_bootstrap_inner (threading.py:1075)
_bootstrap (threading.py:1032)
Thread 13758 (idle): "QueueFeederThread"
wait (threading.py:355)
_feed (multiprocessing/queues.py:251)
run (threading.py:1012)
_bootstrap_inner (threading.py:1075)
_bootstrap (threading.py:1032)
Thread 13759 (idle): "QueueFeederThread"
wait (threading.py:355)
_feed (multiprocessing/queues.py:251)
run (threading.py:1012)
_bootstrap_inner (threading.py:1075)
_bootstrap (threading.py:1032)
Thread 13760 (idle): "QueueFeederThread"
wait (threading.py:355)
_feed (multiprocessing/queues.py:251)
run (threading.py:1012)
_bootstrap_inner (threading.py:1075)
_bootstrap (threading.py:1032)
Thread 13761 (idle): "QueueFeederThread"
wait (threading.py:355)
_feed (multiprocessing/queues.py:251)
run (threading.py:1012)
_bootstrap_inner (threading.py:1075)
_bootstrap (threading.py:1032)
Thread 14202 (idle)
Thread 14201 (idle)