Description

🎉 We added SFT training support for Qwen2.5-Omni within 1 hour! Here are the training screenshots 👇

Test

Please test your changes by running the following command:

cd scripts
bash test/test_text_to_text.sh ./opt PATH_TO_OUTPUT_ROOT_DIR

Here, ./opt is the directory containing the test scripts for the opt model, and PATH_TO_OUTPUT_ROOT_DIR is the path to the output root directory. The test scripts save about 1GB of data to the output root directory and delete it after the test, so please make sure the disk has enough free space.
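As a quick pre-flight check before running the test script, the free space under the output root can be verified with the Python standard library. This is a minimal sketch: the function name `has_enough_space` and the 1 GiB threshold are illustrative assumptions based on the "~1GB" figure above, not part of the repository.

```python
import shutil

# Assumed threshold: roughly the ~1GB the test scripts are said to write.
REQUIRED_BYTES = 1 * 1024**3


def has_enough_space(path: str, required_bytes: int = REQUIRED_BYTES) -> bool:
    """Return True if the filesystem holding `path` has at least `required_bytes` free.

    `path` must already exist, otherwise shutil.disk_usage raises an error.
    """
    return shutil.disk_usage(path).free >= required_bytes
```

For example, calling `has_enough_space("PATH_TO_OUTPUT_ROOT_DIR")` before launching the test tells you whether the temporary data is likely to fit.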

Lint

Please run the following command in the root directory to check your code style:

pip install pre-commit
pre-commit run --all-files

Types of changes

What types of changes does your code introduce? Put an x in all the boxes that apply:

[ ] Bug fix (non-breaking change which fixes an issue)
[x] New feature (non-breaking change which adds core functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to change)
[ ] Documentation (update in the documentation)

Checklist

Go over all the following points, and put an x in all the boxes that apply. If you are unsure about any of these, don't hesitate to ask. We are here to help!

[x] I have read the CONTRIBUTION guide. (required)
[x] My change requires a change to the documentation.
[x] I have updated the tests accordingly. (required for a bug fix or a new feature)
[x] I have updated the documentation accordingly.

Mar 26 '25 18:03 Gaiejj

Can t2s be supported?

Mar 27 '25 06:03 deyituo

We are working hard on the Talker module; fine-tuning with text-audio input should be ready today or tomorrow.

Mar 27 '25 09:03 Gaiejj

Are there plans to support full fine-tuning across three modalities (text system prompt, image, spoken instruction)?

Mar 27 '25 09:03 DQYZHWK

@DQYZHWK We are very interested in doing this, but we lack suitable data. Do you have any references?

Mar 27 '25 11:03 Gaiejj

@Gaiejj Is joint training on audio and images supported, that is, a single batch containing both speech and images?

Mar 27 '25 12:03 Alex-Songs

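@DQYZHWK We are very interested in doing this, but we lack suitable data. Do you have any references?

Sorry, I don't have a relevant dataset. However, you can refer to this article: https://mp.weixin.qq.com/s/hJ5x8xUstBjwNZc1mmqE-g. You could convert a VQA dataset into an SQA dataset using TTS (chattts, fishspeech). I look forward to seeing this demo integrated in the future.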
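The VQA-to-SQA conversion suggested here could be sketched as follows. Everything in this block is an assumption made for illustration: the record fields (`image`, `question`, `answer`, `question_audio`), the function name `vqa_to_sqa`, and the `synthesize` callable, which stands in for whatever TTS engine is used (for example a wrapper around chattts or fishspeech) and is expected to return encoded audio bytes for a text string.

```python
from pathlib import Path
from typing import Callable, Dict, List


def vqa_to_sqa(
    vqa_records: List[Dict[str, str]],
    synthesize: Callable[[str], bytes],
    audio_dir: str,
) -> List[Dict[str, str]]:
    """Turn text questions into spoken questions, keeping image and answer.

    `synthesize` is a hypothetical TTS wrapper supplied by the caller.
    """
    out_dir = Path(audio_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    sqa_records = []
    for i, rec in enumerate(vqa_records):
        # One audio file per question; the naming scheme is arbitrary.
        audio_path = out_dir / f"question_{i:06d}.wav"
        audio_path.write_bytes(synthesize(rec["question"]))
        sqa_records.append(
            {
                "image": rec["image"],
                "question_audio": str(audio_path),
                "answer": rec["answer"],
            }
        )
    return sqa_records
```

Passing the TTS engine in as a callable keeps the sketch independent of any particular library, so the same loop works whichever engine produces the audio.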

Mar 27 '25 13:03 DQYZHWK

@DQYZHWK @Alex-Songs Thanks for the recommendation. We will try this kind of tri-modal fine-tuning soon!

Mar 27 '25 13:03 Gaiejj

Has the script for fine-tuning qwen2.5-omni been removed?

Mar 28 '25 09:03 zuitbjc1096

@zuitbjc1096 Hello, the code is here: https://github.com/Gaiejj/align-anything/tree/dev-omni

Mar 28 '25 11:03 Gaiejj

@Gaiejj It seems the transformers library used by qwen2.5-omni added a tp_plan, which requires torch>=2.5. Does the fine-tuning code currently require torch>=2.5 as well?

Mar 29 '25 13:03 Alex-Songs

@Alex-Songs Yes, you need to follow the official Qwen-2.5-Omni dependencies.
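A small check for the torch>=2.5 requirement discussed here might look like the sketch below. The helper names `meets_minimum` and `installed_torch_ok` are illustrative assumptions; only the standard-library `importlib.metadata` lookup is used, so torch itself is not imported.

```python
import re
from importlib import metadata


def meets_minimum(version: str, minimum: tuple = (2, 5)) -> bool:
    """Compare the leading major.minor of a version string against `minimum`."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


def installed_torch_ok() -> bool:
    """Return True if a torch distribution is installed and is at least 2.5."""
    try:
        return meets_minimum(metadata.version("torch"))
    except metadata.PackageNotFoundError:
        return False
```

Comparing integer tuples rather than raw strings avoids the usual pitfall where "2.10" sorts before "2.5" lexicographically.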

Mar 31 '25 08:03 Gaiejj

@Gaiejj One more question: the qwen2.5-omni-7b weights are named like thinker.visual.blocks.11.attn.proj.weight. If I load the thinker directly, do they need to be renamed to visual.blocks.11.attn.proj.weight?
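If the renaming asked about here turns out to be necessary, it amounts to stripping a key prefix from the checkpoint's state dict. The sketch below is an assumption-laden illustration, not the project's method: `strip_prefix` is a hypothetical helper, and it deliberately drops keys that do not start with the prefix, on the assumption that only the thinker weights are wanted.

```python
from typing import Dict, TypeVar

T = TypeVar("T")


def strip_prefix(state_dict: Dict[str, T], prefix: str = "thinker.") -> Dict[str, T]:
    """Return a new dict holding only keys that start with `prefix`, with it removed.

    Keys without the prefix are dropped (assumed to belong to other submodules).
    """
    return {
        key[len(prefix):]: value
        for key, value in state_dict.items()
        if key.startswith(prefix)
    }
```

With this helper, a key such as `thinker.visual.blocks.11.attn.proj.weight` becomes `visual.blocks.11.attn.proj.weight`, which matches the naming in the question.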

Mar 31 '25 13:03 Alex-Songs

I'd like to ask: can this model be fine-tuned directly without TMRoPE being implemented? Thanks.

Apr 02 '25 08:04 shanhaidexiamo

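Hello, I looked through the qwen2.5-omni code. To train the talker, building the training data requires a speech tokenizer that first tokenizes the speech data into speech codec ids, but the speech tokenizer does not appear to be open-sourced. How do you handle this on your side?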

Apr 02 '25 13:04 liu6381810

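I'm really not an expert, orz. We ran into these problems too during our recent implementation work, and I think @Alex-Songs is right. We will post an update here as soon as we make progress.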

Apr 02 '25 14:04 Gaiejj

@Alex-Songs Yes, you need to follow the official Qwen-2.5-Omni dependencies.

The tp_plan parameter causes an error when pretraining the model: "raise NotImplementedError("This model does not have a tensor parallel plan.")". Have you run into this?

Apr 07 '25 13:04 sky1170447398

Do you have steps to reproduce this? We can help resolve it.

Apr 07 '25 16:04 Gaiejj

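Is fine-tuning with video + audio (the audio track in the video) + prompt supported now? From what I can see, the official code provides image + prompt. Thanks a lot.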

Apr 08 '25 03:04 jiahui-w

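We are working hard on the Talker module; fine-tuning with text-audio input should be ready today or tomorrow.

Can the talker module currently support changing the voice timbre, for example by fine-tuning with other voices?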

Apr 08 '25 09:04 Kingdroper

mark

Apr 10 '25 06:04 zzchust

Like @liu6381810 mentioned, I faced the same issue and posted about it [here](https://huggingface.co/Qwen/Qwen2.5-Omni-7B/discussions/40) for the author's attention. However, there might be a reason why they haven't disclosed their speech tokenizer. Given that, I'm not currently expecting them to release it. It seems we'll likely need to train the talker component from scratch using our own voice data.

Apr 15 '25 05:04 SeungyounShin

Is there any progress on fine-tuning the talker part?

Apr 28 '25 04:04 pjgao

CosyVoice also uses a speech tokenizer architecture. Maybe we can refer to it.

Apr 30 '25 07:04 dongkeun-livetoon

Does this currently support fine-tuning for (system prompt + text instruction + speech --> text)?

May 20 '25 07:05 wwfcnu

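We are working hard on the Talker module; fine-tuning with text-audio input should be ready today or tomorrow.

I only see fine-tuning with text-image input in the code.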

May 20 '25 07:05 wwfcnu

We are working hard on the Talker module; fine-tuning with text-audio input should be ready today or tomorrow.

The code still has no fine-tuning for audio.

Jul 04 '25 11:07 candle1220

Hey all! We sincerely apologize for our initial misestimation of the progress timeline and the delayed response! During this period, we attempted to fine-tune the text-to-audio-to-text and text-to-audio functionalities. However, due to the exceptionally advanced architecture of qwen2.5-omni, our academic team lacked the necessary engineering expertise, which resulted in the abnormally poor performance of the trained models. This is the primary reason for our prolonged silence.

We are continuing our efforts and will promptly report any breakthroughs. We also welcome community contributions through implementation references, which we will integrate into align-anything.

Once again, our deepest apologies.

Jul 04 '25 16:07 Gaiejj

feat: support qwen_2_5_omni fine-tuning
