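MiniCPM-V 💡 [REQUEST] - Fine-tuning recipe with audio support

Start Date

No response

Implementation PR

Will an audio-to-text fine-tuning recipe for MiniCPM-O be released later? For now, all I can do is follow the processing flow in model_server and try to preprocess the audio the same way it is handled at inference time.

Summary

Basic Example

Drawbacks

Unresolved questions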

Jan 17 '25 09:01 Lingeng56

😄 Hey! I think our framework, align-anything, has implemented this functionality. We have fine-tuned it on our open-source align-anything/text-audio-to-text dataset and provided a directly runnable script. Everyone is welcome to use it!

Jan 18 '25 14:01 Gaiejj

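Hello, glad to hear you are interested in fine-tuning. The audio-to-text fine-tuning recipe differs very little from the image-to-text one, so the cost of adapting it is small. We will provide example code next week.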

Jan 18 '25 15:01 Cuiunbo

LLaMA-Factory now supports audio-text-to-text fine-tuning and inference; you can also try it 🤗

https://github.com/hiyouga/LLaMA-Factory/pull/6701

Jan 19 '25 07:01 BUAADreamer

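I see that in the model architecture the audio encoder appears to be separate from Qwen. If my data includes the text corresponding to each input audio clip, could I simply do text-to-text SFT directly?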

Jan 20 '25 03:01 Lingeng56

Hello, that approach may leave the model unaligned when the input is audio. You can try fine-tuning with https://github.com/hiyouga/LLaMA-Factory/pull/6701, which already supports audio-to-text.

Jan 21 '25 03:01 Cuiunbo

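Hello, a question: with multiple audio inputs, for example one clip used for voice cloning and another whose voice should be changed, roughly what would the fine-tuning data JSON look like?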

Jan 22 '25 06:01 uangshiyon

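Hello, I am fine-tuning with the branch where you added support, and I have a question: can I use my own custom system prompt as the system prompt for fine-tuning? For example, I want the fine-tuned model to translate whatever I input into English, and I would like to change the system prompt so it differs from the one used for normal QA. Is that possible?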

Jan 24 '25 08:01 Lingeng56

I tried running the LLaMA-Factory code and it does not work: it fails with an error because the processor passed in is None. Also, the LLaMA-Factory main branch has left your PR pending.

Jan 24 '25 09:01 Lingeng56

😄 Hey! I think our framework, align-anything, has implemented this functionality. We have fine-tuned it on our open-source align-anything/text-audio-to-text dataset and provided a directly runnable script. Everyone is welcome to use it!

Do you support LoRA SFT? So far I do not see a LoRA option in the sft.py source.

Jan 24 '25 09:01 Lingeng56

It has not been tested yet; support is coming soon. In the meantime, you can try full-parameter fine-tuning.

Jan 24 '25 09:01 Gaiejj

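Hello, I am fine-tuning with the branch where you added support, and I have a question: can I use my own custom system prompt as the system prompt for fine-tuning? For example, I want the fine-tuned model to translate whatever I input into English, and I would like to change the system prompt so it differs from the one used for normal QA. Is that possible?

Of course it is supported. You can add your template directly here, https://github.com/BUAADreamer/LLaMA-Factory/blob/13d252fa7856ecb14ba6907e5adb10070e5cdde4/src/llamafactory/data/template.py#L958, by adding the following lines: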

_register_template(
    name="minicpm_o_audio",
    format_user=StringFormatter(slots=["<|im_start|>user\n{{content}}<|im_end|>\n<|im_start|>assistant\n"]),
    format_assistant=StringFormatter(slots=["{{content}}<|im_end|>\n"]),
    format_system=StringFormatter(slots=["<|im_start|>system\n{{content}}<|im_end|>\n"]),
    stop_words=["<|im_end|>"],
    default_system=(
        # Chinese prompt: "Whatever the input is, translate it directly into English"
        "不管输入什么都需要直接翻译为英文"
    ),
    mm_plugin=get_mm_plugin(name="minicpm_v", image_token="<image>", video_token="<video>"),
)

Then set template: minicpm_o_audio in the YAML file.
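
To see what prompt text such a template yields, the slot strings can be expanded by hand. The sketch below is not LLaMA-Factory code: `fill` and `render_prompt` are hypothetical helpers that only imitate the `{{content}}` substitution for a single system + user turn, reusing the slot strings from the template above. The real formatter may handle more cases (multi-turn history, multimodal placeholders).

```python
# Hypothetical helpers, not part of LLaMA-Factory: they imitate how the
# slot strings from the template above expand for one system + user turn.
SYSTEM_SLOT = "<|im_start|>system\n{{content}}<|im_end|>\n"
USER_SLOT = "<|im_start|>user\n{{content}}<|im_end|>\n<|im_start|>assistant\n"
ASSISTANT_SLOT = "{{content}}<|im_end|>\n"


def fill(slot: str, content: str) -> str:
    """Substitute the {{content}} placeholder in one slot string."""
    return slot.replace("{{content}}", content)


def render_prompt(system: str, user: str, assistant: str = "") -> str:
    """Concatenate the system and user segments, plus the assistant reply if given."""
    text = fill(SYSTEM_SLOT, system) + fill(USER_SLOT, user)
    if assistant:
        text += fill(ASSISTANT_SLOT, assistant)
    return text


if __name__ == "__main__":
    # The system prompt is the default_system string from the template above.
    print(render_prompt("不管输入什么都需要直接翻译为英文", "你好，世界"))
```

Printing the rendered text is a quick way to confirm that the custom system prompt lands in the system segment rather than in the user turn.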

Jan 24 '25 10:01 BUAADreamer

😄 Hey! I think our framework, align-anything, has implemented this functionality. We have fine-tuned it on our open-source align-anything/text-audio-to-text dataset and provided a directly runnable script. Everyone is welcome to use it!

Is there a tutorial showing how to use my own dataset? Right now I can only read the code to work out how to use my own data for training.

Jan 24 '25 10:01 Lingeng56

Hello, I ran your modified LLaMA-Factory branch and it raises an error saying the processor is NoneType.

Jan 24 '25 10:01 Lingeng56

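For now, transformers==4.45.0 is recommended; fine-tuning and inference both run reliably with it.

[Important] Install the latest llamafactory and the corresponding libraries as follows: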

git clone --depth 1 https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
pip install -e ".[torch,metrics,deepspeed,minicpm_v]"
pip3 install transformers==4.45.0
pip3 install huggingface_hub==0.25.0
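
After installing, it may be worth confirming that the pinned versions are the ones actually active in the environment, since a later install can silently upgrade them. A minimal sketch using only the standard library follows; `PINS` and `check_pins` are names introduced here for illustration, and the pins are copied from the commands above.

```python
from importlib import metadata

# Pins copied from the install commands above.
PINS = {"transformers": "4.45.0", "huggingface_hub": "0.25.0"}


def check_pins(pins, get_version=metadata.version):
    """Return {package: problem} for every package that is missing or off-pin."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = get_version(name)
        except metadata.PackageNotFoundError:
            problems[name] = "not installed"
            continue
        if found != wanted:
            problems[name] = f"found {found}, expected {wanted}"
    return problems


if __name__ == "__main__":
    # Prints one line per mismatch; prints nothing when every pin matches.
    for name, problem in check_pins(PINS).items():
        print(f"{name}: {problem}")
```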

Jan 24 '25 12:01 BUAADreamer

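Is there a tutorial showing how to use my own dataset? Right now I can only read the code to work out how to use my own data for training.

Actually, the README and the documentation homepage already include examples. Could you check whether they meet your needs?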

Jan 24 '25 19:01 Gaiejj
