CosyVoice: How to train with instruction text

From what I can see, the LibriTTS example doesn't cover training with instruction text; it only includes the TTS text. Could you show how this could be done for a new dataset? Thanks

May 17 '25 15:05 Ferdydh

I'd assume by putting: <|endofprompt|>

but would a prompt without the explicit splitting via <|endofprompt|> here work?

For example, an entry from Speechcraft's dataset: In the category of Relationships and Politics, reflecting on her curiosity, a calm adult female with high pitch and low volume ponders: "What could it contain?" Speaking at a slower pace, she ponders the possibilities.

May 18 '25 13:05 Ferdydh

Yes. If you want to train an instruct TTS model, use prompt_text<|endofprompt|>tts_text in the prepared data, following the cosyvoice.inference_instruct2 data format.
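To make the format concrete, here is a minimal sketch of building the combined text for one utterance. It assumes the prepared data uses a Kaldi-style `text` file with one `utt_id text` entry per line, as in the LibriTTS example; the helper names `build_instruct_text` and `to_text_line` and the sample utterance ID are hypothetical and not part of CosyVoice.

```python
# Sketch only: helper names and the "utt_id text" layout are assumptions,
# not CosyVoice API. Only the <|endofprompt|> separator comes from the thread.
END_OF_PROMPT = "<|endofprompt|>"


def build_instruct_text(instruct_text: str, tts_text: str) -> str:
    """Join instruction and TTS text as prompt_text<|endofprompt|>tts_text."""
    return f"{instruct_text.strip()}{END_OF_PROMPT}{tts_text.strip()}"


def to_text_line(utt_id: str, instruct_text: str, tts_text: str) -> str:
    """Format one entry for a Kaldi-style `text` file (assumed layout)."""
    return f"{utt_id} {build_instruct_text(instruct_text, tts_text)}"


line = to_text_line(
    "utt_0001",  # hypothetical utterance ID
    "A calm adult female speaks slowly.",
    "What could it contain?",
)
print(line)
```

With a dataset like Speechcraft, where the description and the spoken sentence arrive as one string, this would mean splitting out the spoken sentence first and using the remaining description as the instruction.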

May 26 '25 03:05 aluminumbox

@aluminumbox so it won't work without explicit <|endofprompt|>?

May 26 '25 10:05 Ferdydh

Could someone share whether training without the explicit separator worked well for them? That'd be great.

Jun 05 '25 21:06 Ferdydh

> yes, if you want to train a instruct tts model, use prompt_text<|endofprompt|>tts_text in the prepared data, follow cosyvoice.inference_instruct2 data format

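For streaming training, shouldn't the prompt_text<|endofprompt|> part be encoded as a whole and then prepended to every chunk? From what I can see, the current code just treats it as ordinary text and splits it into chunks together with the rest. Doesn't that make training and inference inconsistent?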
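The two behaviours being contrasted can be sketched with toy token lists. This is only an illustration of the question, not CosyVoice code: the function names, the integer token IDs, and the chunk size are all made up, and it makes no claim about what the repository actually does.

```python
from typing import List


def chunk(tokens: List[int], size: int) -> List[List[int]]:
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def chunk_with_prompt_prefix(
    prompt_tokens: List[int], tts_tokens: List[int], size: int
) -> List[List[int]]:
    """Chunk only the TTS tokens, then prepend the whole prompt to each chunk."""
    return [prompt_tokens + c for c in chunk(tts_tokens, size)]


def chunk_as_plain_text(
    prompt_tokens: List[int], tts_tokens: List[int], size: int
) -> List[List[int]]:
    """Concatenate prompt and TTS tokens, then chunk the result as one sequence."""
    return chunk(prompt_tokens + tts_tokens, size)


prompt = [101, 102, 103]  # toy IDs standing in for prompt_text<|endofprompt|>
tts = [1, 2, 3, 4, 5]     # toy IDs standing in for tts_text

prefixed = chunk_with_prompt_prefix(prompt, tts, size=2)
plain = chunk_as_plain_text(prompt, tts, size=2)
print(prefixed)  # every chunk starts with the full prompt
print(plain)     # the prompt is split across the first chunks
```

In the second variant only the first chunks ever see prompt tokens, which is the mismatch the question is asking about.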

Jun 19 '25 19:06 jokerlj92