
feat(vLLM): support async generator

Open fengyizhu opened this issue 1 year ago • 15 comments

I have provided a draft version of the vLLM integration, with the following key changes:

  1. Protocol changes: aligned with the OpenAI /v1/audio/speech endpoint.
  2. Removed the custom inference code, keeping the inference logic consistent between streaming and non-streaming.
  3. Addressed some potential inconsistencies caused by streaming inference.

These changes may have a significant impact. Feel free to leave comments to guide me in further improvements.

fengyizhu avatar Sep 09 '24 02:09 fengyizhu
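For reference, a client call against an endpoint aligned with the OpenAI /v1/audio/speech protocol might look roughly like the minimal sketch below; the host, port, model name, and voice id are placeholders, not values taken from this PR.

```python
import requests

# Minimal client-side sketch of a call against an OpenAI-compatible
# /v1/audio/speech endpoint. Host, port, model name, and voice id are
# placeholders, not values taken from this PR.
resp = requests.post(
    "http://localhost:8000/v1/audio/speech",
    json={
        "model": "chattts",        # placeholder model name
        "input": "Hello from ChatTTS with vLLM acceleration.",
        "voice": "default",        # placeholder voice id
        "response_format": "wav",
    },
    stream=True,
)
resp.raise_for_status()

# Write the (possibly chunked) audio response to disk.
with open("speech.wav", "wb") as f:
    for chunk in resp.iter_content(chunk_size=4096):
        if chunk:
            f.write(chunk)
```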

This version is incredibly powerful. Testing on a 4090, the streaming API returns the first chunk in 65 ms, and the timbre can be kept fixed. It's so fast, so fast, that I suspected my computer was broken.

IrisSally avatar Sep 09 '24 04:09 IrisSally

Thanks for sharing. Can this version keep the same timbre as the original version? This is mentioned in issue #640.

ZaymeShaw avatar Sep 09 '24 05:09 ZaymeShaw

Thanks for sharing. Can this version keep the same timbre as the original version? This is mentioned in issue #640.

It is fully compatible in principle.

fengyizhu avatar Sep 09 '24 05:09 fengyizhu

This version is incredibly powerful. Testing on a 4090, the streaming API returns the first chunk in 65 ms, and the timbre can be kept fixed. It's so fast, so fast, that I suspected my computer was broken.

On a 4090, for text that takes 3.3 s with compile=True, the main-branch vLLM acceleration takes 1.6 s but the timbre cannot be adjusted; with this PR's version the timbre can be adjusted, but it only gets down to about 2.6 s.

LLongIsland avatar Sep 09 '24 07:09 LLongIsland

I am curious why we chose to move the vLLM code into this project instead of supporting the LLM part in vLLM itself. Supporting the model in vLLM would automatically enable features such as streaming generation and continuous batching.

Also pinging @fumiama on this question.

niuzheng168 avatar Sep 19 '24 08:09 niuzheng168

I am curious why we chose to move the vLLM code into this project instead of supporting the LLM part in vLLM itself.

Supporting the model in vLLM would automatically enable features such as streaming generation and continuous batching.

Also pinging @fumiama on this question.

This part of the code was contributed by the community, so you can ask @ylzz1997 about that. I'm sorry, but I'm not familiar with vLLM 😂.

fumiama avatar Sep 19 '24 08:09 fumiama

I am curious why we chose to move the vLLM code into this project instead of supporting the LLM part in vLLM itself. Supporting the model in vLLM would automatically enable features such as streaming generation and continuous batching.

Also pinging @fumiama on this question.

Because vLLM doesn't support some features:

  1. custom lm-head
  2. multi-codebook sampler (custom sampler)
  3. sampling without a tokenizer

In order for the sampler to support vLLM features such as continuous batching and streaming generation, it is necessary to package the post-model (lm_head and others) into the vLLM scheduler. This requires rewriting the vLLM part of the code.

That's all.

ylzz1997 avatar Sep 19 '24 08:09 ylzz1997
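To make the "custom lm-head" and "multi-codebook sampler" points above concrete, here is a minimal, hypothetical PyTorch sketch (not the actual ChatTTS or vLLM code; all names and sizes are invented): the hidden state of one decoding step is projected through several heads, and one token is sampled per codebook, so each step yields a token list rather than a single id.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the "multi-codebook" post-model described above.
# All names and sizes are invented; this is not the actual ChatTTS or vLLM code.
class MultiCodebookHead(nn.Module):
    def __init__(self, hidden_size: int, codebook_size: int, num_codebooks: int):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(hidden_size, codebook_size, bias=False)
            for _ in range(num_codebooks)
        )

    def forward(self, hidden: torch.Tensor) -> list[torch.Tensor]:
        # hidden: [batch, hidden_size] -> one logits tensor per codebook
        return [head(hidden) for head in self.heads]


def sample_per_codebook(logits_list, temperature: float = 1.0) -> torch.Tensor:
    # Sample one token id per codebook. The result is a token *list* per step,
    # which is exactly what a single-token-per-step scheduler does not expect.
    tokens = [
        torch.multinomial(torch.softmax(logits / temperature, dim=-1), num_samples=1)
        for logits in logits_list
    ]
    return torch.cat(tokens, dim=-1)  # [batch, num_codebooks]


# Usage with made-up sizes: 4 codebooks, 1024 entries each, 768-dim hidden state.
head = MultiCodebookHead(hidden_size=768, codebook_size=1024, num_codebooks=4)
hidden = torch.randn(2, 768)                    # batch of 2 decoding steps
print(sample_per_codebook(head(hidden)).shape)  # torch.Size([2, 4])
```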

I am curious why we chose to move the vLLM code into this project instead of supporting the LLM part in vLLM itself. Supporting the model in vLLM would automatically enable features such as streaming generation and continuous batching. Also pinging @fumiama on this question.

Because vLLM doesn't support some features:

  1. custom lm-head
  2. multi-codebook sampler (custom sampler)
  3. sampling without a tokenizer

In order for the sampler to support vLLM features such as continuous batching and streaming generation, it is necessary to package the post-model (lm_head and others) into the vLLM scheduler. This requires rewriting the vLLM part of the code.

That's all.

Thanks for your explanation.

fumiama avatar Sep 20 '24 07:09 fumiama

I am curious why we chose to move the vLLM code into this project instead of supporting the LLM part in vLLM itself. Supporting the model in vLLM would automatically enable features such as streaming generation and continuous batching. Also pinging @fumiama on this question.

Because vLLM doesn't support some features:

  1. custom lm-head
  2. multi-codebook sampler (custom sampler)
  3. sampling without a tokenizer

In order for the sampler to support vLLM features such as continuous batching and streaming generation, it is necessary to package the post-model (lm_head and others) into the vLLM scheduler. This requires rewriting the vLLM part of the code.

That's all.

Yes, we cannot use vLLM directly, as it requires some code changes. I have implemented a solution here (for the LLaMA part only); I am new to Torch and Python, so feel free to leave any comments. Model: https://github.com/niuzheng168/vllm/blob/dev/vllm/model_executor/models/chattts.py Sample usage: https://github.com/niuzheng168/vllm/blob/dev/chattts_sample.py

For the issues you list above:

  • We can create multiple lm_heads.
  • I noticed it runs slowly when we run sampling N times, so I made a lite version of the multi-head sampler.
  • This is already supported in vLLM by setting detokenize=False.

One of the main challenges is that vLLM assumes the model output at each step is a single token, which is just an int value. However, a TTS system, whether chattts or fishtts, generates multi-head tokens in one decoding step. This means the model output is a token list, breaking that fundamental design. I had to use many if/else statements to keep the whole pipeline working.

Overall, compared with moving the vLLM code here, implementing the model in vLLM saves effort on everything other than core model inference, such as sampling, scheduling, continuous batching, etc. I have also replied in the vLLM roadmap thread to see whether vLLM can officially support the model; I believe more and more multi-modal models will use a similar architecture, especially GPT-4o-like models.

niuzheng168 avatar Sep 29 '24 15:09 niuzheng168
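As a small illustration of the detokenize point above, skipping detokenization in stock vLLM can be sketched roughly as follows; the model name is a placeholder, and the flag's availability depends on the vLLM version.

```python
from vllm import LLM, SamplingParams

# Placeholder model name; the point is only the detokenize flag, which asks
# vLLM to return raw token ids without decoding them back to text.
llm = LLM(model="facebook/opt-125m")

params = SamplingParams(
    temperature=0.7,
    max_tokens=64,
    detokenize=False,  # skip text decoding; downstream code only needs token ids
)

outputs = llm.generate(["Hello"], params)
for out in outputs:
    print(out.outputs[0].token_ids)  # raw ids (no decoded text)
```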

I am curious why we chose to move the vLLM code into this project instead of supporting the LLM part in vLLM itself. Supporting the model in vLLM would automatically enable features such as streaming generation and continuous batching. Also pinging @fumiama on this question.

Because vLLM doesn't support some features:

  1. custom lm-head
  2. multi-codebook sampler (custom sampler)
  3. sampling without a tokenizer

In order for the sampler to support vLLM features such as continuous batching and streaming generation, it is necessary to package the post-model (lm_head and others) into the vLLM scheduler. This requires rewriting the vLLM part of the code. That's all.

Yes, we cannot use vLLM directly, as it requires some code changes. I have implemented a solution here (for the LLaMA part only); I am new to Torch and Python, so feel free to leave any comments. Model: https://github.com/niuzheng168/vllm/blob/dev/vllm/model_executor/models/chattts.py Sample usage: https://github.com/niuzheng168/vllm/blob/dev/chattts_sample.py

For the issues you list above:

  • We can create multiple lm_heads.
  • I noticed it runs slowly when we run sampling N times, so I made a lite version of the multi-head sampler.
  • This is already supported in vLLM by setting detokenize=False.

One of the main challenges is that vLLM assumes the model output at each step is a single token, which is just an int value. However, a TTS system, whether chattts or fishtts, generates multi-head tokens in one decoding step. This means the model output is a token list, breaking that fundamental design. I had to use many if/else statements to keep the whole pipeline working.

Overall, compared with moving the vLLM code here, implementing the model in vLLM saves effort on everything other than core model inference, such as sampling, scheduling, continuous batching, etc. I have also replied in the vLLM roadmap thread to see whether vLLM can officially support the model; I believe more and more multi-modal models will use a similar architecture, especially GPT-4o-like models.

Thanks for your great effort. We welcome you to contribute your code to this repo if you like.

fumiama avatar Oct 01 '24 05:10 fumiama

This version is incredibly powerful. Testing on a 4090, the streaming API returns the first chunk in 65 ms, and the timbre can be kept fixed. It's so fast, so fast, that I suspected my computer was broken.

On a 4090, for text that takes 3.3 s with compile=True, the main-branch vLLM acceleration takes 1.6 s but the timbre cannot be adjusted; with this PR's version the timbre can be adjusted, but it only gets down to about 2.6 s.

65 ms for the first chunk from the streaming API: could you teach me (I'm happy to pay) how to play the first synthesized chunk as soon as it arrives?

xiaohua-drive avatar Dec 20 '24 11:12 xiaohua-drive
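Regarding the question above about playing the first synthesized chunk immediately, a rough sketch of one possible approach is shown below. It assumes a streaming HTTP endpoint that returns raw 16-bit mono PCM at 24 kHz, which may not match the actual API; the URL, payload, and audio format are all placeholders.

```python
import requests
import sounddevice as sd

# Hypothetical endpoint and payload; adjust to the real API.
URL = "http://localhost:8000/v1/audio/speech"
payload = {"model": "chattts", "input": "Streaming playback test.", "response_format": "pcm"}

# Assumes the server streams raw 16-bit mono PCM at 24 kHz, which may differ.
with sd.RawOutputStream(samplerate=24000, channels=1, dtype="int16") as stream:
    with requests.post(URL, json=payload, stream=True) as resp:
        resp.raise_for_status()
        buf = b""
        for chunk in resp.iter_content(chunk_size=4800):  # ~100 ms of audio
            buf += chunk
            usable = len(buf) - (len(buf) % 2)  # keep 16-bit frames aligned
            if usable:
                stream.write(buf[:usable])      # play as soon as data arrives
                buf = buf[usable:]
```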

After enabling vLLM acceleration, streaming output doesn't seem to work anymore.

ai408 avatar Mar 21 '25 08:03 ai408

@fengyizhu @fumiama After switching to vLLM, is batch inference still possible? That is, in chat.infer(texts, split_text=False), when texts is passed as a list, the inferred audio comes out 0 s long.

I'd also like to ask: has the issue of vLLM only supporting a single timbre been resolved yet?

CJY1018 avatar Mar 26 '25 12:03 CJY1018
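For context on the batch-inference question above, the call being described looks roughly like the following sketch; it assumes the standard ChatTTS Python API, and the exact parameter set of chat.infer (including whether split_text is available) may differ between versions.

```python
import ChatTTS

# Sketch of the batch call being described; signatures may vary by version.
chat = ChatTTS.Chat()
chat.load(compile=False)

texts = [
    "First sentence to synthesize.",
    "Second sentence to synthesize.",
]

# Passing a list of texts; split_text=False keeps each entry as one utterance.
wavs = chat.infer(texts, split_text=False)
print([len(w) for w in wavs])  # each length should be non-zero
```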

@fengyizhu @fumiama After switching to vLLM, is batch inference still possible? That is, in chat.infer(texts, split_text=False), when texts is passed as a list, the inferred audio comes out 0 s long.

I'd also like to ask: has the issue of vLLM only supporting a single timbre been resolved yet?

Yes.

fengyizhu avatar Apr 21 '25 01:04 fengyizhu

@fengyizhu @fumiama After switching to vLLM, is batch inference still possible? That is, in chat.infer(texts, split_text=False), when texts is passed as a list, the inferred audio comes out 0 s long.

I'd also like to ask: has the issue of vLLM only supporting a single timbre been resolved yet?

Fixed version: https://github.com/fengyizhu/ChatTTS-VLLM

fengyizhu avatar Jun 12 '25 23:06 fengyizhu