feat(vLLM): support async generator
I have provided a draft version of the vLLM support, with the following key changes:
- Protocol changes: Aligned with OpenAI /v1/audio/speech.
- Removed the custom inference part, keeping the inference logic consistent between streaming and non-streaming.
- Fixed some potential inconsistencies caused by streaming inference.
These changes may have a significant impact. Feel free to leave comments to guide me in further improvements.
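For reference, a minimal client sketch of what a call against an OpenAI-compatible /v1/audio/speech endpoint could look like after this change. The server address, model name, and voice value are placeholders rather than part of this PR; the request fields follow OpenAI's audio/speech schema, and the response body is read as a stream.

```python
# Minimal client sketch for an OpenAI-compatible /v1/audio/speech endpoint.
# The server address, model name, and voice value are placeholders.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/v1/audio/speech",  # placeholder server address
    json={
        "model": "chattts",                    # placeholder model name
        "input": "Hello from the streaming TTS server.",
        "voice": "default",                    # placeholder voice id
        "response_format": "wav",
    },
    stream=True,                               # consume audio chunks as they arrive
)
resp.raise_for_status()

with open("out.wav", "wb") as f:
    for chunk in resp.iter_content(chunk_size=4096):
        if chunk:
            f.write(chunk)
```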
This version is amazing. I tested it on a 4090: the streaming API's first chunk arrives in 65 ms, and the voice stays fixed. It's so fast, so fast, that I suspect my computer is broken.
Thanks for sharing. Can this version keep the same voice as the original version? This is mentioned in issue #640.
It is fully compatible in principle.
On a 4090, a piece of text that takes 3.3 s with compile=True takes about 1.6 s with the main-branch vLLM acceleration, but the voice cannot be adjusted; with this PR's version the voice can be adjusted, but it takes about 2.6 s.
I am curious why we chose to move the vLLM code into this project instead of supporting the LLM parts in vLLM. Supporting the model in vLLM would automatically enable features such as streaming generation and continuous batching.
Also pinging @fumiama for this question.
This part of the code was contributed by the community, so you can ask @ylzz1997 about that. I'm sorry, but I'm not familiar with vLLM 😂.
Because vLLM doesn't support some features:
- custom lm_head
- multi-codebook sampler (custom sampler)
- sampling without a tokenizer
In order for the sampler to support vLLM features such as continuous batching and streaming generation, it is necessary to package the post-model parts (the lm_head and others) into the vLLM scheduler. This requires rewriting part of the vLLM code.
That's all.
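To make the multi-codebook point concrete, here is a minimal sketch (not the actual ChatTTS code; the sizes and names are invented for illustration) of why a single lm_head plus vLLM's built-in single-token sampler is not enough: each decoding step has to project the hidden state through one head per codebook and sample one token per head.

```python
import torch

# Hypothetical sizes for illustration only.
num_vq, hidden, vocab = 4, 768, 626

# One lm_head per codebook instead of the single lm_head vLLM expects.
lm_heads = torch.nn.ModuleList(
    [torch.nn.Linear(hidden, vocab, bias=False) for _ in range(num_vq)]
)

def sample_step(hidden_state: torch.Tensor, temperature: float = 1.0) -> list[int]:
    """Sample one token per codebook from the last hidden state."""
    tokens = []
    for head in lm_heads:
        logits = head(hidden_state) / temperature
        probs = torch.softmax(logits, dim=-1)
        tokens.append(int(torch.multinomial(probs, num_samples=1)))
    return tokens  # a list of ids per step, not a single int

print(sample_step(torch.randn(hidden)))  # e.g. [123, 77, 501, 9]
```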
Thanks for your explanation.
Yes, we cannot use vLLM directly, as it requires some code changes. I have implemented a solution here; I am new to Torch and Python, so feel free to leave any comments. It only covers the llama part:
- Model: https://github.com/niuzheng168/vllm/blob/dev/vllm/model_executor/models/chattts.py
- Sample usage: https://github.com/niuzheng168/vllm/blob/dev/chattts_sample.py
For the issues you listed above:
- We can create multiple lm_heads.
- I noticed it runs slowly when we run the sampler N times, so I made a lite version of the multi-head sampler.
- This is already supported in vLLM by setting detokenize=False (see the sketch after this comment).
One of the main challenges is that vLLM assumes every model output is a single token, i.e. just an int value. However, a TTS system, whether ChatTTS or FishTTS, generates multi-head tokens in one decoding step, so the model output is a token list, which breaks that fundamental design. I had to use many if/else statements to keep the whole pipeline working.
Overall, compared to moving the vLLM code here, implementing the model in vLLM saves effort on features other than core model inference, such as sampling, scheduling, continuous batching, etc. I have also replied in the vLLM roadmap thread to ask whether vLLM can support the model officially; I believe more and more multi-modal models will use a similar architecture, especially gpt-4o-like models.
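For the detokenize point above, this is roughly how raw token ids can be requested from vLLM instead of decoded text; the model path, prompt, and sampling values below are placeholders, and the multi-head packing described above is not shown.

```python
# Sketch: ask vLLM for raw token ids instead of detokenized text.
from vllm import LLM, SamplingParams

llm = LLM(model="path/to/tts-llm")                         # placeholder model path
params = SamplingParams(max_tokens=512, detokenize=False)  # skip tokenizer-based decoding
outputs = llm.generate(["placeholder prompt"], params)     # placeholder prompt
token_ids = outputs[0].outputs[0].token_ids                # raw ids for the audio codec
print(token_ids[:10])
```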
Thanks for your great effort. We welcome you to contribute your code into this repo if you like.
You mentioned the streaming API's first chunk arrives in 65 ms. Could you teach me (I'm willing to pay) how to play the first synthesized chunk as soon as it arrives?
After enabling vLLM acceleration, it seems streaming output no longer works.
@fengyizhu @fumiama After switching to vLLM, is batch inference still possible? That is, when texts is passed as a list to chat.infer(texts, split_text=False), the resulting audio is 0 s long.
I also want to ask whether the issue of vLLM only supporting a single voice has been resolved.
yes
Fixed version: https://github.com/fengyizhu/ChatTTS-VLLM