inference: CosyVoice streaming output fails with "Parallel generation is not supported by llama-cpp-python"

Ubuntu 22.04, CUDA 12.4

0.15.2

xinference-local --host 0.0.0.0 --port 9997

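CosyVoice streaming output does not work for me, either from an ipynb notebook on Windows or from the command line on Ubuntu.

Someone also mentioned this problem in 2278, but there seems to have been no follow-up.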

Sep 30 '24 08:09 nooooom

The main cause is that llama cpp is not thread-safe, so running inference for multiple requests at the same time will crash: https://github.com/abetlen/llama-cpp-python/issues/471
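
As a rough illustration of the point above, a caller that must share a backend that is not thread-safe can serialize access with a lock. This is only a sketch, not Xinference or llama-cpp-python code: `FakeModel` and `SerializedModel` are hypothetical names, and `FakeModel.generate` merely stands in for a real inference call.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class FakeModel:
    """Hypothetical stand-in for a backend that is not thread-safe."""

    def __init__(self):
        self.active = 0      # calls currently inside generate()
        self.max_active = 0  # highest overlap observed so far

    def generate(self, text):
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        time.sleep(0.01)     # pretend to do some inference work
        result = text.upper()
        self.active -= 1
        return result


class SerializedModel:
    """Wrapper that lets only one caller into the model at a time."""

    def __init__(self, model):
        self._model = model
        self._lock = threading.Lock()

    def generate(self, text):
        # Other threads block here until the current call returns.
        with self._lock:
            return self._model.generate(text)


model = FakeModel()
safe = SerializedModel(model)
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(safe.generate, ["a", "b", "c", "d"]))
```

With the lock in place, concurrent callers queue up instead of overlapping, so `model.max_active` never exceeds 1 in this sketch.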

Sep 30 '24 14:09 codingl2k1

abetlen/llama-cpp-python#471

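I'm sure I was the only one sending requests at the time and there was no concurrency: I deployed the model myself and nobody else knows about this service. Could some other bug be causing the concurrency? 🤣

After I posted the issue above, setting stream to false also produced this error. It only went back to normal after I restarted xinference and the model.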

Sep 30 '24 14:09 nooooom

I'll try to reproduce it. If there were no concurrent requests, there must be a bug somewhere.

Sep 30 '24 14:09 codingl2k1

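Running CosyVoice streaming generation multiple times works fine for me; it's just that parallel generation is not enabled. If the previous streaming generation has not finished when a second request arrives, it does return an error. I tested it like this: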
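
The behavior described above can be illustrated with a toy model. This is not the Xinference implementation and not the test referred to above; `OneAtATimeTTS` is a hypothetical name, and the guard is just one plausible way to reject a request while an earlier stream is still open.

```python
import threading


class OneAtATimeTTS:
    """Hypothetical toy server: one open stream at a time, others rejected."""

    def __init__(self):
        self._busy = threading.Lock()

    def stream(self, text):
        # Non-blocking acquire: fail fast instead of queueing the request.
        if not self._busy.acquire(blocking=False):
            raise RuntimeError("Parallel generation is not supported")
        try:
            for word in text.split():
                yield word.encode()  # stand-in for an audio chunk
        finally:
            # Runs when the stream is exhausted or closed.
            self._busy.release()


tts = OneAtATimeTTS()

first = tts.stream("hello streaming world")
first_chunk = next(first)  # the first stream is now open

try:
    next(tts.stream("second request"))  # arrives before the first finishes
    rejected = False
except RuntimeError:
    rejected = True

rest = list(first)  # draining the first stream releases the guard
after = list(tts.stream("third request"))  # sequential requests succeed
```

In this sketch the second request fails only because the first stream has not been drained yet; once it finishes, the next request goes through.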

Oct 01 '24 18:10 codingl2k1

This issue is stale because it has been open for 7 days with no activity.

Oct 08 '24 19:10 github-actions[bot]

This issue was closed because it has been inactive for 5 days since being marked as stale.

Oct 13 '24 19:10 github-actions[bot]