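LLM streaming output stutters
System Info
Xinference image version: xprobe/xinference v1.6.1; model: qwen2.5-32b; inference engine: Transformers
I deployed the qwen2.5-32b model with Xinference using the Transformers engine. During inference, output is only returned once every several data chunks, so the front end (both the built-in Gradio UI and a Dify integration) renders text in bursts of a dozen or so characters at a time, which feels very choppy.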
Running Xinference with Docker?
- [x] docker
- [ ] pip install
- [ ] installation from source
Version info
Xinference image version: xprobe/xinference v1.6.1; model: qwen2.5-32b; inference engine: Transformers
The command used to start Xinference
Started via docker-compose with the following configuration file:

    version: '3.1'
    services:
      rag-xinference:
        image: xprobe/xinference:v1.6.1
        container_name: RAG-Xinference
        restart: always
        ports:
          - "13402:9997"
        privileged: true
        volumes:
          - ./xinference:/root/.xinference
        environment:
          - XINFERENCE_MODEL_SRC=modelscope
          - XINFERENCE_HOME=/root/.xinference
          - NVIDIA_VISIBLE_DEVICES=0,1
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  capabilities: [gpu]
        # startup command
        command: xinference-local -H 0.0.0.0
Reproduction
For example, in the output below, data chunks 1-8 are returned together in one batch, chunks 9-16 are returned together in the next batch, and so on.
data: {"id": "chat14ee3bf3-9565-40fc-93f4-3550b561bf4a", "model": "qwen2.5-instruct", "created": 1761881708, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": null}]}
data: {"id": "chat14ee3bf3-9565-40fc-93f4-3550b561bf4a", "model": "qwen2.5-instruct", "created": 1761881708, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "题目:"}, "finish_reason": null}], "usage": null}
data: {"id": "chat14ee3bf3-9565-40fc-93f4-3550b561bf4a", "model": "qwen2.5-instruct", "created": 1761881708, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "窗外的世界"}, "finish_reason": null}], "usage": null}
data: {"id": "chat14ee3bf3-9565-40fc-93f4-3550b561bf4a", "model": "qwen2.5-instruct", "created": 1761881708, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "\n\n打开"}, "finish_reason": null}], "usage": null}
data: {"id": "chat14ee3bf3-9565-40fc-93f4-3550b561bf4a", "model": "qwen2.5-instruct", "created": 1761881708, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "窗户,"}, "finish_reason": null}], "usage": null}
data: {"id": "chat14ee3bf3-9565-40fc-93f4-3550b561bf4a", "model": "qwen2.5-instruct", "created": 1761881708, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "清新的"}, "finish_reason": null}], "usage": null}
data: {"id": "chat14ee3bf3-9565-40fc-93f4-3550b561bf4a", "model": "qwen2.5-instruct", "created": 1761881709, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "空气迎"}, "finish_reason": null}], "usage": null}
data: {"id": "chat14ee3bf3-9565-40fc-93f4-3550b561bf4a", "model": "qwen2.5-instruct", "created": 1761881709, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "面扑"}, "finish_reason": null}], "usage": null}
data: {"id": "chat14ee3bf3-9565-40fc-93f4-3550b561bf4a", "model": "qwen2.5-instruct", "created": 1761881709, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "来,"}, "finish_reason": null}], "usage": null}
data: {"id": "chat14ee3bf3-9565-40fc-93f4-3550b561bf4a", "model": "qwen2.5-instruct", "created": 1761881709, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "带着花"}, "finish_reason": null}], "usage": null}
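To check whether the chunks are already batched when they leave the server (rather than being buffered by Gradio or Dify), one option is to timestamp each SSE line on arrival and count how many chunks land in each burst. The following is a minimal sketch using only the standard library. It assumes the OpenAI-compatible `/v1/chat/completions` route on the host port mapped in the compose file; the helper names (`parse_sse_content`, `group_bursts`, `measure_stream`) and the 50 ms burst threshold are illustrative choices, not part of Xinference.

```python
import json
import time
import urllib.request

# Assumptions: endpoint path, model UID, and the burst threshold are illustrative.
BASE_URL = "http://localhost:13402"  # host port from the compose file
MODEL_UID = "qwen2.5-instruct"       # model name as it appears in the chunks


def parse_sse_content(line: str):
    """Return the delta content of one 'data: {...}' line, or None otherwise."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    payload = line[len("data:"):].strip()
    if not payload or payload == "[DONE]":
        return None
    chunk = json.loads(payload)
    choices = chunk.get("choices") or []
    if not choices:
        return None
    return choices[0].get("delta", {}).get("content")


def group_bursts(arrival_times, gap=0.05):
    """Group timestamps into bursts; a new burst starts when the pause exceeds `gap` seconds."""
    bursts = []
    for t in arrival_times:
        if bursts and t - bursts[-1][-1] <= gap:
            bursts[-1].append(t)
        else:
            bursts.append([t])
    return bursts


def measure_stream(prompt: str, gap: float = 0.05):
    """Stream one chat completion and return the number of chunks in each burst."""
    body = json.dumps({
        "model": MODEL_UID,
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }).encode("utf-8")
    req = urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    arrivals = []
    with urllib.request.urlopen(req) as resp:
        for raw in resp:  # iterates the response line by line as data arrives
            if parse_sse_content(raw.decode("utf-8")) is not None:
                arrivals.append(time.monotonic())
    return [len(b) for b in group_bursts(arrivals, gap)]
```

Calling `measure_stream("...")` against the running container returns a list of burst sizes. If most entries are close to 8, the batching is happening before the data reaches the client; if they are mostly 1, the server is streaming chunk by chunk and the buffering is somewhere downstream.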
Expected behavior
Streaming responses should be delivered normally, chunk by chunk.
If performance is the priority, you can try the vLLM backend.
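For anyone following this suggestion, relaunching the same model on the vLLM engine might look roughly like the sketch below. It assumes the `xinference` Python package is installed on the client side and that `Client.launch_model` accepts these keyword arguments, which should be checked against the documentation for the installed version. `build_vllm_launch_kwargs` and `launch_with_vllm` are hypothetical helper names, `n_gpu=2` is inferred from `NVIDIA_VISIBLE_DEVICES=0,1` in the compose file, and `model_format="pytorch"` is an assumption about which weights are used.

```python
def build_vllm_launch_kwargs(n_gpu: int = 2) -> dict:
    """Keyword arguments for launching qwen2.5-instruct 32B on the vLLM engine (assumed values)."""
    return {
        "model_name": "qwen2.5-instruct",
        "model_engine": "vllm",       # switch from the Transformers engine
        "model_format": "pytorch",    # assumption: unquantized weights
        "model_size_in_billions": 32,
        "n_gpu": n_gpu,               # assumption: both visible GPUs
    }


def launch_with_vllm(endpoint: str = "http://localhost:13402"):
    """Launch the model through the Xinference client; returns the model UID."""
    # Imported lazily so the helper above can be used without the package installed.
    from xinference.client import Client

    client = Client(endpoint)
    return client.launch_model(**build_vllm_launch_kwargs())
```

Whether this removes the batching described above is not confirmed in this thread; the reply only suggests vLLM as the option to try for better performance.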
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 5 days since being marked as stale.