
LLM streaming output stutters

Open uncleunclelu opened this issue 2 months ago • 2 comments

System Info

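Xinference image version: xprobe/xinference v1.6.1. Model: qwen2.5-32b. Inference engine: transformer.

I deployed the qwen2.5-32b model with Xinference using the transformer engine. During inference, output is only flushed once every several data chunks, so the frontend (both the built-in Gradio UI and a Dify integration) displays text in bursts of ten or more characters at a time, which feels very choppy.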

Running Xinference with Docker?

  • [x] docker
  • [ ] pip install
  • [ ] installation from source

Version info

Xinference image version: xprobe/xinference v1.6.1. Model: qwen2.5-32b. Inference engine: transformer.

The command used to start Xinference

Started via docker-compose with the following configuration file:

    version: '3.1'
    services:
      rag-xinference:
        image: xprobe/xinference:v1.6.1
        container_name: RAG-Xinference
        restart: always
        ports:
          - "13402:9997"
        privileged: true
        volumes:
          - ./xinference:/root/.xinference
        environment:
          - XINFERENCE_MODEL_SRC=modelscope
          - XINFERENCE_HOME=/root/.xinference
          - NVIDIA_VISIBLE_DEVICES=0,1
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  capabilities: [gpu]
        # startup command
        command: xinference-local -H 0.0.0.0

Reproduction

For example, in the output below, data chunks 1-8 are returned together in one batch, chunks 9-16 are returned together in the next batch, and so on.

data: {"id": "chat14ee3bf3-9565-40fc-93f4-3550b561bf4a", "model": "qwen2.5-instruct", "created": 1761881708, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}, "finish_reason": null}]}

data: {"id": "chat14ee3bf3-9565-40fc-93f4-3550b561bf4a", "model": "qwen2.5-instruct", "created": 1761881708, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "题目:"}, "finish_reason": null}], "usage": null}

data: {"id": "chat14ee3bf3-9565-40fc-93f4-3550b561bf4a", "model": "qwen2.5-instruct", "created": 1761881708, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "窗外的世界"}, "finish_reason": null}], "usage": null}

data: {"id": "chat14ee3bf3-9565-40fc-93f4-3550b561bf4a", "model": "qwen2.5-instruct", "created": 1761881708, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "\n\n打开"}, "finish_reason": null}], "usage": null}

data: {"id": "chat14ee3bf3-9565-40fc-93f4-3550b561bf4a", "model": "qwen2.5-instruct", "created": 1761881708, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "窗户,"}, "finish_reason": null}], "usage": null}

data: {"id": "chat14ee3bf3-9565-40fc-93f4-3550b561bf4a", "model": "qwen2.5-instruct", "created": 1761881708, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "清新的"}, "finish_reason": null}], "usage": null}

data: {"id": "chat14ee3bf3-9565-40fc-93f4-3550b561bf4a", "model": "qwen2.5-instruct", "created": 1761881709, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "空气迎"}, "finish_reason": null}], "usage": null}

data: {"id": "chat14ee3bf3-9565-40fc-93f4-3550b561bf4a", "model": "qwen2.5-instruct", "created": 1761881709, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "面扑"}, "finish_reason": null}], "usage": null}

data: {"id": "chat14ee3bf3-9565-40fc-93f4-3550b561bf4a", "model": "qwen2.5-instruct", "created": 1761881709, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "来,"}, "finish_reason": null}], "usage": null}

data: {"id": "chat14ee3bf3-9565-40fc-93f4-3550b561bf4a", "model": "qwen2.5-instruct", "created": 1761881709, "object": "chat.completion.chunk", "choices": [{"index": 0, "delta": {"content": "带着花"}, "finish_reason": null}], "usage": null}
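To check whether chunks really arrive in batches rather than one at a time, the arrival time of each `data:` line can be recorded on the client and grouped into bursts. The following is a minimal sketch, not part of the original report: the function names `parse_sse_chunks` and `group_into_bursts` and the 50 ms `gap` threshold are illustrative assumptions, and the timestamps are assumed to be taken with something like `time.monotonic()` as each line is read from the streaming response.

```python
import json
from typing import Iterable, List, Tuple


def parse_sse_chunks(lines: Iterable[str]) -> List[str]:
    """Extract delta.content strings from OpenAI-style `data:` SSE lines."""
    contents: List[str] = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data field
        payload = line[len("data:"):].strip()
        if not payload or payload == "[DONE]":
            continue  # end-of-stream sentinel carries no JSON
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # The first chunk may carry only the role; treat missing content as "".
            contents.append(choice.get("delta", {}).get("content") or "")
    return contents


def group_into_bursts(
    arrivals: List[Tuple[float, str]], gap: float = 0.05
) -> List[List[str]]:
    """Group (arrival_time_seconds, content) pairs into bursts.

    A chunk starts a new burst when it arrives more than `gap` seconds
    after the previous chunk (hypothetical threshold; tune as needed).
    """
    bursts: List[List[str]] = []
    last_time = None
    for arrival_time, content in arrivals:
        if last_time is None or arrival_time - last_time > gap:
            bursts.append([])
        bursts[-1].append(content)
        last_time = arrival_time
    return bursts
```

If the grouped output shows bursts of roughly eight chunks each, the batching is happening before the data reaches the client (in the server or a proxy in between), not in the frontend rendering.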

Expected behavior

Streaming responses should work normally, with each chunk delivered as it is generated.

uncleunclelu, Oct 31 '25 03:10

If performance is the priority, you can try the vLLM backend.

qinxuye, Nov 03 '25 03:11

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot], Nov 10 '25 19:11

This issue was closed because it has been inactive for 5 days since being marked as stale.

github-actions[bot], Nov 16 '25 19:11