The qwen2.5-instruct model responds very slowly
System Info
CUDA Version: 12.6; GPU: RTX 4070 Super with 12 GB of VRAM
Running Xinference with Docker?
- [x] docker
- [ ] pip install
- [ ] installation from source

Version info
xinference, version 1.4.0
The command used to start Xinference
Xinference is started via docker compose; the docker-compose.yml is as follows:

```yaml
services:
  xinference:
    image: xprobe/xinference:v1.4.0
    container_name: xinference
    ports:
      - "9997:9997"
    volumes:
      - /home/eddie/dev/docker-service/data/xinference/.xinference:/root/.xinference
      - /home/eddie/dev/docker-service/data/xinference/.cache/huggingface:/root/.cache/huggingface
      - /home/eddie/dev/docker-service/data/xinference/.cache/modelScope:/root/.cache/modelScope
      - /home/eddie/dev/docker-service/data/xinference/log:/xinference/logs
    environment:
      - XINFERENCE_HOME=/xinference
      - XINFERENCE_MODEL_SRC=modelscope
    restart: always
    command: xinference-local -H 0.0.0.0 --log-level debug
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: ["gpu"]
              count: all
```
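For reference, one quick way to confirm the container actually sees the GPU (assuming the service name `xinference` from the compose file above):

```bash
# Run nvidia-smi inside the running container; the RTX 4070 Super should be listed.
docker compose exec xinference nvidia-smi
```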
Reproduction
Steps to reproduce:
1. Register a custom model with the following definition:
```json
{
  "version": 1,
  "model_name": "qwen2.5-instruct-eddie",
  "model_description": "qwen2.5-instruct-eddie",
  "context_length": 8000,
  "model_lang": [
    "en",
    "zh"
  ],
  "model_ability": [
    "chat"
  ],
  "model_family": "qwen2.5-instruct",
  "model_specs": [
    {
      "model_uri": "/root/.cache/modelScope/models/qwen/Qwen2___5-14B-Instruct",
      "model_size_in_billions": 14,
      "model_format": "pytorch",
      "quantizations": [
        "none"
      ]
    }
  ],
  "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within
```

(The `chat_template` value is truncated.)
2. Launch the model; in command-line form:

```bash
xinference launch --model-name qwen2.5-instruct-eddie --model-type LLM \
  --model-engine Transformers --model-format pytorch --size-in-billions 14 \
  --quantization none --n-gpu 1 --replica 1 --n-worker 1
```
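To confirm the model actually came up, something like the following should list it (assuming the default endpoint on port 9997):

```bash
# List running models; qwen2.5-instruct-eddie should appear here.
xinference list --endpoint http://localhost:9997
```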
3. Start a chat session. Responses are very sluggish even though the GPU is in use; memory usage: 9276MiB / 12282MiB.
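A rough way to quantify the latency is to time a single request against the OpenAI-compatible API that Xinference serves on port 9997 (a sketch assuming the model UID defaults to the model name, with an arbitrary prompt):

```bash
# Time one chat completion end to end.
time curl -s http://localhost:9997/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-instruct-eddie",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```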
4. The container logs are as follows:
Expected behavior
Please explain what is causing the slow model responses, and what the corresponding fix would be.
Install flash_attn in the image:

```bash
pip install flash-attn --no-build-isolation
```
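A minimal check that the package installed and imports cleanly inside the container (assuming the image's default Python environment):

```bash
# Import flash_attn and print its version; an ImportError means the install failed.
python -c "import flash_attn; print(flash_attn.__version__)"
```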
Hold on; we are releasing 1.5.0.post2 today, which will bundle flash_attn.
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 5 days since being marked as stale.