The qwen2.5-instruct model responds very slowly
System Info
CUDA Version: 12.6; GPU: RTX 4070 Super with 12 GB of VRAM
Running Xinference with Docker?
- [x] docker
- [ ] pip install
- [ ] installation from source

Version info
xinference, version 1.4.0
The command used to start Xinference
Xinference is started via docker compose; the docker-compose.yml is as follows:

```yaml
services:
  xinference:
    image: xprobe/xinference:v1.4.0
    container_name: xinference
    ports:
      - "9997:9997"
    volumes:
      - /home/eddie/dev/docker-service/data/xinference/.xinference:/root/.xinference
      - /home/eddie/dev/docker-service/data/xinference/.cache/huggingface:/root/.cache/huggingface
      - /home/eddie/dev/docker-service/data/xinference/.cache/modelScope:/root/.cache/modelScope
      - /home/eddie/dev/docker-service/data/xinference/log:/xinference/logs
    environment:
      - XINFERENCE_HOME=/xinference
      - XINFERENCE_MODEL_SRC=modelscope
    restart: always
    command: xinference-local -H 0.0.0.0 --log-level debug
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: ["gpu"]
              count: all
```
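For reference, one quick way to confirm the container actually sees the GPU (assuming the service name `xinference` from the compose file above):

```bash
# Run nvidia-smi inside the running container; the RTX 4070 Super should be listed.
docker compose exec xinference nvidia-smi
```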
Reproduction
Steps to reproduce:
1. Register a custom model with the following definition:
```json
{
  "version": 1,
  "model_name": "qwen2.5-instruct-eddie",
  "model_description": "qwen2.5-instruct-eddie",
  "context_length": 8000,
  "model_lang": [
    "en",
    "zh"
  ],
  "model_ability": [
    "chat"
  ],
  "model_family": "qwen2.5-instruct",
  "model_specs": [
    {
      "model_uri": "/root/.cache/modelScope/models/qwen/Qwen2___5-14B-Instruct",
      "model_size_in_billions": 14,
      "model_format": "pytorch",
      "quantizations": [
        "none"
      ]
    }
  ],
  "chat_template": "{%- if tools %}\n {{- '<|im_start|>system\n' }}\n {%- if messages[0]['role'] == 'system' %}\n {{- messages[0]['content'] }}\n {%- else %}\n {{- 'You are Qwen, created by Alibaba Cloud. You are a helpful assistant.' }}\n {%- endif %}\n {{- \"\n\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within
```

(The `chat_template` value is truncated.)
2. Launch the model; in command-line form:

```bash
xinference launch --model-name qwen2.5-instruct-eddie --model-type LLM \
  --model-engine Transformers --model-format pytorch --size-in-billions 14 \
  --quantization none --n-gpu 1 --replica 1 --n-worker 1
```
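To confirm the model actually came up, something like the following should list it (assuming the default endpoint on port 9997):

```bash
# List running models; qwen2.5-instruct-eddie should appear here.
xinference list --endpoint http://localhost:9997
```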
3. Start a chat session. Responses are very sluggish even though the GPU is in use; memory usage: 9276MiB / 12282MiB.
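A rough way to quantify the latency is to time a single request against the OpenAI-compatible API that Xinference serves on port 9997 (a sketch assuming the model UID defaults to the model name, with an arbitrary prompt):

```bash
# Time one chat completion end to end.
time curl -s http://localhost:9997/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-instruct-eddie",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```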
4. The container logs are as follows:
Expected behavior
Please explain what is causing the slow model responses, and what the corresponding fix would be.
Install flash_attn in the image:

```bash
pip install flash-attn --no-build-isolation
```
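A minimal check that the package installed and imports cleanly inside the container (assuming the image's default Python environment):

```bash
# Import flash_attn and print its version; an ImportError means the install failed.
python -c "import flash_attn; print(flash_attn.__version__)"
```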
Hold on; we are releasing 1.5.0.post2 today, which will bundle flash_attn.
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 5 days since being marked as stale.