inference 显存占用，但是模型不能调用

System Info / 系統信息

Cuda version: 12.8 python version: 3.10.14 dockers version: 24.0.2 xinference version: v1.3.1.post1

Running Xinference with Docker? / 是否使用 Docker 运行 Xinfernece？

[x] docker / docker
[ ] pip install / 通过 pip install 安装
[ ] installation from source / 从源码安装

Version info / 版本信息

v1.3.1.post1

The command used to start Xinference / 用以启动 xinference 的命令

docker run -d
--name xinference
-e XINFERENCE_HOME=/models
-v /data/model/llm/pre_train/:/models
-p 6018:9997
--gpus all
registry.cn-hangzhou.aliyuncs.com/xprobe_xinference/xinference:v1.3.1.post1
xinference-local -H 0.0.0.0

Reproduction / 复现过程

下载qwen2.5-32b-instruct-q5_k_m,部署在一张80GA100显卡上；
启动一个并发任务跑一段时间后，此时会出现2种情况：

模型在后台重新加载
模型显示仍然在线，显存也在占用，但是调用无响应，无法从前端卸载模型，此时只能重启docker容器

Expected behavior / 期待表现

期待能修复问题，能正常使用

Apr 17 '25 03:04 wanyuks

我也有类似问题，并发跑一段时间后请求就会阻塞住，显存是正常占用的

Apr 17 '25 03:04 kelliaao

都是什么引擎？

Apr 17 '25 03:04 qinxuye

都是什么引擎？

llama.cpp

Apr 17 '25 03:04 wanyuks

都是什么引擎？

llama.cpp

开启 xllamacpp 了吗

Apr 17 '25 03:04 qinxuye

都是什么引擎？

我用的是vllm

Apr 17 '25 03:04 kelliaao

都是什么引擎？

我用的是vllm

vllm 有没有出现 crash 的情况？可能是 vllm 已经死掉了，自动恢复没有起作用。

Apr 17 '25 03:04 qinxuye

docker部署xinference后启动rerank模型，出现显存占用不释放问题；
xinference基础信息：Name: xinferenceVersion: 1.4.0 Summary: Model Serving Made Easy Home-page: https://github.com/xorbitsai/inference Author: Qin Xuye Author-email: [email protected] License: Apache License 2.0 Location: /usr/local/lib/python3.10/dist-packages Requires: aioprometheus, async-timeout, click, fastapi, gradio, huggingface-hub, modelscope, nvidia-ml-py, openai, passlib, peft, pillow, pydantic, pynvml, python-jose, requests, setproctitle, sse-starlette, tabulate, timm, torch, tqdm, typing-extensions, uvicorn, xoscar
启动是显存占用情况如下：