Deploying qwen2.5-vl-7b-instruct: with 5 concurrent requests, each carrying 20 images, xinference blocks, while the underlying inference engine used directly does not
System Info
vllm 0.7.3 xinference v1.3.1.post1
Running Xinference with Docker?
- [x] docker
- [ ] pip install
- [ ] installation from source
Version info
vllm 0.7.3 xinference v1.3.1.post1
The command used to start Xinference
1. Start the container:
   docker run -d --env NVIDIA_VISIBLE_DEVICES=0,1,2,3 -v /dev/shm:/dev/shm -v /modelscope:/modelscope xprobe/xinference:v1.3.1.post1 sleep 300000000
2. Inside the container, launch the model:
   xinference launch --endpoint "http://0.0.0.0:9997" --model-path /modelscope/Qwen2.5-VL-7B-Instruct/ --model-type LLM --model-uid 7b --model-engine vllm --n-gpu 2 --limit_mm_per_prompt '{"image":20}' --max_model_len 32000

(The full reproduction steps and the log at the point of the hang are given in the Reproduction section below.)
Reproduction
1. Start the container:
   docker run -d --env NVIDIA_VISIBLE_DEVICES=0,1,2,3 -v /dev/shm:/dev/shm -v /modelscope:/modelscope xprobe/xinference:v1.3.1.post1 sleep 300000000
2. Inside the container, launch the model:
   xinference launch --endpoint "http://0.0.0.0:9997" --model-path /modelscope/Qwen2.5-VL-7B-Instruct/ --model-type LLM --model-uid 7b --model-engine vllm --n-gpu 2 --limit_mm_per_prompt '{"image":20}' --max_model_len 32000
3. Send a first round of 5 concurrent requests, each carrying 20 images. Every request in this round completes.
4. Send a second round of 5 concurrent requests. The xinference log prints the following and then goes silent; the requests block and the service becomes unavailable:

   2025-03-30 19:03:46,481 xinference.core.worker 440 DEBUG Leave get_model, elapsed time: 0 s
   2025-03-30 19:03:46,482 xinference.core.supervisor 440 DEBUG [request 6147c72a-0dd4-11f0-a6da-024265656508] Leave get_model, elapsed time: 0 s
   2025-03-30 19:03:46,484 xinference.core.supervisor 440 DEBUG [request 614846d2-0dd4-11f0-a6da-024265656508] Enter describe_model, args: <xinference.core.supervisor.SupervisorActor object at 0x7f46e937a480>,7b, kwargs:
   2025-03-30 19:03:46,484 xinference.core.worker 440 DEBUG Enter describe_model, args: <xinference.core.worker.WorkerActor object at 0x7f46e93b0bd0>, kwargs: model_uid=7b-0
   2025-03-30 19:03:46,484 xinference.core.worker 440 DEBUG Leave describe_model, elapsed time: 0 s
   2025-03-30 19:03:46,485 xinference.core.supervisor 440 DEBUG [request 614846d2-0dd4-11f0-a6da-024265656508] Leave describe_model, elapsed time: 0 s
   2025-03-30 19:03:46,689 xinference.core.supervisor 440 DEBUG [request 6167a810-0dd4-11f0-a6da-024265656508] Enter get_model, args: <xinference.core.supervisor.SupervisorActor object at 0x7f46e937a480>,7b, kwargs:
   2025-03-30 19:03:46,690 xinference.core.worker 440 DEBUG Enter get_model, args: <xinference.core.worker.WorkerActor object at 0x7f46e93b0bd0>, kwargs: model_uid=7b-0
   2025-03-30 19:03:46,690 xinference.core.worker 440 DEBUG Leave get_model, elapsed time: 0 s
   2025-03-30 19:03:46,690 xinference.core.supervisor 440 DEBUG [request 6167a810-0dd4-11f0-a6da-024265656508] Leave get_model, elapsed time: 0 s
   2025-03-30 19:03:46,692 xinference.core.supervisor 440 DEBUG [request 6168231c-0dd4-11f0-a6da-024265656508] Enter describe_model, args: <xinference.core.supervisor.SupervisorActor object at 0x7f46e937a480>,7b, kwargs:
   2025-03-30 19:03:46,693 xinference.core.worker 440 DEBUG Enter describe_model, args: <xinference.core.worker.WorkerActor object at 0x7f46e93b0bd0>, kwargs: model_uid=7b-0
   2025-03-30 19:03:46,693 xinference.core.worker 440 DEBUG Leave describe_model, elapsed time: 0 s
   2025-03-30 19:03:46,694 xinference.core.supervisor 440 DEBUG [request 6168231c-0dd4-11f0-a6da-024265656508] Leave describe_model, elapsed time: 0 s
   2025-03-30 19:03:46,886 xinference.core.supervisor 440 DEBUG [request 6185bed6-0dd4-11f0-a6da-024265656508] Enter get_model, args: <xinference.core.supervisor.SupervisorActor object at 0x7f46e937a480>,7b, kwargs:
   2025-03-30 19:03:46,887 xinference.core.worker 440 DEBUG Enter get_model, args: <xinference.core.worker.WorkerActor object at 0x7f46e93b0bd0>, kwargs: model_uid=7b-0
   2025-03-30 19:03:46,887 xinference.core.worker 440 DEBUG Leave get_model, elapsed time: 0 s
   2025-03-30 19:03:46,888 xinference.core.supervisor 440 DEBUG [request 6185bed6-0dd4-11f0-a6da-024265656508] Leave get_model, elapsed time: 0 s
   2025-03-30 19:03:46,890 xinference.core.supervisor 440 DEBUG [request 61863dde-0dd4-11f0-a6da-024265656508] Enter describe_model, args: <xinference.core.supervisor.SupervisorActor object at 0x7f46e937a480>,7b, kwargs:
   2025-03-30 19:03:46,890 xinference.core.worker 440 DEBUG Enter describe_model, args: <xinference.core.worker.WorkerActor object at 0x7f46e93b0bd0>, kwargs: model_uid=7b-0
   2025-03-30 19:03:46,890 xinference.core.worker 440 DEBUG Leave describe_model, elapsed time: 0 s
   2025-03-30 19:03:46,891 xinference.core.supervisor 440 DEBUG [request 61863dde-0dd4-11f0-a6da-024265656508] Leave describe_model, elapsed time: 0 s
   ^C2025-03-30 19:08:37,869 xinference.core.supervisor 440 DEBUG [request 0ef630dc-0dd5-11f0-a6da-024265656508] Enter remove_worker, args: <xinference.core.supervisor.SupervisorActor object at 0x7f46e937a480>,0.0.0.0:33152, kwargs:

5. Deploying the same model with vLLM directly, all 5 concurrent requests complete.
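For reference, a minimal sketch of a client that drives the load pattern above (several concurrent, non-streaming requests, each carrying many images). This is not the reporter's actual test script; the endpoint, model uid, and prompt text are assumptions taken from the launch command:

```python
import base64
import concurrent.futures
import json
import urllib.request

ENDPOINT = "http://0.0.0.0:9997/v1/chat/completions"  # assumed from the launch command
MODEL_UID = "7b"

def build_payload(images, stream=False):
    """Build one chat request carrying `images` (raw bytes) as data URLs."""
    content = [{"type": "text", "text": "Describe these images."}]
    for img in images:
        b64 = base64.b64encode(img).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return {
        "model": MODEL_UID,
        "messages": [{"role": "user", "content": content}],
        "stream": stream,  # stream=False is the mode reported to hang
    }

def send_one(images):
    """POST one non-streaming request and return the parsed response."""
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(build_payload(images)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def run_round(images, concurrency=5):
    """Fire `concurrency` identical requests at once; returns when all finish."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(send_one, [images] * concurrency))
```

Calling `run_round` twice with 20 images per request would correspond to the two rounds in steps 3 and 4; per the report, the second call never returns.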
Expected behavior
xinference should accept concurrent requests normally.
In short: two rounds, 5 requests each, and the second round hangs?
@qinxuye Correct, that is exactly the symptom.
@qinxuye One more detail: streaming requests do not hit this; only non-streaming requests do.
I get an error whenever a single request contains 3 or more images. Is there a workaround?
Odd. We'll look into it.
It may be the limit_mm_per_prompt parameter: vLLM needs it to decide how many images a prompt may contain at most, and the default is 2.
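In vLLM's Python API this cap is passed as a dict per modality, e.g. `LLM(..., limit_mm_per_prompt={"image": 20})`, while on the xinference CLI it arrives as a JSON string that has to be parsed first. A hypothetical sketch of that parsing step (the helper name is an illustration, not xinference's actual code):

```python
import json

def parse_limit_mm_per_prompt(raw: str) -> dict:
    """Turn the CLI value '{"image": 20}' into the modality -> count dict
    that vLLM expects for limit_mm_per_prompt."""
    limits = json.loads(raw)
    return {modality: int(count) for modality, count in limits.items()}

limit = parse_limit_mm_per_prompt('{"image": 20}')
```

Note that the input must already be a JSON string; handing the parser a bare number is exactly the failure mode reported further down.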
Setting limit_mm_per_prompt to 4 in the web UI, the model fails to start with: Server error: 500 - [address=0.0.0.0:38827, pid=2768712] the JSON object must be str, bytes or bytearray, not int
@kandada Try JSON format, e.g. {"image":10}.
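The 500 error above is consistent with the UI value reaching `json.loads` as a plain int rather than a JSON string; `json.loads` only accepts str, bytes, or bytearray, which is why entering a bare `4` fails while the string `{"image": 4}` works. A small illustration:

```python
import json

# A JSON string parses fine:
limits = json.loads('{"image": 4}')

# A bare int does not -- this raises the same TypeError quoted in the report:
try:
    json.loads(4)
except TypeError as e:
    message = str(e)
```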
This issue is stale because it has been open for 7 days with no activity.
This issue was closed because it has been inactive for 5 days since being marked as stale.
@qinxuye Any idea what the problem is yet? I manually upgraded vLLM inside the container with pip install vllm==0.7.3; could it be a version conflict?
@qinxuye By the way, reproducing this is simpler than described: the qwen2-vl series has the same problem. Just send 5 concurrent non-streaming requests, each with a somewhat large image, and the entire second round of requests blocks.
The latest version adds a max_pixels limit on images; not sure whether that matters. Could you try again once the new release is out?
@qinxuye Tried the new version; the problem is still there.
What the hang looks like: of the two GPUs serving the model, one stays at 100% utilization while the other sits at 0, and CPU usage is also at 100%. After stopping xinference with Ctrl+C, the memory on the 100%-utilization GPU is never released. The strange part is that with the same request, streaming works fine while non-streaming hangs, even though both should call the same vLLM interface with the same parameters.
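For reference, the only client-side difference between the two modes is the `stream` flag in the request body (model uid assumed from the launch command); the hang is reported only when it is false:

```python
import json

def chat_payload(prompt: str, stream: bool) -> bytes:
    """Build a chat request body; the only difference between modes is `stream`."""
    return json.dumps({
        "model": "7b",  # model uid from the launch command (assumed)
        "messages": [{"role": "user", "content": prompt}],
        # stream=True returns SSE chunks as they are generated;
        # stream=False (the failing mode) returns a single JSON body at the end.
        "stream": stream,
    }).encode()

streaming = json.loads(chat_payload("describe the image", stream=True))
blocking = json.loads(chat_payload("describe the image", stream=False))
```

That the two modes differ only in this flag, yet behave differently under concurrency, suggests the problem lies in how the non-streaming path collects the finished result rather than in the request itself.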
That's bizarre. Could you check whether the vLLM process has turned into a zombie?
The process occupying the GPU and CPU is in the running state at first, then switches to sleeping.
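One way to check these states on Linux is to read the `State:` field from `/proc/<pid>/status` for the pid that `nvidia-smi` or `top` reports for the stuck worker (a minimal sketch; the pid to inspect is whatever your tooling shows):

```python
import os

def proc_state(pid: int) -> str:
    """Return the one-letter Linux process state for `pid`:
    R (running), S (sleeping), D (uninterruptible sleep),
    Z (zombie), T (stopped)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("State:"):
                # The line looks like "State:\tS (sleeping)"; keep the letter.
                return line.split()[1]
    raise ValueError(f"no State field for pid {pid}")

# Smoke test on the current process:
state = proc_state(os.getpid())
```

A state of Z would point at the zombie scenario raised above; the running-then-sleeping pattern described here instead suggests the process is alive but blocked waiting on something.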
@qinxuye Do you have plans to look into this? It is quite easy to trigger: any non-streaming, concurrent scenario hits it.
@qinxuye Has this been reproduced? The latest version does not seem to fix it.
@amumu96 Please help track this down.
I deployed the model with the latest xinference and could not reproduce the problem. Could you go into the container and pull the console output from when the operation runs?
I have been hitting this since last December, starting a VL model with xinference's default parameters: under concurrency it hangs almost immediately, even with only one image per request. I have upgraded through several xinference versions and the problem persists; I am on 1.5.1 now and it still occurs. Has any recent version improved this?
Could you share the test script? I'll take a look.