
Deploying the qwen2.5-vl-7b-instruct model with 5 concurrent requests, each containing 20 images, blocks Xinference, while using the underlying inference engine directly does not

Open kelliaao opened this issue 8 months ago • 20 comments

System Info

vllm 0.7.3, xinference v1.3.1.post1

Running Xinference with Docker?

  • [x] docker
  • [ ] pip install
  • [ ] installation from source

Version info

vllm 0.7.3, xinference v1.3.1.post1

The command used to start Xinference

1. Start the Docker container:
   docker run -d --env NVIDIA_VISIBLE_DEVICES=0,1,2,3 -v /dev/shm:/dev/shm -v /modelscope:/modelscope xprobe/xinference:v1.3.1.post1 sleep 300000000
2. Inside the container, launch the model:
   xinference launch --endpoint "http://0.0.0.0:9997" --model-path /modelscope/Qwen2.5-VL-7B-Instruct/ --model-type LLM --model-uid 7b --model-engine vllm --n-gpu 2 --limit_mm_per_prompt "{\"image\":20}" --max_model_len 32000

Reproduction

1. Start the Docker container:
   docker run -d --env NVIDIA_VISIBLE_DEVICES=0,1,2,3 -v /dev/shm:/dev/shm -v /modelscope:/modelscope xprobe/xinference:v1.3.1.post1 sleep 300000000
2. Inside the container, launch the model:
   xinference launch --endpoint "http://0.0.0.0:9997" --model-path /modelscope/Qwen2.5-VL-7B-Instruct/ --model-type LLM --model-uid 7b --model-engine vllm --n-gpu 2 --limit_mm_per_prompt "{\"image\":20}" --max_model_len 32000
3. Send a first round of 5 concurrent requests, each containing 20 images. All requests in the first round complete.
4. Send a second round of 5 concurrent requests. Xinference prints the log below and then produces no further output; the requests block and the service becomes unavailable:

   2025-03-30 19:03:46,481 xinference.core.worker 440 DEBUG Leave get_model, elapsed time: 0 s
   2025-03-30 19:03:46,482 xinference.core.supervisor 440 DEBUG [request 6147c72a-0dd4-11f0-a6da-024265656508] Leave get_model, elapsed time: 0 s
   2025-03-30 19:03:46,484 xinference.core.supervisor 440 DEBUG [request 614846d2-0dd4-11f0-a6da-024265656508] Enter describe_model, args: <xinference.core.supervisor.SupervisorActor object at 0x7f46e937a480>,7b, kwargs:
   2025-03-30 19:03:46,484 xinference.core.worker 440 DEBUG Enter describe_model, args: <xinference.core.worker.WorkerActor object at 0x7f46e93b0bd0>, kwargs: model_uid=7b-0
   2025-03-30 19:03:46,484 xinference.core.worker 440 DEBUG Leave describe_model, elapsed time: 0 s
   2025-03-30 19:03:46,485 xinference.core.supervisor 440 DEBUG [request 614846d2-0dd4-11f0-a6da-024265656508] Leave describe_model, elapsed time: 0 s
   2025-03-30 19:03:46,689 xinference.core.supervisor 440 DEBUG [request 6167a810-0dd4-11f0-a6da-024265656508] Enter get_model, args: <xinference.core.supervisor.SupervisorActor object at 0x7f46e937a480>,7b, kwargs:
   2025-03-30 19:03:46,690 xinference.core.worker 440 DEBUG Enter get_model, args: <xinference.core.worker.WorkerActor object at 0x7f46e93b0bd0>, kwargs: model_uid=7b-0
   2025-03-30 19:03:46,690 xinference.core.worker 440 DEBUG Leave get_model, elapsed time: 0 s
   2025-03-30 19:03:46,690 xinference.core.supervisor 440 DEBUG [request 6167a810-0dd4-11f0-a6da-024265656508] Leave get_model, elapsed time: 0 s
   2025-03-30 19:03:46,692 xinference.core.supervisor 440 DEBUG [request 6168231c-0dd4-11f0-a6da-024265656508] Enter describe_model, args: <xinference.core.supervisor.SupervisorActor object at 0x7f46e937a480>,7b, kwargs:
   2025-03-30 19:03:46,693 xinference.core.worker 440 DEBUG Enter describe_model, args: <xinference.core.worker.WorkerActor object at 0x7f46e93b0bd0>, kwargs: model_uid=7b-0
   2025-03-30 19:03:46,693 xinference.core.worker 440 DEBUG Leave describe_model, elapsed time: 0 s
   2025-03-30 19:03:46,694 xinference.core.supervisor 440 DEBUG [request 6168231c-0dd4-11f0-a6da-024265656508] Leave describe_model, elapsed time: 0 s
   2025-03-30 19:03:46,886 xinference.core.supervisor 440 DEBUG [request 6185bed6-0dd4-11f0-a6da-024265656508] Enter get_model, args: <xinference.core.supervisor.SupervisorActor object at 0x7f46e937a480>,7b, kwargs:
   2025-03-30 19:03:46,887 xinference.core.worker 440 DEBUG Enter get_model, args: <xinference.core.worker.WorkerActor object at 0x7f46e93b0bd0>, kwargs: model_uid=7b-0
   2025-03-30 19:03:46,887 xinference.core.worker 440 DEBUG Leave get_model, elapsed time: 0 s
   2025-03-30 19:03:46,888 xinference.core.supervisor 440 DEBUG [request 6185bed6-0dd4-11f0-a6da-024265656508] Leave get_model, elapsed time: 0 s
   2025-03-30 19:03:46,890 xinference.core.supervisor 440 DEBUG [request 61863dde-0dd4-11f0-a6da-024265656508] Enter describe_model, args: <xinference.core.supervisor.SupervisorActor object at 0x7f46e937a480>,7b, kwargs:
   2025-03-30 19:03:46,890 xinference.core.worker 440 DEBUG Enter describe_model, args: <xinference.core.worker.WorkerActor object at 0x7f46e93b0bd0>, kwargs: model_uid=7b-0
   2025-03-30 19:03:46,890 xinference.core.worker 440 DEBUG Leave describe_model, elapsed time: 0 s
   2025-03-30 19:03:46,891 xinference.core.supervisor 440 DEBUG [request 61863dde-0dd4-11f0-a6da-024265656508] Leave describe_model, elapsed time: 0 s
   ^C2025-03-30 19:08:37,869 xinference.core.supervisor 440 DEBUG [request 0ef630dc-0dd5-11f0-a6da-024265656508] Enter remove_worker, args: <xinference.core.supervisor.SupervisorActor object at 0x7f46e937a480>,0.0.0.0:33152, kwargs:

5. Deploying the model with vLLM directly instead, all 5 concurrent requests complete.
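
For reference, a minimal sketch of a client that would exercise this reproduction, assuming Xinference's OpenAI-compatible /v1/chat/completions endpoint; the endpoint host, image path, and prompt below are placeholders:

```python
# Hypothetical reproduction client: 5 concurrent non-streaming requests,
# each carrying 20 copies of the same image, sent for two rounds.
import base64
import concurrent.futures
import requests

ENDPOINT = "http://localhost:9997/v1/chat/completions"  # placeholder host
MODEL_UID = "7b"                 # matches --model-uid in the launch command
IMAGE_PATH = "sample.jpg"        # placeholder image
IMAGES_PER_REQUEST = 20
CONCURRENCY = 5

with open(IMAGE_PATH, "rb") as f:
    b64 = base64.b64encode(f.read()).decode()
image_part = {"type": "image_url",
              "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

def one_request(i: int) -> int:
    payload = {
        "model": MODEL_UID,
        "stream": False,  # non-streaming: the mode reported to hang
        "messages": [{
            "role": "user",
            "content": [image_part] * IMAGES_PER_REQUEST
                       + [{"type": "text", "text": "Describe these images."}],
        }],
    }
    resp = requests.post(ENDPOINT, json=payload, timeout=600)
    return resp.status_code

for round_no in (1, 2):  # the hang is reported in the second round
    with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as ex:
        codes = list(ex.map(one_request, range(CONCURRENCY)))
    print(f"round {round_no}: {codes}")
```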

Expected behavior

Xinference should accept concurrent requests normally.

kelliaao avatar Mar 31 '25 02:03 kelliaao

In short: two rounds of 5 requests each, and the second round hangs?

qinxuye avatar Mar 31 '25 05:03 qinxuye

In short: two rounds of 5 requests each, and the second round hangs?

@qinxuye Yes, that's exactly the behavior.

kelliaao avatar Mar 31 '25 06:03 kelliaao

In short: two rounds of 5 requests each, and the second round hangs?

@qinxuye To add: this doesn't happen with streaming requests, only with non-streaming ones.

kelliaao avatar Mar 31 '25 10:03 kelliaao
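
For clarity, the client-side difference between the two modes is just the stream flag on the chat-completions request; a minimal sketch, with the endpoint, model UID, and message as placeholders:

```python
import requests

ENDPOINT = "http://localhost:9997/v1/chat/completions"  # placeholder host
messages = [{"role": "user", "content": "hello"}]       # placeholder message

# Non-streaming (the mode reported to hang under concurrency):
resp = requests.post(ENDPOINT, json={"model": "7b", "messages": messages,
                                     "stream": False}, timeout=600)
print(resp.json()["choices"][0]["message"]["content"])

# Streaming (reported to work): the response arrives as SSE "data:" chunks.
with requests.post(ENDPOINT, json={"model": "7b", "messages": messages,
                                   "stream": True}, stream=True, timeout=600) as r:
    for line in r.iter_lines():
        if line:
            print(line.decode())
```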

In short: two rounds of 5 requests each, and the second round hangs?

@qinxuye To add: this doesn't happen with streaming requests, only with non-streaming ones.

A bit odd. We'll look into it.

qinxuye avatar Mar 31 '25 10:03 qinxuye

I get an error whenever a single request contains 3 or more images. Is there a fix?

In short: two rounds of 5 requests each, and the second round hangs?

@qinxuye To add: this doesn't happen with streaming requests, only with non-streaming ones.

A bit odd. We'll look into it.

kandada avatar Apr 01 '25 07:04 kandada

I get an error whenever a single request contains 3 or more images. Is there a fix?

In short: two rounds of 5 requests each, and the second round hangs?

@qinxuye To add: this doesn't happen with streaming requests, only with non-streaming ones.

A bit odd. We'll look into it.

It's probably the limit_mm_per_prompt parameter: vLLM needs this parameter to decide the maximum number of images per prompt, and the default is 2.

qinxuye avatar Apr 01 '25 07:04 qinxuye
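
For context, this corresponds to vLLM's limit_mm_per_prompt engine argument; a sketch of setting the same limit when using vLLM's Python API directly, with the model path and sizes taken from the launch command above:

```python
# Raise the per-prompt image limit when constructing the vLLM engine directly.
from vllm import LLM

llm = LLM(
    model="/modelscope/Qwen2.5-VL-7B-Instruct",  # path from the reproduction
    max_model_len=32000,
    tensor_parallel_size=2,
    limit_mm_per_prompt={"image": 20},           # allow up to 20 images per prompt
)
```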

I get an error whenever a single request contains 3 or more images. Is there a fix?

In short: two rounds of 5 requests each, and the second round hangs?

@qinxuye To add: this doesn't happen with streaming requests, only with non-streaming ones.

A bit odd. We'll look into it.

It's probably the limit_mm_per_prompt parameter: vLLM needs this parameter to decide the maximum number of images per prompt, and the default is 2.

Setting limit_mm_per_prompt to 4 in the web UI fails to start the model with the error: Server error: 500 - [address=0.0.0.0:38827, pid=2768712] the JSON object must be str, bytes or bytearray, not int

kandada avatar Apr 01 '25 07:04 kandada

I get an error whenever a single request contains 3 or more images. Is there a fix?

In short: two rounds of 5 requests each, and the second round hangs?

@qinxuye To add: this doesn't happen with streaming requests, only with non-streaming ones.

A bit odd. We'll look into it.

It's probably the limit_mm_per_prompt parameter: vLLM needs this parameter to decide the maximum number of images per prompt, and the default is 2.

Setting limit_mm_per_prompt to 4 in the web UI fails to start the model with the error: Server error: 500 - [address=0.0.0.0:38827, pid=2768712] the JSON object must be str, bytes or bytearray, not int

@kandada Try the JSON format, e.g. {"image":10}

kelliaao avatar Apr 02 '25 02:04 kelliaao
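
For what it's worth, the 500 error above matches what json.loads raises when handed a bare integer instead of a JSON string, which is consistent with the suggestion to pass the value as JSON text; a quick illustration:

```python
import json

print(json.loads('{"image": 4}'))   # {'image': 4} -- the dict the engine expects
try:
    json.loads(4)                    # passing a bare number reproduces the error
except TypeError as e:
    print(e)  # the JSON object must be str, bytes or bytearray, not int
```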

In short: two rounds of 5 requests each, and the second round hangs?

@qinxuye To add: this doesn't happen with streaming requests, only with non-streaming ones.

A bit odd. We'll look into it.

Have you been able to reproduce this? I also find it quite strange; could there be a problem somewhere in the non-streaming path?

kelliaao avatar Apr 02 '25 02:04 kelliaao

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] avatar Apr 09 '25 19:04 github-actions[bot]

This issue was closed because it has been inactive for 5 days since being marked as stale.

github-actions[bot] avatar Apr 14 '25 19:04 github-actions[bot]

@qinxuye Have you pinpointed the problem? I manually upgraded vllm inside the container with pip install vllm==0.7.3; could it be a version conflict?

kelliaao avatar Apr 16 '25 08:04 kelliaao

@qinxuye I found the reproduction doesn't need to be this elaborate; the qwen2-vl series has the same problem. With just 5 concurrent non-streaming requests, each carrying a somewhat large image, the second round of requests all block.

kelliaao avatar Apr 16 '25 10:04 kelliaao

The latest version added a max_pixels limit for images; not sure whether that has an effect. Could you try again once the new release is out?

qinxuye avatar Apr 16 '25 13:04 qinxuye

The latest version added a max_pixels limit for images; not sure whether that has an effect. Could you try again once the new release is out?

@qinxuye I tried the new version and the problem is still there.

kelliaao avatar Apr 21 '25 07:04 kelliaao

When it hangs, of the two GPUs the model is deployed on, one stays at 100% utilization while the other sits at 0, and CPU usage is also at 100%. After shutting down Xinference with Ctrl+C, the memory on the 100%-utilization GPU cannot be released. The strange part is that for the same request, streaming works fine while non-streaming hangs, even though both should call the same vLLM interface with the same parameters.

kelliaao avatar Apr 21 '25 07:04 kelliaao

That's odd. Could you check whether the vllm process has become a zombie?

qinxuye avatar Apr 21 '25 07:04 qinxuye
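
One way to inspect that from inside the container, assuming psutil is installed (a diagnostic sketch, not part of Xinference):

```python
# List xinference/vllm processes with their scheduler state (running/sleeping/zombie).
import psutil

for p in psutil.process_iter(["pid", "status", "cmdline"]):
    cmd = " ".join(p.info["cmdline"] or [])
    if "vllm" in cmd or "xinference" in cmd:
        print(p.info["pid"], p.info["status"], cmd[:100])
```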

That's odd. Could you check whether the vllm process has become a zombie?

The process occupying the GPU and CPU is running at first, then changes to sleeping.

kelliaao avatar Apr 21 '25 12:04 kelliaao

That's odd. Could you check whether the vllm process has become a zombie?

@qinxuye Do you plan to look into this? It's fairly easy to trigger: it shows up whenever non-streaming requests are sent concurrently.

kelliaao avatar Apr 23 '25 02:04 kelliaao

That's odd. Could you check whether the vllm process has become a zombie?

@qinxuye Do you plan to look into this? It's fairly easy to trigger: it shows up whenever non-streaming requests are sent concurrently.

Before next week's release, then; we'll look into it.

qinxuye avatar Apr 23 '25 02:04 qinxuye

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] avatar May 01 '25 19:05 github-actions[bot]

@qinxuye Has this been reproduced? It doesn't seem to be fixed in the current latest version.

kelliaao avatar May 06 '25 03:05 kelliaao

@amumu96 Please help track this down.

qinxuye avatar May 06 '25 04:05 qinxuye

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] avatar May 13 '25 19:05 github-actions[bot]

This issue was closed because it has been inactive for 5 days since being marked as stale.

github-actions[bot] avatar May 19 '25 19:05 github-actions[bot]

This issue is stale because it has been open for 7 days with no activity.

github-actions[bot] avatar May 27 '25 19:05 github-actions[bot]

This issue was closed because it has been inactive for 5 days since being marked as stale.

github-actions[bot] avatar Jun 02 '25 19:06 github-actions[bot]

I couldn't reproduce the issue after deploying the model with the latest Xinference. Could you go into the container and pull the console output from when you ran the operations?

llyycchhee avatar Jun 09 '25 06:06 llyycchhee

I've had this problem since last December, launching VL models with Xinference's default parameters: as soon as requests are concurrent it quickly hangs, even with only one image per input. I've upgraded Xinference through several versions since then and the problem persists; concurrency still doesn't work. I'm currently on Xinference 1.5.1 and the issue is still there. Has any recent version improved this?

pyaaaa avatar Jun 18 '25 06:06 pyaaaa

Could you share the test script? I'll take a look.

qinxuye avatar Jun 18 '25 07:06 qinxuye