
[Bug]: using qwen-8B , LLVM ERROR: Failed to compute parent layout for slice layout

suwenzhuo opened this issue 7 months ago

Your current environment

(The output of `python collect_env.py` was not provided.)

🐛 Describe the bug

vLLM 0.8.5

vllm serve /root/model/Qwen3-8B --dtype half --port 8075 --gpu-memory-utilization 0.8

INFO: 115.239.217.175:36366 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 04-30 14:30:44 [engine.py:310] Added request chatcmpl-dbc9987ce4734f5b8321adfdb5ae22b7.
LLVM ERROR: Failed to compute parent layout for slice layout.
ERROR 04-30 14:30:50 [client.py:305] RuntimeError('Engine process (pid 2837204) died.')

Before submitting a new issue...

  • [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

suwenzhuo · Apr 30 '25

Have you tried v0?

VLLM_USE_V1=0 vllm serve ....

jeejeelee · Apr 30 '25

Similar issue: https://github.com/vllm-project/vllm/issues/17392

jeejeelee · Apr 30 '25

VLLM_USE_V1=0 is not working

docker run --gpus all -e VLLM_USE_V1=0 -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 --ipc=host vllm/vllm-openai:v0.8.5 --model Qwen/Qwen3-30B-A3B \
  --tensor-parallel-size 4 --dtype=half --enable-reasoning --reasoning-parser deepseek_r1 \
  --max-model-len 32768 --enforce-eager --no-enable-chunked-prefill --max-model-len 16384

ShuhaoYuan · May 01 '25

Try --dtype float32?
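If you are driving vLLM from Python rather than the CLI, the same override would look roughly like this (a minimal sketch, reusing the model path from the original report; not verified on a V100):

```python
from vllm import LLM

# Load the model in full precision instead of half. On V100-class GPUs
# (which lack bfloat16 support) the dtype choice can matter, so float32
# is a plausible thing to try. Model path taken from the original report.
llm = LLM(model="/root/model/Qwen3-8B", dtype="float32")
```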

houseroad · May 02 '25

I solved this by using --no-enable-chunked-prefill.

Example: vllm serve /root/model/Qwen3-8B --dtype half --port 8075 --gpu-memory-utilization 0.8 --no-enable-chunked-prefill --max-model-len 8000

A possible reason is that the V100 does not support chunked prefill.
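For anyone using the offline Python API instead of `vllm serve`, the equivalent settings would be roughly as follows (a minimal sketch mirroring the flags above; parameter names follow vLLM's EngineArgs, not verified on a V100):

```python
from vllm import LLM

# Mirror the working serve command: half precision, 80% of GPU memory,
# chunked prefill disabled, context length capped at 8000 tokens.
llm = LLM(
    model="/root/model/Qwen3-8B",
    dtype="half",
    gpu_memory_utilization=0.8,
    enable_chunked_prefill=False,
    max_model_len=8000,
)
```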

suwenzhuo · May 06 '25