vllm
[Bug]: using Qwen3-8B, LLVM ERROR: Failed to compute parent layout for slice layout
Your current environment
The output of `python collect_env.py`
Your output of `python collect_env.py` here
🐛 Describe the bug
vLLM 0.8.5
vllm serve /root/model/Qwen3-8B --dtype half --port 8075 --gpu-memory-utilization 0.8
INFO: 115.239.217.175:36366 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO 04-30 14:30:44 [engine.py:310] Added request chatcmpl-dbc9987ce4734f5b8321adfdb5ae22b7.
LLVM ERROR: Failed to compute parent layout for slice layout.
ERROR 04-30 14:30:50 [client.py:305] RuntimeError('Engine process (pid 2837204) died.')
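For context, the crash happens on an ordinary chat-completions request to the OpenAI-compatible endpoint. A minimal sketch of such a request against the server started above (the prompt and parameters here are placeholders, not the exact request from the log):

```bash
# Hypothetical reproduction request; the actual prompt/content is not known.
curl http://localhost:8075/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/root/model/Qwen3-8B",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 64
      }'
```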
Before submitting a new issue...
- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
Have you tried v0?
VLLM_USE_V1=0 vllm serve ....
Similar issue: https://github.com/vllm-project/vllm/issues/17392
VLLM_USE_V1=0 does not work for me either:
docker run --gpus all -e VLLM_USE_V1=0 -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 --ipc=host vllm/vllm-openai:v0.8.5 \
  --model Qwen/Qwen3-30B-A3B --tensor-parallel-size 4 --dtype=half \
  --enable-reasoning --reasoning-parser deepseek_r1 --max-model-len 32768 \
  --enforce-eager --no-enable-chunked-prefill --max-model-len 16384
Try --dtype float32?
I solved this by using --no-enable-chunked-prefill.
Example: vllm serve /root/model/Qwen3-8B --dtype half --port 8075 --gpu-memory-utilization 0.8 --no-enable-chunked-prefill --max-model-len 8000
A possible reason is that the V100 does not support chunked prefill.
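If you want to confirm which GPU architecture you are on (the V100 is compute capability 7.0), a quick check, assuming a reasonably recent driver that supports the compute_cap query field:

```bash
# Print GPU name and compute capability; a V100 reports 7.0.
nvidia-smi --query-gpu=name,compute_cap --format=csv
```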