Tensor parallel vLLM fails on eGPU
Describe the bug
When I try to run the vLLM container https://github.com/intel/ipex-llm/blob/main/docs/mddocs/DockerGuides/vllm_docker_quickstart.md with tensor_parallel_size=2 on my 2x A770 system, where the second A770 is connected via the M.2 slot of the mainboard, I get:
> dmesg
[ 1470.497571] i915 0000:07:00.0: [drm] GPU HANG: ecode 12:10:85def5fa, in ray::WrapperWit [14631]
[ 1470.497577] i915 0000:07:00.0: [drm] ray::IDLE[14631] context reset due to GPU hang
[ 1470.612921] i915 0000:03:00.0: [drm] GPU HANG: ecode 12:10:85def5fa, in python [14382]
[ 1470.612925] i915 0000:03:00.0: [drm] python[14382] context reset due to GPU hang
and vLLM crashes with UR_RESULT_ERROR_DEVICE_LOST.
The individual GPUs work when I set ONEAPI_DEVICE_SELECTOR and tensor_parallel_size=1.
Is this a driver problem? Should a second A770 connected as an M.2 eGPU work, or is this unsupported?
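For reference, the working single-GPU runs look roughly like this (a sketch; the device indices are just how Level Zero enumerates the cards here, and the script name is a stand-in for however the service is started):

```bash
# Pin the run to a single A770 before starting the vLLM service
export ONEAPI_DEVICE_SELECTOR="level_zero:0"   # first A770; use level_zero:1 for the M.2 eGPU
export TENSOR_PARALLEL_SIZE=1                  # single card, no tensor parallelism
bash start-vllm-service.sh                     # placeholder name for the start script shown further down
```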
How to reproduce
Tough - you need my exotic setup.
Environment information
❯ sudo bash env-check.sh
-----------------------------------------------------------------
PYTHON_VERSION=3.13.3
-----------------------------------------------------------------
Transformers is not installed.
-----------------------------------------------------------------
PyTorch is not installed.
-----------------------------------------------------------------
ipex-llm WARNING: Package(s) not found: ipex-llm
-----------------------------------------------------------------
IPEX is not installed.
-----------------------------------------------------------------
CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 46 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Vendor ID: GenuineIntel
BIOS Vendor ID: Intel(R) Corporation
Model name: 12th Gen Intel(R) Core(TM) i5-12500
BIOS Model name: 12th Gen Intel(R) Core(TM) i5-12500 To Be Filled By O.E.M. CPU @ 4.0GHz
BIOS CPU family: 205
CPU family: 6
Model: 151
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
Stepping: 5
-----------------------------------------------------------------
Total CPU Memory: 45.6102 GB
Memory Type: DDR5
-----------------------------------------------------------------
Operating System:
Ubuntu 25.04 \n \l
-----------------------------------------------------------------
Linux ailab 6.11.0-25-generic #25-Ubuntu SMP PREEMPT_DYNAMIC Fri Apr 11 23:29:18 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
Version: 1.2.39.20241101
Build ID: 00000000
Service:
Version: 1.2.39.20241101
Build ID: 00000000
Level Zero Version: 1.20.2
-----------------------------------------------------------------
Driver UUID 32352e30-392e-3332-3936-310000000000
Driver Version 25.09.32961
Driver UUID 32352e30-392e-3332-3936-310000000000
Driver Version 25.09.32961
Driver UUID 32352e30-392e-3332-3936-310000000000
Driver Version 25.09.32961
-----------------------------------------------------------------
Driver related package version:
rc intel-fw-gpu 2024.24.5-337~22.04 all Firmware package for Intel integrated and discrete GPUs
ii intel-level-zero-gpu-raytracing 1.0.0-0ubuntu1~24.10~ppa4 amd64 Level Zero Ray Tracing Support library
-----------------------------------------------------------------
env-check.sh: line 167: sycl-ls: command not found
igpu not detected
-----------------------------------------------------------------
xpu-smi is properly installed.
-----------------------------------------------------------------
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information |
+-----------+--------------------------------------------------------------------------------------+
| 0 | Device Name: Intel(R) Arc(TM) A770 Graphics |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-0003-0000-000856a08086 |
| | PCI BDF Address: 0000:03:00.0 |
| | DRM Device: /dev/dri/card1 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 1 | Device Name: Intel(R) Arc(TM) A770 Graphics |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-0007-0000-000856a08086 |
| | PCI BDF Address: 0000:07:00.0 |
| | DRM Device: /dev/dri/card0 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
GPU0 Memory size=16G
GPU1 Memory size=16G
GPU2 Memory size=16G
-----------------------------------------------------------------
03:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
Subsystem: Intel Corporation Device 1020
Flags: bus master, fast devsel, latency 0, IRQ 188, IOMMU group 19
Memory at 73000000 (64-bit, non-prefetchable) [size=16M]
Memory at 5000000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at 74000000 [disabled] [size=2M]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Endpoint, IntMsgNum 0
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
--
07:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
Subsystem: Device 172f:3937
Flags: bus master, fast devsel, latency 0, IRQ 192, IOMMU group 24
Memory at 71000000 (64-bit, non-prefetchable) [size=16M]
Memory at 4800000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at 72000000 [disabled] [size=2M]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Endpoint, IntMsgNum 0
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
--
7a:00.0 VGA compatible controller: Intel Corporation Battlemage G21 [Arc B580] (prog-if 00 [VGA controller])
Subsystem: Intel Corporation Device 1100
Flags: bus master, fast devsel, latency 0, IRQ 196, IOMMU group 39
Memory at 6f000000 (64-bit, non-prefetchable) [size=16M]
Memory at 4000000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at 70000000 [disabled] [size=2M]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Endpoint, IntMsgNum 0
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
-----------------------------------------------------------------
vLLM freaks out if used with tensor parallel in any of the newer Docker images past b12. I also had this problem. To be honest, I ended up testing the llama.cpp portable build from the 2.3.0-nightly and it blew my socks off. The Qwen3-30B-A3B Q4 GGUF from Unsloth was running at 45 tokens per second on smaller-context queries and at 30 tokens per second on larger ones. That's with the model + context spread across 3 GPUs, but you could easily run it on 2 given the size it takes. Trying to raise the context above 22528 does result in an error, but I've lodged that as a bug.
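Roughly how I launch it, for context (a sketch from memory; the model filename and exact values are illustrative, not copied from my setup):

```bash
# llama.cpp SYCL portable build; all layers offloaded and split layer-wise across the Arc cards.
# Model filename is illustrative; raising the context above 22528 currently errors out for me.
./llama-server \
  -m Qwen3-30B-A3B-Q4_K_M.gguf \
  -ngl 99 \
  -c 22528 \
  --split-mode layer \
  --port 8080
```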
Do you run the portable build "bare metal" or in a Docker container?
I've been running it bare metal and it's been going well. I did look at getting it working in a container but was getting shader dumps all over the place. One thing I will say is that the nightly releases are kind of slow to land on GitHub, so I'm working on a Docker image with the right tooling installed to compile the modules myself.
Hi, I wonder if you have installed the Intel out-of-tree driver?
https://dgpu-docs.intel.com/driver/kernel-driver-types.html#out-of-tree-drivers
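A quick way to check which variant is loaded (a general sketch, not specific to this system; an out-of-tree DKMS build usually shows up under updates/dkms):

```bash
# Where does the loaded i915 module come from?
# In-tree builds live under kernel/drivers/gpu/..., DKMS/out-of-tree builds under updates/dkms/
modinfo -n i915
# List DKMS-managed modules; an out-of-tree Intel GPU driver package would appear here
dkms status
```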
Also, can you post your start script for the vLLM service? I saw you have set some environment variables.
I did install that, but I have since updated to Plucky (Ubuntu 25.04). That's probably the problem. How do I check whether I'm using the out-of-tree driver?
My startup script (slightly extended from the one provided):
#!/bin/bash
MODEL_PATH=${MODEL_PATH:-"default_model_path"}
SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"default_model_name"}
TENSOR_PARALLEL_SIZE=${TENSOR_PARALLEL_SIZE:-1}
MAX_NUM_SEQS=${MAX_NUM_SEQS:-256}
MAX_NUM_BATCHED_TOKENS=${MAX_NUM_BATCHED_TOKENS:-3000}
MAX_MODEL_LEN=${MAX_MODEL_LEN:-2000}
LOAD_IN_LOW_BIT=${LOAD_IN_LOW_BIT:-"fp8"}
PORT=${PORT:-8000}
VLLM_QUANTIZATION=${VLLM_QUANTIZATION:-""} # Default to empty (no -q argument)
CACHE_DTYPE=${CACHE_DTYPE:-""} # Default to empty (no --kv-cache-dtype argument)
DOWNLOAD_DIR=${DOWNLOAD_DIR:-"/llm/models"} # Default download directory
PREFIX_CACHING=${PREFIX_CACHING:-"0"} # Default to 0 (disabled)
echo "Starting service with model: $MODEL_PATH"
echo "Served model name: $SERVED_MODEL_NAME"
echo "Tensor parallel size: $TENSOR_PARALLEL_SIZE"
echo "Max num sequences: $MAX_NUM_SEQS"
echo "Max num batched tokens: $MAX_NUM_BATCHED_TOKENS"
echo "Max model length: $MAX_MODEL_LEN"
echo "Load in low bit: $LOAD_IN_LOW_BIT"
echo "Port: $PORT"
if [[ -n "$VLLM_QUANTIZATION" ]]; then
echo "Quantization method: $VLLM_QUANTIZATION"
else
echo "Quantization method: Not specified (default)"
fi
if [[ -n "$CACHE_DTYPE" ]]; then
echo "KV Cache DType: $CACHE_DTYPE"
else
echo "KV Cache DType: Not specified (default)"
fi
echo "Download directory: $DOWNLOAD_DIR"
if [[ "$PREFIX_CACHING" == "1" ]]; then
echo "Prefix Caching: Enabled"
else
echo "Prefix Caching: Disabled"
fi
export CCL_WORKER_COUNT=2
export SYCL_CACHE_PERSISTENT=1
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0
export CCL_SAME_STREAM=1
export CCL_BLOCKING_WAIT=0
export VLLM_USE_V1=0
export IPEX_LLM_LOWBIT=$LOAD_IN_LOW_BIT
source /opt/intel/1ccl-wks/setvars.sh
# Build the command arguments dynamically
CMD_ARGS=(
--served-model-name "$SERVED_MODEL_NAME"
--port "$PORT"
--model "$MODEL_PATH"
--trust-remote-code
--block-size 8
--gpu-memory-utilization 0.95
--device xpu
--dtype float16
--enforce-eager
--load-in-low-bit "$LOAD_IN_LOW_BIT"
--max-model-len "$MAX_MODEL_LEN"
--max-num-batched-tokens "$MAX_NUM_BATCHED_TOKENS"
--max-num-seqs "$MAX_NUM_SEQS"
--tensor-parallel-size "$TENSOR_PARALLEL_SIZE"
--disable-async-output-proc
--distributed-executor-backend ray
--download-dir "$DOWNLOAD_DIR"
)
# Conditionally add the quantization argument if VLLM_QUANTIZATION is set and not empty
if [[ -n "$VLLM_QUANTIZATION" ]]; then
CMD_ARGS+=(-q "$VLLM_QUANTIZATION")
fi
# Conditionally add the kv cache dtype argument if CACHE_DTYPE is set and not empty
if [[ -n "$CACHE_DTYPE" ]]; then
CMD_ARGS+=(--kv-cache-dtype "$CACHE_DTYPE")
fi
# Conditionally add the prefix caching argument if PREFIX_CACHING is set to 1
if [[ "$PREFIX_CACHING" == "1" ]]; then
CMD_ARGS+=(--enable-prefix-caching)
fi
# Execute the command
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server "${CMD_ARGS[@]}"
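For completeness, this is roughly how I invoke it (the model path and script location are placeholders, not copied from my setup):

```bash
# Hypothetical invocation: environment variables override the defaults at the top of the script
MODEL_PATH=/llm/models/Qwen2.5-7B-Instruct \
SERVED_MODEL_NAME=qwen \
TENSOR_PARALLEL_SIZE=2 \
MAX_MODEL_LEN=4096 \
LOAD_IN_LOW_BIT=fp8 \
bash /llm/start-vllm-service.sh
```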
Hm, interesting. In an older issue I saw the recommendation to comment out
#source /opt/intel/1ccl-wks/setvars.sh
and now it works.
Pretty low CPU utilization but I'd guess that's a bottleneck from the Gen3 x4 interface of the 2nd Arc.
> Hm, interesting. In an older issue I saw the recommendation to comment out source /opt/intel/1ccl-wks/setvars.sh, and now it works.
> Pretty low CPU utilization, but I'd guess that's a bottleneck from the Gen3 x4 interface of the 2nd Arc.
I wouldn't blame the bus speed so quickly; it seems there are specific optimizations for the Qwen3 MoE architecture still to be done, like the ones that were added to llama.cpp. Someone correct me if I'm wrong, but my guess is that only one "expert" is activated on each card instead of the intended eight, which is why it is so slow. On a properly supported backend, this model is astonishingly fast.
Hi, this problem is caused by an incompatibility between our optimized oneCCL and the CPU platform.
We will later release a new version that will fix this problem.
For now, you can try the image intelanalytics/ipex-llm-serving-xpu:b12-usm, or comment out source /opt/intel/1ccl-wks/setvars.sh in the start script.
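As a sketch of the second workaround, applied to the start script posted above:

```bash
# Workaround: skip the optimized oneCCL environment setup in the start script
# source /opt/intel/1ccl-wks/setvars.sh

# Or switch to the suggested image tag (other run options as in the quickstart guide):
docker pull intelanalytics/ipex-llm-serving-xpu:b12-usm
```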
What's this image: intelanalytics/multi-arc-serving:0.2.0-b1?

> We will later release a new version that will fix this problem
Hello. Which version of "ipex-llm-serving-xpu" is currently the best fit for Xeon v3/v4 and Arc A770 x4?
Hi, you can try this one: intelanalytics/ipex-llm-serving-xpu:0.8.3-b20
> Hi, you can try this one: intelanalytics/ipex-llm-serving-xpu:0.8.3-b20
Can you give me advice on what parameters I should use to run vLLM with this topology? Xeon 2699 v3 (x2), Cluster-On-Die (COD) enabled in BIOS, A770 (x4, PCIe Gen3 x16).
sudo xpu-smi diag -d 0-4 --singletest 5 reports a bandwidth of ~9.909 GBPS.
Please try setting export CCL_DG2_USM=1.
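For example, set next to the other CCL exports in the start script above (a sketch; not a verified configuration for this exact topology):

```bash
# Suggested setting for Arc (DG2) cards in this setup
export CCL_DG2_USM=1
```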