
Tensor parallel vllm fails on eGPU

Open kirel opened this issue 8 months ago • 13 comments

Describe the bug When I try to run the vLLM container (https://github.com/intel/ipex-llm/blob/main/docs/mddocs/DockerGuides/vllm_docker_quickstart.md) with tensor_parallel_size=2 on my dual-A770 system, where the second A770 is connected to the mainboard's M.2 slot, I get

> dmesg
[ 1470.497571] i915 0000:07:00.0: [drm] GPU HANG: ecode 12:10:85def5fa, in ray::WrapperWit [14631]
[ 1470.497577] i915 0000:07:00.0: [drm] ray::IDLE[14631] context reset due to GPU hang
[ 1470.612921] i915 0000:03:00.0: [drm] GPU HANG: ecode 12:10:85def5fa, in python [14382]
[ 1470.612925] i915 0000:03:00.0: [drm] python[14382] context reset due to GPU hang

and vLLM crashes with UR_RESULT_ERROR_DEVICE_LOST.

The individual GPUs work on their own when I set ONEAPI_DEVICE_SELECTOR and tensor_parallel_size=1.

Is this a driver problem? Should a second A770 connected as an M.2 eGPU work, or is this unsupported?
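For reference, the working single-GPU runs were pinned like this (the mapping of Level Zero index 0 to a particular card is an assumption; the index order may differ on other systems):

```shell
# Expose only one Arc A770 to the runtime, then run with tensor parallelism disabled
export ONEAPI_DEVICE_SELECTOR="level_zero:0"
export TENSOR_PARALLEL_SIZE=1
```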

How to reproduce Tough: you need my exotic setup.

Environment information

❯ sudo bash env-check.sh 
-----------------------------------------------------------------
PYTHON_VERSION=3.13.3
-----------------------------------------------------------------
Transformers is not installed. 
-----------------------------------------------------------------
PyTorch is not installed. 
-----------------------------------------------------------------
ipex-llm WARNING: Package(s) not found: ipex-llm
-----------------------------------------------------------------
IPEX is not installed. 
-----------------------------------------------------------------
CPU Information: 
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        46 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               12
On-line CPU(s) list:                  0-11
Vendor ID:                            GenuineIntel
BIOS Vendor ID:                       Intel(R) Corporation
Model name:                           12th Gen Intel(R) Core(TM) i5-12500
BIOS Model name:                      12th Gen Intel(R) Core(TM) i5-12500 To Be Filled By O.E.M. CPU @ 4.0GHz
BIOS CPU family:                      205
CPU family:                           6
Model:                                151
Thread(s) per core:                   2
Core(s) per socket:                   6
Socket(s):                            1
Stepping:                             5
-----------------------------------------------------------------
Total CPU Memory: 45.6102 GB
Memory Type: DDR5 
-----------------------------------------------------------------
Operating System: 
Ubuntu 25.04 \n \l

-----------------------------------------------------------------
Linux ailab 6.11.0-25-generic #25-Ubuntu SMP PREEMPT_DYNAMIC Fri Apr 11 23:29:18 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
    Version: 1.2.39.20241101
    Build ID: 00000000

Service:
    Version: 1.2.39.20241101
    Build ID: 00000000
    Level Zero Version: 1.20.2
-----------------------------------------------------------------
  Driver UUID                                     32352e30-392e-3332-3936-310000000000
  Driver Version                                  25.09.32961
  Driver UUID                                     32352e30-392e-3332-3936-310000000000
  Driver Version                                  25.09.32961
  Driver UUID                                     32352e30-392e-3332-3936-310000000000
  Driver Version                                  25.09.32961
-----------------------------------------------------------------
Driver related package version:
rc  intel-fw-gpu                                   2024.24.5-337~22.04                        all          Firmware package for Intel integrated and discrete GPUs
ii  intel-level-zero-gpu-raytracing                1.0.0-0ubuntu1~24.10~ppa4                  amd64        Level Zero Ray Tracing Support library
-----------------------------------------------------------------
env-check.sh: line 167: sycl-ls: command not found
igpu not detected
-----------------------------------------------------------------
xpu-smi is properly installed. 
-----------------------------------------------------------------
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) Arc(TM) A770 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0003-0000-000856a08086                                       |
|           | PCI BDF Address: 0000:03:00.0                                                        |
|           | DRM Device: /dev/dri/card1                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
| 1         | Device Name: Intel(R) Arc(TM) A770 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0007-0000-000856a08086                                       |
|           | PCI BDF Address: 0000:07:00.0                                                        |
|           | DRM Device: /dev/dri/card0                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
GPU0 Memory size=16G
GPU1 Memory size=16G
GPU2 Memory size=16G
-----------------------------------------------------------------
03:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Intel Corporation Device 1020
        Flags: bus master, fast devsel, latency 0, IRQ 188, IOMMU group 19
        Memory at 73000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 5000000000 (64-bit, prefetchable) [size=16G]
        Expansion ROM at 74000000 [disabled] [size=2M]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express Endpoint, IntMsgNum 0
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
--
07:00.0 VGA compatible controller: Intel Corporation DG2 [Arc A770] (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Device 172f:3937
        Flags: bus master, fast devsel, latency 0, IRQ 192, IOMMU group 24
        Memory at 71000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 4800000000 (64-bit, prefetchable) [size=16G]
        Expansion ROM at 72000000 [disabled] [size=2M]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express Endpoint, IntMsgNum 0
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
--
7a:00.0 VGA compatible controller: Intel Corporation Battlemage G21 [Arc B580] (prog-if 00 [VGA controller])
        Subsystem: Intel Corporation Device 1100
        Flags: bus master, fast devsel, latency 0, IRQ 196, IOMMU group 39
        Memory at 6f000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 4000000000 (64-bit, prefetchable) [size=16G]
        Expansion ROM at 70000000 [disabled] [size=2M]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express Endpoint, IntMsgNum 0
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
-----------------------------------------------------------------

kirel avatar May 03 '25 20:05 kirel

vLLM freaks out when used with tensor parallel in any of the newer Docker images past b12. I also had this problem. To be honest, I ended up testing the llama.cpp portable build from the 2.3.0-nightly and it blew my socks off. The Qwen3-30B-A3B Q4 GGUF from Unsloth was running at 45 tokens per second on smaller-context queries and 30 tokens per second on larger ones. That's with the model + context spread across 3 GPUs, but you could easily run it on 2 given its size. Raising the context above 22528 does result in an error, but I've lodged that as a bug.

HumerousGorgon avatar May 05 '25 08:05 HumerousGorgon

Do you run the portable "bare metal" or in a docker container?

kirel avatar May 05 '25 14:05 kirel

I've been running it bare metal and it's been going well. I did look at getting it working in a container but was getting shader dumps all over the place. One thing I will say is that the nightly releases are kinda slow to land on GitHub, so I'm working on a Docker image with the right dependencies installed to compile the modules myself.

HumerousGorgon avatar May 05 '25 14:05 HumerousGorgon

Hi, I wonder if you have installed the Intel out-of-tree driver?

https://dgpu-docs.intel.com/driver/kernel-driver-types.html#out-of-tree-drivers

Also, can you post your start script for the vLLM service? I saw you have set some environment variables.

gc-fu avatar May 06 '25 02:05 gc-fu

I did install that, but I have since upgraded to Plucky (Ubuntu 25.04). That's probably a problem. How do I check whether I'm using the out-of-tree driver?

My startup script (slightly extended from the one provided):

#!/bin/bash
MODEL_PATH=${MODEL_PATH:-"default_model_path"}
SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"default_model_name"}
TENSOR_PARALLEL_SIZE=${TENSOR_PARALLEL_SIZE:-1}

MAX_NUM_SEQS=${MAX_NUM_SEQS:-256}
MAX_NUM_BATCHED_TOKENS=${MAX_NUM_BATCHED_TOKENS:-3000}
MAX_MODEL_LEN=${MAX_MODEL_LEN:-2000}
LOAD_IN_LOW_BIT=${LOAD_IN_LOW_BIT:-"fp8"}
PORT=${PORT:-8000}

VLLM_QUANTIZATION=${VLLM_QUANTIZATION:-""} # Default to empty (no -q argument)
CACHE_DTYPE=${CACHE_DTYPE:-""} # Default to empty (no --kv-cache-dtype argument)
DOWNLOAD_DIR=${DOWNLOAD_DIR:-"/llm/models"} # Default download directory
PREFIX_CACHING=${PREFIX_CACHING:-"0"} # Default to 0 (disabled)

echo "Starting service with model: $MODEL_PATH"
echo "Served model name: $SERVED_MODEL_NAME"
echo "Tensor parallel size: $TENSOR_PARALLEL_SIZE"
echo "Max num sequences: $MAX_NUM_SEQS"
echo "Max num batched tokens: $MAX_NUM_BATCHED_TOKENS"
echo "Max model length: $MAX_MODEL_LEN"
echo "Load in low bit: $LOAD_IN_LOW_BIT"
echo "Port: $PORT"
if [[ -n "$VLLM_QUANTIZATION" ]]; then
  echo "Quantization method: $VLLM_QUANTIZATION"
else
  echo "Quantization method: Not specified (default)"
fi
if [[ -n "$CACHE_DTYPE" ]]; then
  echo "KV Cache DType: $CACHE_DTYPE"
else
  echo "KV Cache DType: Not specified (default)"
fi
echo "Download directory: $DOWNLOAD_DIR"
if [[ "$PREFIX_CACHING" == "1" ]]; then
  echo "Prefix Caching: Enabled"
else
  echo "Prefix Caching: Disabled"
fi

export CCL_WORKER_COUNT=2
export SYCL_CACHE_PERSISTENT=1
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1

export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0

export CCL_SAME_STREAM=1
export CCL_BLOCKING_WAIT=0

export VLLM_USE_V1=0
export IPEX_LLM_LOWBIT=$LOAD_IN_LOW_BIT

source /opt/intel/1ccl-wks/setvars.sh

# Build the command arguments dynamically
CMD_ARGS=(
  --served-model-name "$SERVED_MODEL_NAME"
  --port "$PORT"
  --model "$MODEL_PATH"
  --trust-remote-code
  --block-size 8
  --gpu-memory-utilization 0.95
  --device xpu
  --dtype float16
  --enforce-eager
  --load-in-low-bit "$LOAD_IN_LOW_BIT"
  --max-model-len "$MAX_MODEL_LEN"
  --max-num-batched-tokens "$MAX_NUM_BATCHED_TOKENS"
  --max-num-seqs "$MAX_NUM_SEQS"
  --tensor-parallel-size "$TENSOR_PARALLEL_SIZE"
  --disable-async-output-proc
  --distributed-executor-backend ray
  --download-dir "$DOWNLOAD_DIR"
)

# Conditionally add the quantization argument if VLLM_QUANTIZATION is set and not empty
if [[ -n "$VLLM_QUANTIZATION" ]]; then
  CMD_ARGS+=(-q "$VLLM_QUANTIZATION")
fi

# Conditionally add the kv cache dtype argument if CACHE_DTYPE is set and not empty
if [[ -n "$CACHE_DTYPE" ]]; then
  CMD_ARGS+=(--kv-cache-dtype "$CACHE_DTYPE")
fi

# Conditionally add the prefix caching argument if PREFIX_CACHING is set to 1
if [[ "$PREFIX_CACHING" == "1" ]]; then
  CMD_ARGS+=(--enable-prefix-caching)
fi

# Execute the command
python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server "${CMD_ARGS[@]}"
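A typical invocation of the script above might look like this (the script filename, model path, and values are illustrative, not from the original report):

```shell
MODEL_PATH=/llm/models/Qwen2.5-7B-Instruct \
SERVED_MODEL_NAME=qwen2.5-7b \
TENSOR_PARALLEL_SIZE=2 \
MAX_MODEL_LEN=4096 \
bash start-vllm-service.sh
```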

kirel avatar May 06 '25 07:05 kirel

Hm, interesting. In an older issue I saw the recommendation to comment out

#source /opt/intel/1ccl-wks/setvars.sh

and now it works.

Image

Pretty low CPU utilization but I'd guess that's a bottleneck from the Gen3 x4 interface of the 2nd Arc.

kirel avatar May 06 '25 11:05 kirel

> Hm, interesting. In an older issue I saw the recommendation to comment out
>
> #source /opt/intel/1ccl-wks/setvars.sh
>
> and now it works.
>
> Image
>
> Pretty low CPU utilization but I'd guess that's a bottleneck from the Gen3 x4 interface of the 2nd Arc.

I wouldn't blame the bus speed so fast; it seems there are Qwen3 MoE-specific optimizations still to be done, like the ones that were added to llama.cpp. Someone correct me if I'm wrong, but my guess is that only one "expert" is activated on each card instead of the intended eight, which is why it is so slow. On a properly supported backend, this model is astonishingly fast.
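For readers unfamiliar with the MoE point above, here is a minimal sketch of top-k expert routing (illustrative only, not Qwen3's or vLLM's actual code; Qwen3-30B-A3B routes each token to 8 of its 128 experts, so activating only one expert would leave most of the active capacity unused):

```python
import numpy as np

def topk_experts(router_logits, k=8):
    """Return the indices of the k highest-scoring experts for each token."""
    # argsort is ascending, so the last k columns are the top-k experts
    return np.argsort(router_logits, axis=-1)[:, -k:]

# One token, four hypothetical experts; route to the best two.
logits = np.array([[0.1, 2.0, 0.5, 3.0]])
print(topk_experts(logits, k=2))  # -> [[1 3]]
```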

fradav avatar May 06 '25 14:05 fradav

Hi, this problem is caused by an incompatibility between our optimized oneCCL and your CPU platform.

We will later release a new version that will fix this problem.

Currently, you can try the image intelanalytics/ipex-llm-serving-xpu:b12-usm, or comment out the source /opt/intel/1ccl-wks/setvars.sh line.
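Until the fixed release lands, that workaround can be scripted; this is a sketch (the start-script path inside the container is an assumption based on this thread):

```shell
#!/bin/bash
# Comment out the optimized oneCCL environment setup in a start script,
# so the service falls back to the stock oneCCL (workaround from this thread).
disable_oneccl_setvars() {
  # Prefix the matching setvars line with '#' in place, if present.
  sed -i 's|^source /opt/intel/1ccl-wks/setvars.sh|#&|' "$1"
}

# Example usage (path is an assumption):
#   disable_oneccl_setvars /llm/start-vllm-service.sh
```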

gc-fu avatar May 07 '25 01:05 gc-fu

What's this image: intelanalytics/multi-arc-serving:0.2.0-b1?

kirel avatar May 09 '25 06:05 kirel

> We will later release a new version that will fix this problem

Hello. What version of "ipex-llm-serving-xpu" is currently the best fit for Xeon v3/v4 and Arc A770 x4?

savvadesogle avatar Jun 02 '25 21:06 savvadesogle

Hi, you can try this one: intelanalytics/ipex-llm-serving-xpu:0.8.3-b20

gc-fu avatar Jun 03 '25 00:06 gc-fu

> Hi, you can try this one: intelanalytics/ipex-llm-serving-xpu:0.8.3-b20

Can you give me advice on what parameters I should use to run vLLM with this topology? Xeon 2699 v3 (x2) with Cluster-on-Die (COD) enabled in BIOS, and A770 (x4, PCIe Gen3 x16).

sudo xpu-smi diag -d 0-4 --singletest 5 reports a bandwidth of ~9.909 GB/s.

Image

savvadesogle avatar Jun 03 '25 18:06 savvadesogle

Please try setting export CCL_DG2_USM=1.

gc-fu avatar Jun 04 '25 01:06 gc-fu