4x Arc A770 on Xeon w5-3423: slow performance (half of what is reported here)
Describe the bug
I observe quite bad performance with my 4x Arc A770 setup: Xeon w5-3423, 128 GB DDR5, and an ASUS W790 ACE board. Main question: I suppose this setup is quite close to yours, so what am I missing? What should be enabled in the BIOS? Could you share a bit more on how you configured your system?
This is getting really frustrating; I'm almost on the verge of selling it all. (Before this I tried with an EPYC 7282 and got more or less the same numbers.)
The GPUs are connected to PCIe 5.0 slots (which does not really matter, as they are PCIe 4.0 cards anyway). In theory they get good speed (all of them around 18 GB/s, so inference should not be a problem):
sudo xpu-smi diag -d 0 --singletest 5
+------------------+-------------------------------------------------------------------------------+
| Device ID | 0 |
+------------------+-------------------------------------------------------------------------------+
| Integration PCIe | Result: Fail |
| | Message: Fail to check PCIe bandwidth. Its bandwidth is 17.995 GBPS. |
| | Unconfigured or invalid threshold. Fail on copy engine group 1. |
+------------------+-------------------------------------------------------------------------------+
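The negotiated link can also be cross-checked directly with lspci; a sketch, using the BDF addresses from the lspci output further down (adjust for your system):

```shell
# Read the negotiated PCIe link (speed/width) for each A770.
# The A770 is a PCIe 4.0 x16 device, so LnkSta should report
# "Speed 16GT/s, Width x16"; a downgraded link (8GT/s or x8)
# would show up here and would explain lower copy bandwidth.
for bdf in 18:00.0 36:00.0 54:00.0 72:00.0; do
  echo "== $bdf =="
  sudo lspci -s "$bdf" -vv | grep -E 'LnkCap:|LnkSta:'
done
```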
I'm on Ubuntu 22.04 and installed everything as in your tutorial: out-of-tree driver, same kernel, etc. After quite some testing I even compiled the out-of-tree driver myself from the backports repo. Same results.
I locked the frequencies:
sudo xpu-smi config -d 0 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 1 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 2 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 3 -t 0 --frequencyrange 2400,2400
sudo cpupower frequency-set -g performance
sudo cpupower frequency-set -d 4.0GHz
and still saw only a slight improvement.
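For what it's worth, the lock can be written as a loop and verified after the fact; a sketch, assuming `xpu-smi stats` reports the current GPU frequency on this build:

```shell
# Lock all four A770s to 2400 MHz in one loop, then read back
# device 0's stats to confirm the frequency cap actually applied.
for d in 0 1 2 3; do
  sudo xpu-smi config -d "$d" -t 0 --frequencyrange 2400,2400
done
sudo xpu-smi stats -d 0 | grep -i 'frequency'
```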
Tests with model=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B on 4x Arc, sym_int4.
Whatever I do, I get roughly half of what you are getting:
============ Serving Benchmark Result ============
Successful requests: 1
Benchmark duration (s): 35.40
Total input tokens: 1024
Total generated tokens: 512
Request throughput (req/s): 0.03
Output token throughput (tok/s): 14.46
Total Token throughput (tok/s): 43.39
---------------Time to First Token----------------
Mean TTFT (ms): 1332.15
Median TTFT (ms): 1332.15
P99 TTFT (ms): 1332.15
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 66.67
Median TPOT (ms): 66.67
P99 TPOT (ms): 66.67
---------------Inter-token Latency----------------
Mean ITL (ms): 66.67
Median ITL (ms): 66.36
P99 ITL (ms): 74.45
==================================================
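As a sanity check, the numbers in the table are internally consistent; a quick awk recomputation (values copied from the result above):

```shell
# Recompute throughput from the raw benchmark figures above.
awk 'BEGIN {
  gen = 512; dur = 35.40; ttft = 1.33215; tpot_ms = 66.67
  printf "output tok/s from totals : %.2f\n", gen / dur           # matches 14.46
  printf "decode tok/s (excl TTFT) : %.2f\n", (gen - 1) / (dur - ttft)
  printf "decode tok/s from TPOT   : %.2f\n", 1000 / tpot_ms      # ~15 t/s
}'
```

So the ~14.5 t/s headline is essentially the steady-state decode rate (TPOT 66.67 ms ≈ 15 t/s); for a single request of this size, TTFT contributes very little, meaning the shortfall is in decode speed, not prefill.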
Also, my GPUs don't go above 110 W for a single request (with 32 concurrent requests the power draw rises a lot, but not for a single request; the picture is from FP8, though).
Picture attached
My tests were performed with a custom Docker image based on intelanalytics/ipex-llm-serving-xpu:latest (built today). But I tried a lot of other builds and the performance does not really differ.
One observation: CCL_WORKER_COUNT=1 is the fastest.
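A hedged sketch of how such a sweep could be scripted: this assumes the compose file is changed to forward the host variable instead of hardcoding it (e.g. `CCL_WORKER_COUNT=${CCL_WORKER_COUNT:-1}` in the environment section), and the benchmark step is a placeholder.

```shell
# Sweep CCL_WORKER_COUNT and benchmark each setting.
# Requires the compose file to pass the variable through rather
# than hardcoding "CCL_WORKER_COUNT=1".
for n in 1 2 4 8; do
  echo "=== CCL_WORKER_COUNT=$n ==="
  CCL_WORKER_COUNT="$n" docker compose up -d vllm-ipex
  sleep 120                  # wait for the model to load
  # ...run your serving benchmark here and record tok/s...
  docker compose down
done
```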
FROM intelanalytics/ipex-llm-serving-xpu:latest
WORKDIR /temp
SHELL ["/bin/bash", "-c"]
WORKDIR /llm
RUN . /opt/intel/1ccl-wks/setvars.sh
ENTRYPOINT numactl -C 0-11 python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
--served-model-name ${served_model_name} \
--quantization ${quantization} \
--model $model \
--port $port \
--trust-remote-code \
--block-size ${block_size} \
--gpu-memory-utilization ${gpu_memory_utilization} \
--device xpu \
--dtype $dtype \
--enforce-eager \
--load-in-low-bit ${load_in_low_bit} \
--max-model-len ${max_model_len} \
--max-num-batched-tokens ${max_num_batched_tokens} \
--max-num-seqs ${max_num_seqs} \
--tensor-parallel-size ${tensor_parallel_size} \
--pipeline-parallel-size ${pipeline_parallel_size} \
--disable-async-output-proc \
--distributed-executor-backend ray
services:
vllm-ipex:
image: intelanalytics/ipex-llm-serving-xpu-custom:latest
container_name: vllm-ipex
build:
dockerfile: ./dockerfile/dockerfile
volumes:
- "/models/huggingface:/root/.cache/huggingface"
- /cloud/custom/ipex-llm/:/ipex-llm
- /etc/timezone:/etc/timezone:ro
- /etc/localtime:/etc/localtime:ro
devices:
- /dev/dri:/dev/dri
privileged: true
tty: true
ports:
- 8000:8000
shm_size: "32g"
environment:
# Model selection
- quantization=None
- model=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
- served_model_name=DeepSeek-R1-Distill-Qwen-32B
# Timezone and device
- TZ=Europe/Berlin
- DEVICE=Arc
# Intel/OneAPI/CCL/SYCL
- SYCL_CACHE_PERSISTENT=1
- CCL_WORKER_COUNT=1
- FI_PROVIDER=shm
- CCL_ATL_TRANSPORT=ofi
- CCL_ZE_IPC_EXCHANGE=sockets
- CCL_ATL_SHM=1
- USE_XETLA=OFF
- SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
- TORCH_LLM_ALLREDUCE=0
- VLLM_USE_V1=0
- CCL_SAME_STREAM=1
- CCL_BLOCKING_WAIT=0
- IPEX_LLM_LOWBIT=fp8
# vLLM/Serving
- port=8000
- gpu_memory_utilization=0.9
- dtype=float16
- block_size=8
- load_in_low_bit=sym_int4
- max_model_len=9000
- max_num_batched_tokens=9000
- max_num_seqs=32
- tensor_parallel_size=4
- pipeline_parallel_size=1
- enforce_eager=true
restart: unless-stopped
logging:
driver: json-file
options:
max-size: "10mb"
max-file: "1"
ENV:
sudo ./env-check.sh
-----------------------------------------------------------------
PYTHON_VERSION=3.10.12
-----------------------------------------------------------------
Transformers is not installed.
-----------------------------------------------------------------
PyTorch is not installed.
-----------------------------------------------------------------
ipex-llm ./env-check.sh: line 58: pip: command not found
-----------------------------------------------------------------
IPEX is not installed.
-----------------------------------------------------------------
CPU Information:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) w5-3423
CPU family: 6
Model: 143
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
Stepping: 8
CPU max MHz: 4200.0000
CPU min MHz: 800.0000
BogoMIPS: 4224.00
-----------------------------------------------------------------
Total CPU Memory: 125.294 GB
Memory Type: DDR5
-----------------------------------------------------------------
Operating System:
Ubuntu 22.04.5 LTS
-----------------------------------------------------------------
Linux aiflek 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May 7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
Version: 1.2.41.20250422
Build ID: 00000000
Service:
Version: 1.2.41.20250422
Build ID: 00000000
Level Zero Version: 1.21.1
-----------------------------------------------------------------
Driver UUID 32352e31-332e-3333-3237-360000000000
Driver Version 25.13.33276
Driver UUID 32352e31-332e-3333-3237-360000000000
Driver Version 25.13.33276
Driver UUID 32352e31-332e-3333-3237-360000000000
Driver Version 25.13.33276
Driver UUID 32352e31-332e-3333-3237-360000000000
Driver Version 25.13.33276
-----------------------------------------------------------------
Driver related package version:
ii intel-fw-gpu 2025.13.2-398~22.04 all Firmware package for Intel integrated and discrete GPUs
ii intel-i915-dkms 1.25.1.17.250113.16+i1-1 all Out of tree i915 driver.
-----------------------------------------------------------------
igpu not detected
-----------------------------------------------------------------
xpu-smi is properly installed.
-----------------------------------------------------------------
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information |
+-----------+--------------------------------------------------------------------------------------+
| 0 | Device Name: Intel(R) Arc(TM) A770 Graphics |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-0018-0000-000856a08086 |
| | PCI BDF Address: 0000:18:00.0 |
| | DRM Device: /dev/dri/card0 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 1 | Device Name: Intel(R) Arc(TM) A770 Graphics |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-0036-0000-000856a08086 |
| | PCI BDF Address: 0000:36:00.0 |
| | DRM Device: /dev/dri/card1 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 2 | Device Name: Intel(R) Arc(TM) A770 Graphics |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-0054-0000-000856a08086 |
| | PCI BDF Address: 0000:54:00.0 |
| | DRM Device: /dev/dri/card2 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
| 3 | Device Name: Intel(R) Arc(TM) A770 Graphics |
| | Vendor Name: Intel(R) Corporation |
| | SOC UUID: 00000000-0000-0072-0000-000856a08086 |
| | PCI BDF Address: 0000:72:00.0 |
| | DRM Device: /dev/dri/card3 |
| | Function Type: physical |
+-----------+--------------------------------------------------------------------------------------+
GPU0 Memory size=16G
GPU1 Memory size=16G
GPU2 Memory size=16G
GPU3 Memory size=16G
-----------------------------------------------------------------
18:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
Subsystem: Device 172f:3937
Flags: bus master, fast devsel, latency 0, IRQ 91, NUMA node 0
Memory at 9e000000 (64-bit, non-prefetchable) [size=16M]
Memory at 2f800000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at 9f000000 [disabled] [size=2M]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
--
36:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
Subsystem: Intel Corporation Device 1020
Flags: bus master, fast devsel, latency 0, IRQ 94, NUMA node 0
Memory at a8000000 (64-bit, non-prefetchable) [size=16M]
Memory at 3f800000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at a9000000 [disabled] [size=2M]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
--
54:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
Subsystem: Intel Corporation Device 1020
Flags: bus master, fast devsel, latency 0, IRQ 97, NUMA node 0
Memory at b3000000 (64-bit, non-prefetchable) [size=16M]
Memory at 4f800000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at b4000000 [disabled] [size=2M]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
--
72:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
Subsystem: Intel Corporation Device 1020
Flags: bus master, fast devsel, latency 0, IRQ 100, NUMA node 0
Memory at bd000000 (64-bit, non-prefetchable) [size=16M]
Memory at 5f800000000 (64-bit, prefetchable) [size=16G]
Expansion ROM at be000000 [disabled] [size=2M]
Capabilities: [40] Vendor Specific Information: Len=0c <?>
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
-----------------------------------------------------------------
SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
Try changing the 2 to 0 (maybe it gains you a few t/s).
P.S. And I have the same problem with compute and power 😭 b16, b19, b20, b21... better single-thread speed only with b12-usm.
It seems there is nothing wrong with your test-related configuration or OS kernel. Maybe this error is caused by the newer driver:
Driver related package version:
ii intel-fw-gpu 2025.13.2-398~22.04 all Firmware package for Intel integrated and discrete GPUs
ii intel-i915-dkms 1.25.1.17.250113.16+i1-1 all Out of tree i915 driver.
You can follow Option 2 in this guide to install and test again.
Hi @hzjane, no change. I tested with 4 other "official" out-of-tree drivers: the same numbers, or maybe even a bit lower on some.
I reinstalled Ubuntu and first tried the 5.15 kernel; to my surprise, vLLM always crashed. Llama ran at the same speed as with the previous kernel and drivers.
Then I went up to the good kernel and the good driver. Same speed.
So what could it be? What do you do differently?
Could it be single-core performance? The Xeon w5-3423 boosts up to 4.2 GHz, so I suppose that is fine.
What else could I try?
Now I think I will go up to Ubuntu 24.04 and check there.
Any other ideas?
5.15 kernel
https://cdrdv2-public.intel.com/828236/828236_Installation%20BKC%20and%20AI%20Benchmark%20UG%20on%20Intel%20Xeon_ARC%20A770_rev2.2.pdf
Did you read this document (user guide 828236, v2.2)?
Thanks a lot!
I have not seen this document yet, but I think I tried everything in it. I once found another guide at Intel, and it also reports about 2x the performance I'm getting.
Really, really weird. The weird thing is that llama.cpp sometimes provides better performance, even on a multi-Arc config, which seems wrong :)
llama.cpp does not even have tensor parallelism. So yeah, I checked the firmware and it also seems OK.
Llama.cpp
To be honest, I also have speed problems with vLLM.
Using Llama 3.1 (q4_k_m) as an example:
- At various times I got up to 61 t/s on llama.cpp. Recently, on the latest ipex-llm and stable kernel 6.14, 57 t/s. And the speed doesn't drop much when running the model on 2, 3, or 4 cards: on average 2-3 tokens per added card, i.e. from 57 down to 43 t/s.
- On Docker vLLM, at most 30 t/s (b12-usm) in a single thread. On b21, 16-20 t/s; on several cards, about 4 times slower (around 4.5 t/s). Only with many threads did I manage to get 350-500 and even up to 700 t/s on one card 😂 (I ran my own script with parallel requests).
I actually have worse hardware than you: 2x Xeon 2699 v3 + 128 GB DDR4 (2133 MHz), but also 4 A770 cards.
@flekol Maybe this issue is caused by oneCCL not working normally. In your Dockerfile you use RUN . /opt/intel/1ccl-wks/setvars.sh instead of sourcing /opt/intel/1ccl-wks/setvars.sh at runtime; the environment set by that RUN step may not be in effect when the vllm service is actually running. You can try running this example in the official image container first (instead of building your own image) to check whether the performance is normal.
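The underlying mechanics can be demonstrated outside Docker: each `RUN` step (and each `bash -c`) runs in its own shell, so variables exported by a sourced script in one step are gone in the next, exactly as they are gone by the time the `ENTRYPOINT` process starts. A minimal sketch with a stand-in script (`/tmp/setvars_demo.sh` is hypothetical, standing in for `setvars.sh`):

```shell
# Stand-in for setvars.sh: just exports one variable.
cat > /tmp/setvars_demo.sh <<'EOF'
export DEMO_VAR=oneapi-env
EOF

# Like `RUN . setvars.sh`: sourced in one shell, invisible in the next.
bash -c '. /tmp/setvars_demo.sh'
bash -c 'echo "separate shell: ${DEMO_VAR:-unset}"'       # prints "unset"

# Like sourcing inside the ENTRYPOINT command: visible to the process.
bash -c '. /tmp/setvars_demo.sh && echo "same shell: $DEMO_VAR"'
```

In Dockerfile terms, the fix sketched here would be to source the script inside the ENTRYPOINT command itself (e.g. prefix the shell-form ENTRYPOINT with `source /opt/intel/1ccl-wks/setvars.sh && ...`) rather than in a separate RUN step.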
@hzjane I think this was it. I am still lagging a bit behind your results. Could it be because my CPU is not as powerful as yours?
It was mentioned in #13173 that vllm V1 might be less CPU hungry.
Any plans on upgrading?
I also see a lot more movement from Intel in the vLLM repo nowadays. Will ipex continue updating vLLM, or will we soon be able to use the main vLLM repo?
Do you have any roadmap for ipex?
a lot more movement from intel in vllm repo
You mean this one? https://github.com/vllm-project/vllm/commit/b69781f107b7ad847a351f584178cfafbee2b32a
Yes
Lately there has been more movement in the PRs as well (although reviews from the vLLM team are super slow).
Also, I think Intel said that when the incoming B60 hits the market, vLLM will be up to the task.
Plus, PyTorch 2.8 will support XPU natively (without extensions). I'm really curious what will happen.
Plus pytorch 2.8
Hm... I think it already works from 2.7 🤔 Yes, the vLLM team is slow ))) I hope that will be the case, and that by the fall we will get a truly reliable vLLM.)