4xARC 770 on w5-3423 slow performance (half of what is reported here)

Open flekol opened this issue 6 months ago • 11 comments

Describe the bug

I observe quite bad performance with my 4x Arc A770 setup: Xeon w5-3423, 128 GB DDR5, and an ASUS W790 ACE. Main question: I suppose this setup is quite close to what you used, so what am I missing? What should be enabled in the BIOS? Could you share a bit more about how you configured your system?

This is getting really frustrating; I'm almost on the verge of selling it all. (Before this I tried with an EPYC 7282 and got more or less the same numbers.)

The GPUs are connected to PCIe 5 slots (which does not really matter, as they are PCIe 4 cards anyway). In theory they get good bandwidth (all of them about 18 GBPS, so inference should not be a problem):

sudo xpu-smi diag -d 0 --singletest 5

+------------------+-------------------------------------------------------------------------------+
| Device ID        | 0                                                                             |
+------------------+-------------------------------------------------------------------------------+
| Integration PCIe | Result: Fail                                                                  |
|                  | Message: Fail to check PCIe bandwidth. Its bandwidth is 17.995 GBPS.          |
|                  |   Unconfigured or invalid threshold. Fail on copy engine group 1.             |
+------------------+-------------------------------------------------------------------------------+

I'm on Ubuntu 22.04 and I installed everything as in your tutorial: out-of-tree driver, same kernel, etc. After quite some testing I even compiled the out-of-tree driver myself from the backports repo. Same results.

I locked the frequencies:

sudo xpu-smi config -d 0 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 1 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 2 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 3 -t 0 --frequencyrange 2400,2400
sudo cpupower frequency-set -g performance
sudo cpupower frequency-set -d 4.0GHz

and still got only a slight improvement.
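
The four per-device commands above follow one pattern, so they can be generated in a loop. A minimal sketch, assuming four devices numbered 0-3 and the same 2400 MHz range used above; the `DRY_RUN` switch and the `lock_cmd` helper are illustrative names, not part of xpu-smi:

```shell
#!/usr/bin/env bash
# Generate the same xpu-smi frequency-lock command for every GPU index.
# DRY_RUN=1 (the default here) only prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"
FREQ_MHZ=2400   # value used in the post above
NUM_GPUS=4      # assumption: devices are numbered 0..3

# Print the xpu-smi command line for one device id (hypothetical helper).
lock_cmd() {
    printf 'xpu-smi config -d %d -t 0 --frequencyrange %d,%d\n' \
        "$1" "$FREQ_MHZ" "$FREQ_MHZ"
}

for ((dev = 0; dev < NUM_GPUS; dev++)); do
    cmd=$(lock_cmd "$dev")
    if [ "$DRY_RUN" = "1" ]; then
        echo "sudo $cmd"
    else
        # Word splitting of $cmd is intentional: it holds a full command line.
        sudo $cmd
    fi
done
```

Run it once with the default `DRY_RUN=1` to check the printed commands, then with `DRY_RUN=0` to apply them.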

Tests with model=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B on 4xARC sym-int4

I get roughly half of what you are getting, whatever I do.

============ Serving Benchmark Result ============
Successful requests:                     1
Benchmark duration (s):                  35.40
Total input tokens:                      1024
Total generated tokens:                  512
Request throughput (req/s):              0.03
Output token throughput (tok/s):         14.46
Total Token throughput (tok/s):          43.39
---------------Time to First Token----------------
Mean TTFT (ms):                          1332.15
Median TTFT (ms):                        1332.15
P99 TTFT (ms):                           1332.15
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          66.67
Median TPOT (ms):                        66.67
P99 TPOT (ms):                           66.67
---------------Inter-token Latency----------------
Mean ITL (ms):                           66.67
Median ITL (ms):                         66.36
P99 ITL (ms):                            74.45
==================================================
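
As a sanity check, the summary figures can be recomputed from the raw numbers in the report. A small sketch using awk, with the values copied from the output above; the variable names are illustrative:

```shell
#!/usr/bin/env bash
# Recompute throughput figures from the raw numbers in the benchmark report.
duration_s=35.40
input_tokens=1024
output_tokens=512
ttft_ms=1332.15
tpot_ms=66.67

summary=$(awk -v d="$duration_s" -v i="$input_tokens" -v o="$output_tokens" \
              -v ttft="$ttft_ms" -v tpot="$tpot_ms" 'BEGIN {
    # Output and total token throughput over the whole run.
    printf "output_tok_s=%.2f\n", o / d
    printf "total_tok_s=%.2f\n", (i + o) / d
    # Pure decode rate implied by the mean time per output token.
    printf "decode_tok_s=%.2f\n", 1000 / tpot
    # Run time rebuilt from TTFT plus (o - 1) decode steps.
    printf "estimated_duration_s=%.2f\n", (ttft + (o - 1) * tpot) / 1000
}')
echo "$summary"
```

The rebuilt duration matches the reported 35.40 s, so TTFT plus 511 decode steps at the mean TPOT accounts for the whole run. Almost all of the time is spent in decode, at about 15 tok/s.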


Also, my GPUs don't go above 110 W for a single request (with 32 requests it goes up by a lot, but not for a single one; the picture is from FP8, though).

Picture attached

My tests were performed with a custom Docker image based on intelanalytics/ipex-llm-serving-xpu:latest (built today). I have tried many other builds as well, and the performance does not really differ.

One observation: CCL_WORKER_COUNT=1 is the fastest

FROM intelanalytics/ipex-llm-serving-xpu:latest

WORKDIR /temp

SHELL ["/bin/bash", "-c"]

WORKDIR /llm
RUN . /opt/intel/1ccl-wks/setvars.sh

ENTRYPOINT numactl -C 0-11 python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name ${served_model_name} \
  --quantization ${quantization} \
  --model $model \
  --port $port \
  --trust-remote-code \
  --block-size ${block_size} \
  --gpu-memory-utilization ${gpu_memory_utilization} \
  --device xpu \
  --dtype $dtype \
  --enforce-eager \
  --load-in-low-bit ${load_in_low_bit} \
  --max-model-len ${max_model_len} \
  --max-num-batched-tokens ${max_num_batched_tokens} \
  --max-num-seqs ${max_num_seqs} \
  --tensor-parallel-size ${tensor_parallel_size} \
  --pipeline-parallel-size ${pipeline_parallel_size} \
  --disable-async-output-proc \
  --distributed-executor-backend ray

services:
  vllm-ipex:
    image: intelanalytics/ipex-llm-serving-xpu-custom:latest
    container_name: vllm-ipex
    build:
      dockerfile: ./dockerfile/dockerfile
    volumes:
      - "/models/huggingface:/root/.cache/huggingface"
      - /cloud/custom/ipex-llm/:/ipex-llm
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    devices:
      - /dev/dri:/dev/dri
    privileged: true
    tty: true
    ports:
      - 8000:8000
    shm_size: "32g"
    environment:
      # Model selection
      - quantization=None
      - model=deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
      - served_model_name=DeepSeek-R1-Distill-Qwen-32B

      # Timezone and device
      - TZ=Europe/Berlin
      - DEVICE=Arc

      # Intel/OneAPI/CCL/SYCL
      - SYCL_CACHE_PERSISTENT=1
      - CCL_WORKER_COUNT=1
      - FI_PROVIDER=shm
      - CCL_ATL_TRANSPORT=ofi
      - CCL_ZE_IPC_EXCHANGE=sockets
      - CCL_ATL_SHM=1
      - USE_XETLA=OFF
      - SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
      - TORCH_LLM_ALLREDUCE=0
      - VLLM_USE_V1=0
      - CCL_SAME_STREAM=1
      - CCL_BLOCKING_WAIT=0
      - IPEX_LLM_LOWBIT=fp8

      # vLLM/Serving
      - port=8000
      - gpu_memory_utilization=0.9
      - dtype=float16
      - block_size=8
      - load_in_low_bit=sym_int4
      - max_model_len=9000
      - max_num_batched_tokens=9000
      - max_num_seqs=32
      - tensor_parallel_size=4
      - pipeline_parallel_size=1
      - enforce_eager=true

    restart: unless-stopped

    logging:
      driver: json-file
      options:
        max-size: "10mb"
        max-file: "1"

ENV:

sudo ./env-check.sh
-----------------------------------------------------------------
PYTHON_VERSION=3.10.12
-----------------------------------------------------------------
Transformers is not installed.
-----------------------------------------------------------------
PyTorch is not installed.
-----------------------------------------------------------------
ipex-llm ./env-check.sh: line 58: pip: command not found
-----------------------------------------------------------------
IPEX is not installed.
-----------------------------------------------------------------
CPU Information:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             24
On-line CPU(s) list:                0-23
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) w5-3423
CPU family:                         6
Model:                              143
Thread(s) per core:                 2
Core(s) per socket:                 12
Socket(s):                          1
Stepping:                           8
CPU max MHz:                        4200.0000
CPU min MHz:                        800.0000
BogoMIPS:                           4224.00
-----------------------------------------------------------------
Total CPU Memory: 125.294 GB
Memory Type: DDR5
-----------------------------------------------------------------
Operating System:
Ubuntu 22.04.5 LTS \n \l

-----------------------------------------------------------------
Linux aiflek 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May  7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
-----------------------------------------------------------------
CLI:
    Version: 1.2.41.20250422
    Build ID: 00000000

Service:
    Version: 1.2.41.20250422
    Build ID: 00000000
    Level Zero Version: 1.21.1
-----------------------------------------------------------------
  Driver UUID                                     32352e31-332e-3333-3237-360000000000
  Driver Version                                  25.13.33276
  Driver UUID                                     32352e31-332e-3333-3237-360000000000
  Driver Version                                  25.13.33276
  Driver UUID                                     32352e31-332e-3333-3237-360000000000
  Driver Version                                  25.13.33276
  Driver UUID                                     32352e31-332e-3333-3237-360000000000
  Driver Version                                  25.13.33276
-----------------------------------------------------------------
Driver related package version:
ii  intel-fw-gpu                                   2025.13.2-398~22.04                     all          Firmware package for Intel integrated and discrete GPUs
ii  intel-i915-dkms                                1.25.1.17.250113.16+i1-1                all          Out of tree i915 driver.
-----------------------------------------------------------------
igpu not detected
-----------------------------------------------------------------
xpu-smi is properly installed.
-----------------------------------------------------------------
+-----------+--------------------------------------------------------------------------------------+
| Device ID | Device Information                                                                   |
+-----------+--------------------------------------------------------------------------------------+
| 0         | Device Name: Intel(R) Arc(TM) A770 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0018-0000-000856a08086                                       |
|           | PCI BDF Address: 0000:18:00.0                                                        |
|           | DRM Device: /dev/dri/card0                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
| 1         | Device Name: Intel(R) Arc(TM) A770 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0036-0000-000856a08086                                       |
|           | PCI BDF Address: 0000:36:00.0                                                        |
|           | DRM Device: /dev/dri/card1                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
| 2         | Device Name: Intel(R) Arc(TM) A770 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0054-0000-000856a08086                                       |
|           | PCI BDF Address: 0000:54:00.0                                                        |
|           | DRM Device: /dev/dri/card2                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
| 3         | Device Name: Intel(R) Arc(TM) A770 Graphics                                          |
|           | Vendor Name: Intel(R) Corporation                                                    |
|           | SOC UUID: 00000000-0000-0072-0000-000856a08086                                       |
|           | PCI BDF Address: 0000:72:00.0                                                        |
|           | DRM Device: /dev/dri/card3                                                           |
|           | Function Type: physical                                                              |
+-----------+--------------------------------------------------------------------------------------+
GPU0 Memory size=16G
GPU1 Memory size=16G
GPU2 Memory size=16G
GPU3 Memory size=16G
-----------------------------------------------------------------
18:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Device 172f:3937
        Flags: bus master, fast devsel, latency 0, IRQ 91, NUMA node 0
        Memory at 9e000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 2f800000000 (64-bit, prefetchable) [size=16G]
        Expansion ROM at 9f000000 [disabled] [size=2M]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
--
36:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Intel Corporation Device 1020
        Flags: bus master, fast devsel, latency 0, IRQ 94, NUMA node 0
        Memory at a8000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 3f800000000 (64-bit, prefetchable) [size=16G]
        Expansion ROM at a9000000 [disabled] [size=2M]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
--
54:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Intel Corporation Device 1020
        Flags: bus master, fast devsel, latency 0, IRQ 97, NUMA node 0
        Memory at b3000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 4f800000000 (64-bit, prefetchable) [size=16G]
        Expansion ROM at b4000000 [disabled] [size=2M]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
--
72:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
        Subsystem: Intel Corporation Device 1020
        Flags: bus master, fast devsel, latency 0, IRQ 100, NUMA node 0
        Memory at bd000000 (64-bit, non-prefetchable) [size=16M]
        Memory at 5f800000000 (64-bit, prefetchable) [size=16G]
        Expansion ROM at be000000 [disabled] [size=2M]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express Endpoint, MSI 00
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
-----------------------------------------------------------------

Image

flekol avatar Jun 07 '25 19:06 flekol

SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2

Try changing 2 to 0 (it may gain you a few t/s).

P.S. I have the same problem with compute and power 😭 on b16, b19, b20, b21. Single-thread performance is better only with b12-usm.

savvadesogle avatar Jun 07 '25 22:06 savvadesogle

There seems to be nothing wrong with your test configuration or OS kernel. Maybe the problem is caused by the newer driver:

Driver related package version:
ii  intel-fw-gpu                                   2025.13.2-398~22.04                     all          Firmware package for Intel integrated and discrete GPUs
ii  intel-i915-dkms                                1.25.1.17.250113.16+i1-1                all          Out of tree i915 driver.

You can follow Option 2 of this guide to install the driver and test again.

hzjane avatar Jun 09 '25 03:06 hzjane

Hi @hzjane, no change. I tested it with 4 other "official" out-of-tree drivers: same numbers, or maybe even a bit lower on some.

I reinstalled Ubuntu and first tried the 5.15 kernel; to my surprise, vLLM always crashed. Llama was running at the same speed as with the previous kernel and drivers.

Then I went up to the good kernel and good driver. Same speed.

So what could it be? What do you do differently?

Could it be single-core performance? The Xeon w5-3423 goes up to 4.2 GHz, so I suppose this is OK.

What else could I try?

Now I think I will go up to Ubuntu 24.04 and check there.

Any other ideas?

flekol avatar Jun 14 '25 19:06 flekol

5.15 kernel

https://cdrdv2-public.intel.com/828236/828236_Installation%20BKC%20and%20AI%20Benchmark%20UG%20on%20Intel%20Xeon_ARC%20A770_rev2.2.pdf

Did you read this document (user guide 828236, v2.2)?

savvadesogle avatar Jun 14 '25 19:06 savvadesogle

Thanks a lot!

I have not seen this document yet, but I think I tried all of them. I once found another guide at Intel, and it always shows about 2x the performance that I'm getting.

Really, really weird. The weird thing is that llama.cpp sometimes provides better performance even on a multi-Arc config, which seems wrong :)

It does not have tensor parallel. So yeah, I checked the firmware and it also seems OK.

flekol avatar Jun 14 '25 20:06 flekol

Llama.cpp

To be honest, I also have speed problems with vLLM.

Using Llama 3.1 (q4_k_m) as an example:

  1. At various times I got up to 61 t/s on llama.cpp. Recently, on the latest versions of ipex-llm and stable kernel 6.14, I get 57 t/s. The speed doesn't drop much if you run the model on 2, 3, or 4 cards: on average by 2-3 tokens for each added card. That is, from 57 to 43 t/s.

  2. On Docker vLLM the maximum is 30 t/s (b12-usm) in one thread. On b21 it is 16-20 t/s, and on several cards it is 4 times slower (((( at about 4.5 t/s. Only with many threads did I manage to get 350-500 and even up to 700 t/s on one card 😂 (I ran my own script with parallel requests).

My hardware really is worse than yours: 2x Xeon 2699 v3 + 128 GB DDR4 (2133 MHz), but also 4 A770 cards.
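
A parallel-request test along these lines can be sketched with curl and background jobs. This is an assumption-laden example, not the script mentioned above: the URL and model name are taken from the compose file earlier in the thread, while the prompt, the request count, and the helper names are made up:

```shell
#!/usr/bin/env bash
# Sketch: send N identical completion requests concurrently to an
# OpenAI-compatible vLLM endpoint. All defaults below are placeholders.
URL="${URL:-http://localhost:8000/v1/completions}"
MODEL="${MODEL:-DeepSeek-R1-Distill-Qwen-32B}"
N="${N:-32}"
MAX_TOKENS="${MAX_TOKENS:-512}"

# Print the JSON body for one request (hypothetical helper).
build_payload() {
    printf '{"model": "%s", "prompt": "%s", "max_tokens": %d}\n' \
        "$MODEL" "Explain tensor parallelism in one paragraph." "$MAX_TOKENS"
}

# Launch N requests in the background and wait for all of them to finish.
run_parallel() {
    local i
    for ((i = 0; i < N; i++)); do
        curl -s -o /dev/null -H 'Content-Type: application/json' \
            -d "$(build_payload)" "$URL" &
    done
    wait
}

# Only contact the server when explicitly asked to (RUN=1).
if [ "${RUN:-0}" = "1" ]; then
    time run_parallel
fi
```

With `RUN=1 N=32`, the wall time printed by `time` divided into `N * MAX_TOKENS` gives a rough upper bound on aggregate tokens per second, since a request may stop before reaching `max_tokens`.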

savvadesogle avatar Jun 14 '25 20:06 savvadesogle

@flekol Maybe this issue is caused by 1ccl not working properly. In your Dockerfile you use RUN . /opt/intel/1ccl-wks/setvars.sh instead of source /opt/intel/1ccl-wks/setvars.sh. I think its settings may not be in effect when the vLLM service is actually running. You can try running this example in the official image container first (instead of building your own image) and check whether the performance is normal.
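
One way to make sure the oneCCL environment is active in the serving process is to source the script in the same shell that starts the server, for example from a small entrypoint wrapper. This is a sketch only: the wrapper and its `load_env` helper are hypothetical, the path is the one from the Dockerfile above, and what `setvars.sh` itself expects has not been verified here:

```shell
#!/usr/bin/env bash
# Hypothetical entrypoint wrapper: load the oneCCL environment in the same
# shell that launches the server, so the server process inherits it.

# Source an environment script if it exists; return non-zero otherwise.
load_env() {
    local script="$1"
    # Drop our own argument so the sourced script sees no positional parameters.
    shift
    if [ -f "$script" ]; then
        # shellcheck disable=SC1090
        . "$script"
    else
        echo "warning: $script not found, continuing without it" >&2
        return 1
    fi
}

# Path taken from the Dockerfile in this thread; override with SETVARS=...
SETVARS="${SETVARS:-/opt/intel/1ccl-wks/setvars.sh}"
load_env "$SETVARS" || true

# When called with arguments (the real server command line), replace this
# shell with that command so it runs with the environment loaded above.
if [ "$#" -gt 0 ]; then
    exec "$@"
fi
```

In the Dockerfile this script would be used as the ENTRYPOINT, with the existing `numactl -C 0-11 python -m ...` command line passed as its arguments.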

hzjane avatar Jun 16 '25 03:06 hzjane

@hzjane I think this was it. I am still lagging a bit behind your results. Could it be that my CPU is not as powerful as yours?

It was mentioned in #13173 that vllm V1 might be less CPU hungry.

Any plans on upgrading?

I also see a lot more movement from Intel in the vLLM repo nowadays, so will ipex continue updating vLLM, or can we soon use the main vLLM repo?

Do you have any roadmap for ipex?

flekol avatar Jun 30 '25 20:06 flekol

a lot more movement from intel in vllm repo

you mean this one? https://github.com/vllm-project/vllm/commit/b69781f107b7ad847a351f584178cfafbee2b32a

savvadesogle avatar Jun 30 '25 20:06 savvadesogle

Yes

Lately there was more movement in the PRs as well (although reviews from the vLLM team are super slow).

Also, I think Intel said that when the upcoming B60 hits the market, vLLM will be up to the task.

Plus, PyTorch 2.8 will support XPU natively (without extensions). I'm really curious what will happen.

flekol avatar Jun 30 '25 21:06 flekol

Plus pytorch 2.8

Hm... I think it has worked since 2.7 🤔 Yes, the vLLM team is slow ))) I hope that will be the case, and that by the fall we get a truly reliable vLLM )

savvadesogle avatar Jun 30 '25 21:06 savvadesogle