
A770 Performance Issue with INT4

Open RobinJing opened this issue 7 months ago • 37 comments

Describe the bug
B60 performance issue with INT4, using the latest b3 image with vLLM.

How to reproduce
Start vLLM with 1/2/4 cards and a 32B/70B model; you will find the performance is much worse compared to multiple A770s.

RobinJing avatar May 21 '25 08:05 RobinJing

Hi, I am investigating this issue.

gc-fu avatar May 22 '25 01:05 gc-fu

Hi, I could not reproduce this issue on our machines.

Model: DeepSeek-R1-Distill-Qwen-32B

For Arc A770 sym_int4 four cards, we get performance for batch size 1:

============ Serving Benchmark Result ============
Successful requests:                     1
Benchmark duration (s):                  17.65
Total input tokens:                      1024
Total generated tokens:                  512
Request throughput (req/s):              0.06
Output token throughput (tok/s):         29.01
Total Token throughput (tok/s):          87.02
---------------Time to First Token----------------

For B60 platform sym_int4 four cards:

============ Serving Benchmark Result ============
Successful requests:                     1
Benchmark duration (s):                  20.47
Total input tokens:                      1024
Total generated tokens:                  512
Request throughput (req/s):              0.05
Output token throughput (tok/s):         25.01
Total Token throughput (tok/s):          75.04
---------------Time to First Token----------------

gc-fu avatar May 23 '25 02:05 gc-fu

Hi, I could not reproduce this issue on our machines.

Could you share your software config, please? OS version, vLLM image version (intelanalytics/...)?

Host system driver version and GuC firmware (A770)?

And the vLLM start params (+ container)?

I have Ubuntu 22.04 LTS, 6.5 kernel (exactly what is recommended in the issues), out-of-tree drivers, and I can't reproduce your speed 😭 4x A770, Xeon 2699 v3 (dual socket), ReBAR, x16 PCIe v3 for each GPU, power plans + 2400 MHz GPU...

savvadesogle avatar Jun 16 '25 07:06 savvadesogle

Hi, I could not reproduce this issue on our machines.

Could you share your software config, please? OS version, vLLM image version (intelanalytics/...)?

Host system driver version and GuC firmware (A770)?

And the vLLM start params (+ container)?

I have Ubuntu 22.04 LTS, 6.5 kernel (exactly what is recommended in the issues), out-of-tree drivers, and I can't reproduce your speed 😭 4x A770, Xeon 2699 v3 (dual socket), ReBAR, x16 PCIe v3 for each GPU, power plans + 2400 MHz GPU...

Hi, on host side, there are mainly two things:

kernel: Linux ws-arc-001 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May 7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
driver: ii intel-i915-dkms 1.23.10.54.231129.55+i87-1 all Out of tree i915 driver

For the container, we recommend using the newest one: intelanalytics/ipex-llm-serving-xpu:0.8.3-b20

I start the container with the following command:

export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:0.8.3-b20
export IMAGE_NAME=test

export http_proxy=...
export https_proxy=...
export no_proxy=localhost,127.0.0.1


docker stop $IMAGE_NAME
docker rm $IMAGE_NAME


docker run -itd \
        --net=host \
        --device=/dev/dri \
        --privileged \
        --name=$IMAGE_NAME \
        -v $YOUR_MODEL_PATH:/llm/models/ \
        --shm-size="16g" \
        -e http_proxy=$http_proxy \
        -e https_proxy=$https_proxy \
        -e no_proxy=$no_proxy \
        --entrypoint /bin/bash \
        $DOCKER_IMAGE

# Enter the container
docker exec -it $IMAGE_NAME bash

I start the vLLM instance using the start-vllm-service.sh script in the image.
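For reference, a minimal way to drive that script could look like the sketch below (the /llm/start-vllm-service.sh path inside the container is an assumption; the environment variables are the ones the script reads, as shown later in this thread):

# Inside the container: point the script at the model, pick quantization and TP size,
# then let it launch the OpenAI-compatible server (default port 8000).
export MODEL_PATH=/llm/models/DeepSeek-R1-Distill-Qwen-32B
export SERVED_MODEL_NAME=DeepSeek-R1-Distill-Qwen-32B
export TENSOR_PARALLEL_SIZE=4
export LOAD_IN_LOW_BIT=sym_int4
bash /llm/start-vllm-service.sh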

gc-fu avatar Jun 17 '25 02:06 gc-fu

ii intel-i915-dkms 1.23.10.54.231129.55+i87-1 all Out of tree i915 driver.

Hello, I have:

dpkg -l | grep i915
ii intel-i915-dkms 1.23.10.92.231129.101+i141-1 all Out of tree i915 driver

i.e. .92/.101, not .54 — does it matter? It comes from:

echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg]
https://repositories.intel.com/gpu/ubuntu jammy/lts/2350 unified" | \
sudo tee /etc/apt/sources.list.d/intel-gpu-jammy.list

and uname -a Linux xpu 6.5.0-35-generic #35~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue May 7 09:00:52 UTC 2 x86_64 x86_64 x86_64 GNU/Linux

x16 PCIe gen 3, ReBAR+,
[drm] GT0: GuC firmware i915/dg2_guc_70.44.1.bin version 70.44.1
[drm] GT0: HuC firmware i915/dg2_huc_7.10.16_gsc.bin version 7.10.16

Image

Image

Image

arc@xpu:~$ sudo lspci -vvv -s 05:00.0
05:00.0 VGA compatible controller: Intel Corporation Device 56a0 (rev 08) (prog-if 00 [VGA controller])
        Subsystem: ASRock Incorporation Device 6012
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR+ FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 32 bytes
        Interrupt: pin ? routed to IRQ 74
        NUMA node: 0
        Region 0: Memory at 90000000 (64-bit, non-prefetchable) [size=16M]
        Region 2: Memory at 38000000000 (64-bit, prefetchable) [size=16G]
        Expansion ROM at [disabled]
        Capabilities: [40] Vendor Specific Information: Len=0c <?>
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq- RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 128 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes, Disabled- CommClk- ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s (ok), Width x1 (ok) TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range B, TimeoutDis+ NROPrPrP- LTR+ 10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix-
                        EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit- FRS- TPHComp- ExtTPHComp-
                        AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
                        AtomicOpsCtl: ReqEn-
                LnkCap2: Supported Link Speeds: 2.5GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis- Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                        Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1- EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                        Retimer- 2Retimers- CrosslinkRes: unsupported
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
                Address: 00000000fee00758  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [d0] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [420 v1] Physical Resizable BAR
                BAR 2: current size: 16GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB 16GB
        Capabilities: [400 v1] Latency Tolerance Reporting
                Max snoop latency: 0ns
                Max no snoop latency: 0ns
        Kernel driver in use: i915
        Kernel modules: i915

I can't get performance like this: https://github.com/intel/ipex-llm/issues/12190#issuecomment-2428480182 (40-45 t/s for llama3.1 8b). With llama.cpp I get 57 t/s for GGUF llama3.1-8b-instruct (q4_0) 😭.

With intelanalytics/ipex-llm-serving-xpu:0.8.3-b20 I get the following performance. It's for Llama-2-7b-chat FP8 and sym_int4 (the same for Meta-Llama-3.1-8B-Instruct).

In the b20 image, using /llm/vllm_online_benchmark.py:
python vllm_online_benchmark.py Llama-2-7b-chat 1 128 200
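For context, the positional arguments appear to map as follows (inferred from the invocations quoted in this thread, not verified against the script itself):

# vllm_online_benchmark.py <served_model_name> <concurrent_requests> <input_len> <output_len>
# e.g. one concurrent request, 128 input tokens, 200 output tokens:
python vllm_online_benchmark.py Llama-2-7b-chat 1 128 200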

results

FP8 TP=1

Total time for 4 requests with 1 concurrent requests: 27.147769431000143 seconds.
Average responce time: 6.786672739499977
Token throughput: 29.468351056734587
Average first token latency: 138.7945234999961 milliseconds.
P90 first token latency: 141.26580850029313 milliseconds.
P95 first token latency: 141.74743375031085 milliseconds.

Average next token latency: 33.40585511683405 milliseconds.
P90 next token latency: 33.70488628291461 milliseconds.
P95 next token latency: 33.75151422688432 milliseconds.

TP=2

Total time for 4 requests with 1 concurrent requests: 35.331129364999924 seconds.
Average responce time: 8.83257064525003
Token throughput: 22.642921819320726

Average first token latency: 100.65780224999799 milliseconds.
P90 first token latency: 102.62717159989734 milliseconds.
P95 first token latency: 103.00236929986113 milliseconds.

Average next token latency: 43.878474477387314 milliseconds.
P90 next token latency: 44.02875971758723 milliseconds.
P95 next token latency: 44.04667158994916 milliseconds.

with export CCL_DG2_USM=1 TP=1

Total time for 4 requests with 1 concurrent requests: 26.323176719999992 seconds.
Average responce time: 6.580587735500103
Token throughput: 30.391468647937575

Average first token latency: 138.6047297502273 milliseconds.
P90 first token latency: 140.51178280014938 milliseconds.
P95 first token latency: 140.91834490013753 milliseconds.

Average next token latency: 32.37135662185889 milliseconds.
P90 next token latency: 32.48628152914512 milliseconds.
P95 next token latency: 32.49526739522578 milliseconds.

TP=2

Total time for 4 requests with 1 concurrent requests: 38.168734376999964 seconds.
Average responce time: 9.541885736749919
Token throughput: 20.959563188505165

Average first token latency: 104.46700950001286 milliseconds.
P90 first token latency: 105.51302009980645 milliseconds.
P95 first token latency: 105.52207604973773 milliseconds.

Average next token latency: 47.42350091834083 milliseconds.
P90 next token latency: 49.830812013568625 milliseconds.
P95 next token latency: 50.26280701683537 milliseconds.

SYM_INT4 TP=1

Total time for 4 requests with 1 concurrent requests: 27.120149009999295 seconds.
Average responce time: 6.779840336750112
Token throughput: 29.498362995905264

Average first token latency: 128.89267025002482 milliseconds.
P90 first token latency: 130.97908580002695 milliseconds.
P95 first token latency: 131.49696290001884 milliseconds.

Average next token latency: 33.421431623115886 milliseconds.
P90 next token latency: 33.91572254422271 milliseconds.
P95 next token latency: 33.91978865150982 milliseconds.

TP=2

Total time for 4 requests with 1 concurrent requests: 34.517932019 seconds.
Average responce time: 8.62924507925004
Token throughput: 23.176359451651077

Average first token latency: 106.9625737500246 milliseconds.
P90 first token latency: 110.07771280010274 milliseconds.
P95 first token latency: 110.38005490022442 milliseconds.

Average next token latency: 42.82499542588066 milliseconds.
P90 next token latency: 43.29726479698585 milliseconds.
P95 next token latency: 43.44278799648369 milliseconds.

power settings:

sudo cpupower frequency-set -d 3.6GHz
sudo xpu-smi config -d 0 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 1 -t 0 --frequencyrange 2400,2400

These are the default settings for start-vllm-service.sh (from the b20 image):

#!/bin/bash
MODEL_PATH=${MODEL_PATH:-"default_model_path"}
SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"default_model_name"}
TENSOR_PARALLEL_SIZE=${TENSOR_PARALLEL_SIZE:-1}

MAX_NUM_SEQS=${MAX_NUM_SEQS:-256}
MAX_NUM_BATCHED_TOKENS=${MAX_NUM_BATCHED_TOKENS:-3000}
MAX_MODEL_LEN=${MAX_MODEL_LEN:-2000}
LOAD_IN_LOW_BIT=${LOAD_IN_LOW_BIT:-"fp8"}
PORT=${PORT:-8000}

echo "Starting service with model: $MODEL_PATH"
echo "Served model name: $SERVED_MODEL_NAME"
echo "Tensor parallel size: $TENSOR_PARALLEL_SIZE"
echo "Max num sequences: $MAX_NUM_SEQS"
echo "Max num batched tokens: $MAX_NUM_BATCHED_TOKENS"
echo "Max model length: $MAX_MODEL_LEN"
echo "Load in low bit: $LOAD_IN_LOW_BIT"
echo "Port: $PORT"

export USE_XETLA=OFF
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export FI_PROVIDER=shm
export TORCH_LLM_ALLREDUCE=0

export CCL_WORKER_COUNT=2        # On BMG, set CCL_WORKER_COUNT=1; otherwise, internal-oneccl will not function properly
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export CCL_SAME_STREAM=1
export CCL_BLOCKING_WAIT=0
# export CCL_DG2_USM=1         # Needed on Core to enable USM (Shared Memory GPUDirect). Xeon supports P2P and doesn't need this.

export VLLM_USE_V1=0       # Used to select between V0 and V1 engine
export IPEX_LLM_LOWBIT=$LOAD_IN_LOW_BIT        # Ensures low-bit info is used for MoE; otherwise, IPEX's default MoE will be used

source /opt/intel/1ccl-wks/setvars.sh

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $SERVED_MODEL_NAME \
  --port $PORT \
  --model $MODEL_PATH \
  --trust-remote-code \
  --block-size 8 \
  --gpu-memory-utilization 0.95 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit $LOAD_IN_LOW_BIT \
  --max-model-len $MAX_MODEL_LEN \
  --max-num-batched-tokens $MAX_NUM_BATCHED_TOKENS \
  --max-num-seqs $MAX_NUM_SEQS \
  --tensor-parallel-size $TENSOR_PARALLEL_SIZE \
  --disable-async-output-proc \
  --distributed-executor-backend ray

Log for Llama-2-7b-chat and llama3.1-8b-instruct:

vllm-log-llama-2-7b-chat.txt

vllm-log-llama-3.1-8b-instruct.txt

savvadesogle avatar Jun 17 '25 14:06 savvadesogle

Hi, I think the driver version is OK.

Another thing: are you using B60 or Arc A770? These two have different performance.

The last thing: we are now testing performance using vLLM's benchmark serving script. You can find the performance benchmark tool at: https://github.com/intel/ipex-llm/blob/main/docs/mddocs/DockerGuides/vllm_docker_quickstart.md#6-benchmarking

gc-fu avatar Jun 18 '25 02:06 gc-fu

B60 or using Arc A770?

Hi, I have 4x A770 16GB (Device 56a0).

I used the script that you pointed me to (/llm/vllm/benchmarks/benchmark_serving.py) with these params:

python /llm/vllm/benchmarks/benchmark_serving.py \
--model "/llm/models/Meta-Llama-3.1-8B-Instruct" \
--served-model-name "Meta-Llama-3.1-8B-Instruct" \
--dataset-name random \
--trust_remote_code \
--ignore-eos \
--num_prompt 1 \
--random-input-len=1024 \
--random-output-len=512

Image

And the results are as follows: Image

FP8 TP=1

============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  18.60     
Total input tokens:                      1024      
Total generated tokens:                  512       
Request throughput (req/s):              0.05      
Output token throughput (tok/s):         27.53     
Total Token throughput (tok/s):          82.58     
---------------Time to First Token----------------
Mean TTFT (ms):                          425.70    
Median TTFT (ms):                        425.70    
P99 TTFT (ms):                           425.70    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          35.56     
Median TPOT (ms):                        35.56     
P99 TPOT (ms):                           35.56     
---------------Inter-token Latency----------------
Mean ITL (ms):                           35.56     
Median ITL (ms):                         34.91     
P99 ITL (ms):                            42.72     
==================================================


FP8 TP=2

============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  23.64     
Total input tokens:                      1024      
Total generated tokens:                  512       
Request throughput (req/s):              0.04      
Output token throughput (tok/s):         21.66     
Total Token throughput (tok/s):          64.98     
---------------Time to First Token----------------
Mean TTFT (ms):                          304.39    
Median TTFT (ms):                        304.39    
P99 TTFT (ms):                           304.39    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          45.66     
Median TPOT (ms):                        45.66     
P99 TPOT (ms):                           45.66     
---------------Inter-token Latency----------------
Mean ITL (ms):                           45.66     
Median ITL (ms):                         45.24     
P99 ITL (ms):                            56.70     
==================================================


SYM_INT4 TP=1

============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  20.15     
Total input tokens:                      1024      
Total generated tokens:                  512       
Request throughput (req/s):              0.05      
Output token throughput (tok/s):         25.41     
Total Token throughput (tok/s):          76.24     
---------------Time to First Token----------------
Mean TTFT (ms):                          415.54    
Median TTFT (ms):                        415.54    
P99 TTFT (ms):                           415.54    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          38.61     
Median TPOT (ms):                        38.61     
P99 TPOT (ms):                           38.61     
---------------Inter-token Latency----------------
Mean ITL (ms):                           38.61     
Median ITL (ms):                         38.18     
P99 ITL (ms):                            44.80     
==================================================

SYM_INT4 TP=2

============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  23.03     
Total input tokens:                      1024      
Total generated tokens:                  512       
Request throughput (req/s):              0.04      
Output token throughput (tok/s):         22.23     
Total Token throughput (tok/s):          66.70     
---------------Time to First Token----------------
Mean TTFT (ms):                          304.77    
Median TTFT (ms):                        304.77    
P99 TTFT (ms):                           304.77    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          44.46     
Median TPOT (ms):                        44.46     
P99 TPOT (ms):                           44.46     
---------------Inter-token Latency----------------
Mean ITL (ms):                           44.46     
Median ITL (ms):                         44.14     
P99 ITL (ms):                            53.80     
==================================================


Other people are getting 40-45 t/s: https://github.com/intel/ipex-llm/issues/12190#issuecomment-2428480182 https://github.com/intel/ipex-llm/issues/12190#issuecomment-2445786724

And it makes me sad(

PS: used before the benchmark:

sudo cpupower frequency-set -d 3.6GHz
sudo xpu-smi config -d 0 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 1 -t 0 --frequencyrange 2400,2400

LOGS:

vllm-log-tp-1-llama3.1-8b-fp8.txt

script-log-tp-llama-3.1-8b-fp8.txt

start-vllm-service.sh

#!/bin/bash
MODEL_PATH=${MODEL_PATH:-"default_model_path"}
SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"default_model_name"}
TENSOR_PARALLEL_SIZE=${TENSOR_PARALLEL_SIZE:-1}
PIPELINE_PARALLEL_SIZE=${PIPELINE_PARALLEL_SIZE:-1}

MAX_NUM_SEQS=${MAX_NUM_SEQS:-256}
MAX_NUM_BATCHED_TOKENS=${MAX_NUM_BATCHED_TOKENS:-3000}
MAX_MODEL_LEN=${MAX_MODEL_LEN:-2000}
LOAD_IN_LOW_BIT=${LOAD_IN_LOW_BIT:-"fp8"}
PORT=${PORT:-8000}

echo "Starting service with model: $MODEL_PATH"
echo "Served model name: $SERVED_MODEL_NAME"
echo "Tensor parallel size: $TENSOR_PARALLEL_SIZE"
echo "Pipeline parallel size: $PIPELINE_PARALLEL_SIZE"
echo "Max num sequences: $MAX_NUM_SEQS"
echo "Max num batched tokens: $MAX_NUM_BATCHED_TOKENS"
echo "Max model length: $MAX_MODEL_LEN"
echo "Load in low bit: $LOAD_IN_LOW_BIT"
echo "Port: $PORT"

export USE_XETLA=OFF
export SYCL_CACHE_PERSISTENT=1
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export FI_PROVIDER=shm
export TORCH_LLM_ALLREDUCE=0

export CCL_WORKER_COUNT=2        # On BMG, set CCL_WORKER_COUNT=1; otherwise, internal-oneccl will not function properly
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1
export CCL_SAME_STREAM=1
export CCL_BLOCKING_WAIT=0
export CCL_DG2_USM=1         # Needed on Core to enable USM (Shared Memory GPUDirect). Xeon supports P2P and doesn't need this.

export VLLM_USE_V1=0       # Used to select between V0 and V1 engine
export IPEX_LLM_LOWBIT=$LOAD_IN_LOW_BIT        # Ensures low-bit info is used for MoE; otherwise, IPEX's default MoE will be used

source /opt/intel/1ccl-wks/setvars.sh

python -m ipex_llm.vllm.xpu.entrypoints.openai.api_server \
  --served-model-name $SERVED_MODEL_NAME \
  --port $PORT \
  --model $MODEL_PATH \
  --trust-remote-code \
  --block-size 8 \
  --gpu-memory-utilization 0.95 \
  --device xpu \
  --dtype float16 \
  --enforce-eager \
  --load-in-low-bit $LOAD_IN_LOW_BIT \
  --max-model-len $MAX_MODEL_LEN \
  --max-num-batched-tokens $MAX_NUM_BATCHED_TOKENS \
  --max-num-seqs $MAX_NUM_SEQS \
  --tensor-parallel-size $TENSOR_PARALLEL_SIZE \
  --pipeline-parallel-size $PIPELINE_PARALLEL_SIZE \
  --disable-async-output-proc \
  --distributed-executor-backend ray

savvadesogle avatar Jun 18 '25 08:06 savvadesogle

And here is DeepSeek-R1-Distill-Qwen-32B (FP8) tensor-parallel=4

Image

root@xpu:/llm/vllm/benchmarks# python /llm/vllm/benchmarks/benchmark_serving.py \
--model "/llm/models/DeepSeek-R1-Distill-Qwen-32B" \
--served-model-name "DeepSeek-R1-Distill-Qwen-32B" \
--dataset-name random \
--trust_remote_code \
--ignore-eos \
--num_prompt 1 \
--random-input-len=1024 \
--random-output-len=512

============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  49.33     
Total input tokens:                      1024      
Total generated tokens:                  512       
Request throughput (req/s):              0.02      
Output token throughput (tok/s):         10.38     
Total Token throughput (tok/s):          31.14     
---------------Time to First Token----------------
Mean TTFT (ms):                          951.70    
Median TTFT (ms):                        951.70    
P99 TTFT (ms):                           951.70    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          94.67     
Median TPOT (ms):                        94.67     
P99 TPOT (ms):                           94.67     
---------------Inter-token Latency----------------
Mean ITL (ms):                           94.67     
Median ITL (ms):                         89.40     
P99 ITL (ms):                            133.67    
==================================================

SYM_INT4 TP=4

============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  42.79     
Total input tokens:                      1024      
Total generated tokens:                  512       
Request throughput (req/s):              0.02      
Output token throughput (tok/s):         11.97     
Total Token throughput (tok/s):          35.90     
---------------Time to First Token----------------
Mean TTFT (ms):                          973.00    
Median TTFT (ms):                        973.00    
P99 TTFT (ms):                           973.00    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          81.83     
Median TPOT (ms):                        81.83     
P99 TPOT (ms):                           81.83     
---------------Inter-token Latency----------------
Mean ITL (ms):                           81.83     
Median ITL (ms):                         81.83     
P99 ITL (ms):                            90.56     
==================================================

Image

Image

savvadesogle avatar Jun 18 '25 09:06 savvadesogle

Hi, I will try to reproduce your result and see which part is different.

gc-fu avatar Jun 19 '25 02:06 gc-fu

Hi, below you can find my result for fp8 Llama3-8b:

============ Serving Benchmark Result ============
Successful requests:                     1
Benchmark duration (s):                  12.92
Total input tokens:                      1024
Total generated tokens:                  512
Request throughput (req/s):              0.08
Output token throughput (tok/s):         39.61
Total Token throughput (tok/s):          118.84
---------------Time to First Token----------------
Mean TTFT (ms):                          423.72
Median TTFT (ms):                        423.72
P99 TTFT (ms):                           423.72
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          24.46
Median TPOT (ms):                        24.46
P99 TPOT (ms):                           24.46
---------------Inter-token Latency----------------
Mean ITL (ms):                           24.46
Median ITL (ms):                         24.03
P99 ITL (ms):                            28.42
==================================================
Completed benchmark with num_prompt=1, random-input-len=1024.
------------------------------------------------------------

Can you try installing xpu-smi? intel-gpu-top can only show limited information.

With xpu-smi dump -m 0,1,2,3,18 we can check the GPU power level.
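For example, a dump loop for device 0 could look like the following (the flag layout and metric-ID mapping may differ between xpu-smi versions, so please check xpu-smi dump --help on your install):

# Assumed metric IDs: 0=GPU utilization, 1=power, 2=frequency, 3=core temperature, 18=memory used.
# Sample device 0 once per second.
sudo xpu-smi dump -d 0 -m 0,1,2,3,18 -i 1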

For instance, while running on our platform, the GPU utilization is 100% and the power is around 160 watts.

Image

Can you check if the GPU power is correct or not

gc-fu avatar Jun 19 '25 05:06 gc-fu

Can you check if the GPU power is correct or not

Hello

I disabled some cores (8 + 8 left). All tests are FP8, input=1024, output=512.

/etc/default/grub:
GRUB_CMDLINE_LINUX="pcie_aspm=off pci=realloc"

PS: on Ubuntu 25.04 (6.14 kernel) I have the same problem with power consumption with vLLM (b16, b20, b21, etc). When I launched llama.cpp (SYCL, IPEX-LLM), the consumption is high and the speed is 57 t/s for llama3.1 8B (q4_0) on a single GPU (A770). And I've never needed to use sudo xpu-smi config -d 0 -t 0 --frequencyrange 2400,2400 for llama.cpp (IPEX-LLM).

TP=1 SYM_INT4 num_prompt=1 138W Image

============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  18.84     
Total input tokens:                      1024      
Total generated tokens:                  512       
Request throughput (req/s):              0.05      
Output token throughput (tok/s):         27.18     
Total Token throughput (tok/s):          81.53     
---------------Time to First Token----------------
Mean TTFT (ms):                          427.64    
Median TTFT (ms):                        427.64    
P99 TTFT (ms):                           427.64    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          36.03     
Median TPOT (ms):                        36.03     
P99 TPOT (ms):                           36.03     
---------------Inter-token Latency----------------
Mean ITL (ms):                           36.03     
Median ITL (ms):                         35.50     
P99 ITL (ms):                            44.58     
==================================================

TP=1 SYM_INT4 num_prompt=30 190W

Image

============ Serving Benchmark Result ============
Successful requests:                     30        
Benchmark duration (s):                  54.39     
Total input tokens:                      30720     
Total generated tokens:                  15360     
Request throughput (req/s):              0.55      
Output token throughput (tok/s):         282.40    
Total Token throughput (tok/s):          847.20    
---------------Time to First Token----------------
Mean TTFT (ms):                          5690.23   
Median TTFT (ms):                        5711.48   
P99 TTFT (ms):                           10432.14  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          68.35     
Median TPOT (ms):                        67.42     
P99 TPOT (ms):                           85.18     
---------------Inter-token Latency----------------
Mean ITL (ms):                           68.38     
Median ITL (ms):                         52.70     
P99 ITL (ms):                            63.24     
==================================================

TP=4 SYM_INT4 num_prompt=1 85-89W

Image

============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  30.60     
Total input tokens:                      1024      
Total generated tokens:                  512       
Request throughput (req/s):              0.03      
Output token throughput (tok/s):         16.73     
Total Token throughput (tok/s):          50.20     
---------------Time to First Token----------------
Mean TTFT (ms):                          420.99    
Median TTFT (ms):                        420.99    
P99 TTFT (ms):                           420.99    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          59.05     
Median TPOT (ms):                        59.05     
P99 TPOT (ms):                           59.05     
---------------Inter-token Latency----------------
Mean ITL (ms):                           59.05     
Median ITL (ms):                         57.81     
P99 ITL (ms):                            87.95     
==================================================

TP=4 SYM_INT4 num_prompt=30 93-111W Image

============ Serving Benchmark Result ============
Successful requests:                     30        
Benchmark duration (s):                  42.79     
Total input tokens:                      30720     
Total generated tokens:                  15360     
Request throughput (req/s):              0.70      
Output token throughput (tok/s):         358.92    
Total Token throughput (tok/s):          1076.77   
---------------Time to First Token----------------
Mean TTFT (ms):                          5808.34   
Median TTFT (ms):                        5863.51   
P99 TTFT (ms):                           9795.98   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          72.36     
Median TPOT (ms):                        72.26     
P99 TPOT (ms):                           81.31     
---------------Inter-token Latency----------------
Mean ITL (ms):                           72.36     
Median ITL (ms):                         62.48     
P99 ITL (ms):                            92.39     
==================================================

savvadesogle avatar Jun 19 '25 10:06 savvadesogle

ollama-ipex-llm-2.3.0b20250612-ubuntu ./ollama run llama3.1:8b-instruct-q4_0 --verbose prompt "how to make a pizza?"

1 GPU ~170W (nvtop)

total duration:       11.701896244s
load duration:        22.330773ms
prompt eval count:    16 token(s)
prompt eval duration: 16.426966ms
prompt eval rate:     974.01 tokens/s
eval count:           729 token(s)
eval duration:        11.662139276s
eval rate:            62.51 tokens/s

Image

Image

4GPU ~89-97W (nvtop)

total duration:       12.586345975s
load duration:        22.298785ms
prompt eval count:    16 token(s)
prompt eval duration: 107.398379ms
prompt eval rate:     148.98 tokens/s
eval count:           699 token(s)
eval duration:        12.454732149s
eval rate:            56.12 tokens/s

Image

Image

savvadesogle avatar Jun 19 '25 10:06 savvadesogle

Hi @savvadesogle, thanks for your update, we are going to reproduce the vLLM and Ollama issues.

Uxito-Ada avatar Jun 20 '25 02:06 Uxito-Ada

Ollama issues

Hello, I have no problem with Ollama (llama.cpp), it was just an example. I wanted to show that there are no performance issues with Ollama in the current environment. I have a problem with vLLM only. If you need anything more, let me know. 🤝

savvadesogle avatar Jun 20 '25 06:06 savvadesogle

Ollama issues

Hello, I have no problem with Ollama (llama.cpp), it was just an example. I wanted to show that there are no performance issues with Ollama in the current environment. I have a problem with vLLM only. If you need anything more, let me know. 🤝

Hi @savvadesogle , glad to hear that, and I will focus on vLLM then.

Uxito-Ada avatar Jun 20 '25 06:06 Uxito-Ada

I suppose there is a general problem with vLLM and IPEX. There are now multiple issues raised about subpar performance of vLLM with multi-Arc setups.

Check also mine, #13214. Exactly the same problem: 1 Arc is fine and power consumption goes to quite high numbers, but adding the 2nd and turning tensor parallel on gives bad performance.

flekol avatar Jun 20 '25 14:06 flekol

1 Arc is fine

In my case, I can't get the same performance numbers as others (https://github.com/intel/ipex-llm/issues/12190#issuecomment-2428480182), even for a single card. Others get 45 t/s with llama3.1 8b on Ubuntu 22.04.05 with the out-of-tree driver (vLLM), but I get ~30 t/s.

And yes, instead of increasing performance with a second card, I'm experiencing a decrease.

PS: I saw your problem... I also thought it had been solved in your ticket :)

savvadesogle avatar Jun 20 '25 15:06 savvadesogle

B60 or using Arc A770?

Hi, i have 4x A770 16GB (Device 56a0). ......

Hi @savvadesogle ,

As you are using A770, which is a different GPU from B60, the vLLM image you use should be the A770-specific one rather than B60's; otherwise there will be a regression caused by the different software stacks, e.g. oneAPI and the v0/v1 engine.

Pls try intelanalytics/ipex-llm-serving-xpu:0.8.3-b21, the A770 one. Also, I changed the issue title to A770 to avoid confusion; ping me if I misunderstood.
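For reference, fetching the suggested image is standard docker usage; presumably only the tag changes relative to the b20 commands quoted earlier in this thread:

docker pull intelanalytics/ipex-llm-serving-xpu:0.8.3-b21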

Uxito-Ada avatar Jun 23 '25 02:06 Uxito-Ada

different software stacks

Hello

  1. Could you please list the software that needs to be installed on the host system along with specific versions (level-zero-* etc)?

  2. And which specific driver (out-of-tree) version should be installed? Right now I have 1.23.10.92.231129.101+i141-1 all Out of tree i915 driver.

  3. Is this instruction still current? https://cdrdv2-public.intel.com/828236/828236_Installation%20BKC%20and%20AI%20Benchmark%20UG%20on%20Intel%20Xeon_ARC%20A770_rev2.2.pdf Kernel 6.8 (recommended), but I have 22.04.05 (6.5.0-35-generic).

  4. And which repo should I use:

     1. deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy/lts/2350 unified

     2. deb [arch=amd64,i386 signed-by=/usr/share/keyrings/intel-graphics.gpg] https://repositories.intel.com/gpu/ubuntu jammy unified

PS: I have Xeon 2699 v3 (dual).

savvadesogle avatar Jun 23 '25 05:06 savvadesogle

Hi @savvadesogle ,

Pls use the corresponding docker image, which encapsulates the necessary dependencies, as the solution.

Uxito-Ada avatar Jun 23 '25 05:06 Uxito-Ada

intelanalytics/ipex-llm-serving-xpu:0.8.3-b21

I have the same performance

python  vllm_online_benchmark.py Meta-Llama-3.1-8B-Instruct 1 1024 512
Input = 1024, Output = 512, single thread (num thread, request)

TP=1 PP=1 SYM_INT4 ~ 123W

Total time for 4 requests with 1 concurrent requests: 68.88816110300013 seconds.
Average responce time: 17.221720198750063
Token throughput: 29.72934633772374

Average first token latency: 411.8512585001781 milliseconds.
P90 first token latency: 414.15048390017546 milliseconds.
P95 first token latency: 414.35601345019677 milliseconds.

Average next token latency: 32.89581470009781 milliseconds.
P90 next token latency: 33.163512119764995 milliseconds.
P95 next token latency: 33.17083428199601 milliseconds.

Image

Image

vLLM log Meta-Llama-3.1-8B-Instruct tp=1 sym_int4 vllm-log-llama3.1-int4-tp-1.txt


TP=2 PP=1 SYM_INT4 ~ 94-99W

Total time for 4 requests with 1 concurrent requests: 94.58585705999985 seconds.
Average responce time: 23.64618669624997
Token throughput: 21.652285697436415

Average first token latency: 258.7807229998589 milliseconds.
P90 first token latency: 262.05062619987984 milliseconds.
P95 first token latency: 262.2665790999008 milliseconds.

Average next token latency: 45.767716301859316 milliseconds.
P90 next token latency: 46.149929201957534 milliseconds.
P95 next token latency: 46.17648849139004 milliseconds.

Image


TP=4 PP=1 SYM_INT4 ERROR: I can't use TP=4 because it stops at

 (WrapperWithLoadBit pid=8239) INFO 06-23 15:11:26 [loader.py:447] Loading weights took 5.00 seconds [repeated 2x across cluster] 

and 1 CPU has too high a temperature at that moment. During normal operation of vLLM, such temperatures are not observed (usually max 60-65 °C).

Image

Plus 1 GPU (device 3 on the screenshot, which is used for the monitor, 3440x1440, 50 Hz, DP) is unused (I think).

Image

Full log from vllm:

ERROR-vllm-log-llama3.1-int4-tp-4.txt


If I uncomment the line

#export CCL_DG2_USM=1

everything works.

TP=4 PP=1 SYM_INT4 with export CCL_DG2_USM=1 ~ 66-90W

Total time for 4 requests with 1 concurrent requests: 175.11094860400044 seconds.
Average responce time: 43.777491427499854
Token throughput: 11.695442325718822

Average first token latency: 2668.648231499674 milliseconds.
P90 first token latency: 2742.8675142998145 milliseconds.
P95 first token latency: 2764.3249076498705 milliseconds.

Average next token latency: 80.44762730528436 milliseconds.
P90 next token latency: 82.77847963111573 milliseconds.
P95 next token latency: 82.78412867955035 milliseconds.

Image

Image


savvadesogle avatar Jun 23 '25 07:06 savvadesogle

Another test (/llm/vllm/benchmarks/benchmark_serving.py), INT4

python /llm/vllm/benchmarks/benchmark_serving.py --model "/llm/models/Meta-Llama-3.1-8B-Instruct" --served-model-name "Meta-Llama-3.1-8B-Instruct" --dataset-name random --trust_remote_code --ignore-eos --num_prompt 1 --random-input-len=1024 --random-output-len=512

TP=1

============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  18.02     
Total input tokens:                      1024      
Total generated tokens:                  512       
Request throughput (req/s):              0.06      
Output token throughput (tok/s):         28.41     
Total Token throughput (tok/s):          85.22     
---------------Time to First Token----------------
Mean TTFT (ms):                          418.32    
Median TTFT (ms):                        418.32    
P99 TTFT (ms):                           418.32    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          34.45     
Median TPOT (ms):                        34.45     
P99 TPOT (ms):                           34.45     
---------------Inter-token Latency----------------
Mean ITL (ms):                           34.45     
Median ITL (ms):                         34.03     
P99 ITL (ms):                            43.07     
==================================================

Image

Image

log-xpu-smi.txt

savvadesogle avatar Jun 23 '25 07:06 savvadesogle

+TP=4

python /llm/vllm/benchmarks/benchmark_serving.py --model "/llm/models/Meta-Llama-3.1-8B-Instruct" --served-model-name "Meta-Llama-3.1-8B-Instruct" --dataset-name random --trust_remote_code --ignore-eos --num_prompt 1 --random-input-len=1024 --random-output-len=512

============ Serving Benchmark Result ============
Successful requests:                     1         
Benchmark duration (s):                  26.10     
Total input tokens:                      1024      
Total generated tokens:                  512       
Request throughput (req/s):              0.04      
Output token throughput (tok/s):         19.61     
Total Token throughput (tok/s):          58.84     
---------------Time to First Token----------------
Mean TTFT (ms):                          364.00    
Median TTFT (ms):                        364.00    
P99 TTFT (ms):                           364.00    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          50.37     
Median TPOT (ms):                        50.37     
P99 TPOT (ms):                           50.37     
---------------Inter-token Latency----------------
Mean ITL (ms):                           50.37     
Median ITL (ms):                         44.81     
P99 ITL (ms):                            155.78    
==================================================


Image

Logs

vllm-log-llama3.1-int4-tp-4.txt

log-xpu-smi-tp-4.txt

savvadesogle avatar Jun 23 '25 07:06 savvadesogle

190W TP=1 num_prompt=16

============ Serving Benchmark Result ============
Successful requests:                     16        
Benchmark duration (s):                  24.55     
Total input tokens:                      16384     
Total generated tokens:                  8192      
Request throughput (req/s):              0.65      
Output token throughput (tok/s):         333.74    
Total Token throughput (tok/s):          1001.23   
---------------Time to First Token----------------
Mean TTFT (ms):                          3134.63   
Median TTFT (ms):                        3151.34   
P99 TTFT (ms):                           5534.06   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          41.89     
Median TPOT (ms):                        41.86     
P99 TPOT (ms):                           46.97     
---------------Inter-token Latency----------------
Mean ITL (ms):                           41.89     
Median ITL (ms):                         37.02     
P99 ITL (ms):                            42.17     
==================================================

Image

log-xpu-smi-tp-1-num-16.txt

vllm-log-llama3.1-int4-tp-1-num-16.txt

savvadesogle avatar Jun 23 '25 07:06 savvadesogle

@savvadesogle

What real-world performance have you achieved on 4x ARC A770 with models of approximately 30B parameters?

fif6 avatar Jun 23 '25 09:06 fif6

30B parameters

70B (q4_0) llama3.1: 9.1 t/s (3x A770, Ollama ipex-llm portable). A770 results here: https://huggingface.co/spaces/evilfreelancer/msnp-leaderboard

DeepSeek-R1-Distill-Qwen-32B: you can check this (https://github.com/intel/ipex-llm/issues/13173#issuecomment-2903063470), 29.01 t/s. And my result in this issue is https://github.com/intel/ipex-llm/issues/13173#issuecomment-2983369120, 11.97 t/s (3x slower than gc-fu gets, and close to 70B q4_0) 🤣🤣🤣

savvadesogle avatar Jun 23 '25 09:06 savvadesogle

Hi, I will check your output and logs again and see if I can find anything abnormal.

gc-fu avatar Jun 24 '25 02:06 gc-fu

Hi, I am thinking about the hardware configurations.

In our test plan, we have an Intel(R) Xeon(R) w5-3435X CPU and PCIe v4 x16, which has much larger bandwidth.

Besides, have you pinned your CPU/GPU frequencies?

gc-fu avatar Jun 24 '25 02:06 gc-fu

CPU/GPU frequencies

Hello. I have 2x xeon 2699 v3 (3.6 GHz). GPU frequency is about 2400 MHz. https://github.com/user-attachments/files/20860882/log-xpu-smi-tp-1-num-16.txt https://github.com/user-attachments/files/20860639/log-xpu-smi.txt

Can you check your PCIe rx/tx during inference (for 1 thread on 1/2 GPUs)? I think x16 PCIe gen 3 is enough.

And here is another issue with PCIe gen 4 😀 It has more modern hardware than I have, but the same problem: https://github.com/intel/ipex-llm/issues/13214

And in this issue https://github.com/intel/ipex-llm/issues/12190#issuecomment-2445786724 the hardware is the same: PCIe gen 3, 2695 v4 (3.3 GHz max core speed). And he gets 41-45 t/s at the end of the issue.
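One quick way to see whether the PCIe link actually trains up under load is to re-check LnkSta while a benchmark is running (the lspci output earlier in this thread was captured at idle and shows Speed 2.5GT/s, Width x1, which may just reflect link power saving). This is plain lspci usage with the bus address from that earlier output:

# Compare the negotiated link (LnkSta) against the slot capability (LnkCap) while vLLM is generating.
sudo lspci -vvv -s 05:00.0 | grep -E 'LnkCap:|LnkSta:'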

savvadesogle avatar Jun 24 '25 06:06 savvadesogle

Besides, have you pinned your CPU/GPU frequencies?

Sorry, I misunderstood. Yes, I use these commands:

sudo cpupower frequency-set -d 3.6GHz
sudo xpu-smi config -d 0 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 1 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 2 -t 0 --frequencyrange 2400,2400
sudo xpu-smi config -d 3 -t 0 --frequencyrange 2400,2400

I really don't understand why with llama.cpp (Meta-Llama-3.1-8B-Instruct Q4_0) I get 65 t/s and with vLLM only ~30 t/s.

Image

And here is an example from xpu-smi (vLLM): a single thread draws 120 W at 99% load... no problem with the PCIe link.

Image

savvadesogle avatar Jun 24 '25 13:06 savvadesogle