Question about benchmark result
I used the all-in-one benchmark to test the NPU of an Intel Core Ultra 9 185H. The model is Qwen/Qwen2-7B. I'm confused about the result: the image in this repo shows 19.6 tokens/s at 32 input tokens on an Intel Core Ultra 7 165H, while my result in the CSV file is:
1st token average latency(ms): 617.64
2+ avg latency(ms/token): 340.45
encoder time(ms): 0
input/output tokens: 32-32
batch_size: 1
actual input/output tokens: 32-32
num_beams: 1
low_bit: sym_int4
cpu_embedding: False
model loading time(s): 88.19
peak mem(GB): N/A
streaming: False
use_fp16_torch_dtype: N/A
My questions are:
- Is "tokens/s" calculated from "2+ avg latency(ms/token)"? If so, that would be 1000 / 340.45 ≈ 2.94 tokens/s (see the sketch after this list).
- Was the benchmark result in this repo measured on the CPU, iGPU, or NPU? If the NPU was used, my result is far below 19.6.
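For reference, a minimal sketch of how I'm converting the CSV latencies into throughput. This is my reading of the metric (tokens/s as the reciprocal of the steady-state decode latency), not confirmed against the benchmark source; the numbers are the ones from my CSV above:

```python
# Convert the benchmark's reported latencies into throughput.
first_token_latency_ms = 617.64   # "1st token average latency(ms)" from my CSV
next_token_latency_ms = 340.45    # "2+ avg latency(ms/token)" from my CSV

# Steady-state decode throughput (assumes tokens/s = 1000 / decode latency):
decode_tokens_per_s = 1000.0 / next_token_latency_ms
print(f"decode throughput: {decode_tokens_per_s:.2f} tokens/s")   # ~2.94

# End-to-end throughput for the 32-token output, counting the first token:
output_tokens = 32
total_ms = first_token_latency_ms + (output_tokens - 1) * next_token_latency_ms
print(f"end-to-end: {output_tokens / (total_ms / 1000.0):.2f} tokens/s")  # ~2.86
```

Either way I compute it, I get roughly 2.9 tokens/s, nowhere near the published 19.6.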
My config is:
repo_id:
- 'Qwen/Qwen2-7B'
local_model_hub: 'path/to/local/model'
warm_up: 1
num_trials: 3
num_beams: 1
low_bit: 'sym_int4'
batch_size: 1
in_out_pairs:
- '32-32'
- '1024-128'
test_api:
- "transformers_int4_npu_win"
cpu_embedding: False # whether to put embedding on CPU
streaming: False
task: 'continuation'
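In case it helps reproduce, here is a small sanity-check sketch I could run before benchmarking. It assumes the YAML above is saved as config.yaml in the all-in-one benchmark directory (file name and check are illustrative; the key names are taken from the config itself):

```python
# Sanity-check the benchmark config before running (illustrative only).
import yaml  # pip install pyyaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Confirm the low-bit / NPU test-API pairing this issue is about.
assert cfg["low_bit"] == "sym_int4"
assert "transformers_int4_npu_win" in cfg["test_api"]

for pair in cfg["in_out_pairs"]:
    n_in, n_out = map(int, pair.split("-"))
    print(f"will run {n_in} in -> {n_out} out tokens: "
          f"{cfg['warm_up']} warm-up + {cfg['num_trials']} trials")
```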
While the benchmark is running, I can see in Task Manager that the NPU is being used.
Have you solved this problem? I also ran the qwen2-7b (int4) example on the NPU, and inference is too slow, only 2-3 tokens/s.