
Inference speed and memory usage of Qwen1.5-14b


I have tested the inference speed and memory usage of Qwen1.5-14b on my machine using the example in ipex-llm. The peak CPU memory usage while loading Qwen1.5-14b in 4-bit is about 24GB, and the peak GPU memory usage is about 10GB. The inference speed is about 4~5 tokens/s. I set the environment variables `set SYCL_CACHE_PERSISTENT=1` and `set BIGDL_LLM_XMX_DISABLED=1`. Do the inference speed and CPU/GPU memory usage meet expectations? The peak CPU usage seems too high to me, and the speed is a little slow.

Device: Intel(R) Core(TM) Ultra 7 155H, 3.80 GHz, 32.0 GB RAM (31.6 GB available)

Environment:
- intel-extension-for-pytorch 2.1.10+xpu
- torch 2.1.0a0+cxx11.abi
- transformers 4.44.2
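For reference, the run roughly follows the ipex-llm GPU generate example; a minimal sketch of that flow (the model path, prompt, and output length are placeholders, and the actual example script may differ in detail):

```python
# Minimal sketch of the ipex-llm 4-bit GPU inference flow.
# Environment variables are set before launching, e.g. on Windows:
#   set SYCL_CACHE_PERSISTENT=1
#   set BIGDL_LLM_XMX_DISABLED=1
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "Qwen/Qwen1.5-14B-Chat"  # placeholder; a local path works too

# Load the HF checkpoint and quantize the weights to 4-bit on the fly.
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model = model.to("xpu")  # move the quantized model to the Intel GPU

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to("xpu")

with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```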

WeiguangHan avatar Sep 04 '24 11:09 WeiguangHan

Hi @WeiguangHan, we will take a look at this issue and try to reproduce it first. We'll let you know if there's any progress.

ada-jt1725 avatar Sep 06 '24 01:09 ada-jt1725

Hi @WeiguangHan, we cannot reproduce the issue on an Ultra 5 125H CPU.

The CPU usage when running the qwen1.5 example script turned out pretty normal: given that the initial usage is about 9GB, the peak CPU memory usage for loading the Qwen1.5-14B model (converted to int4 using save.py) is about 10GB. The inference speed is 9.2 tokens/sec with n-predict at its default of 32.
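If the 24GB peak on your machine comes from the on-the-fly FP16-to-int4 conversion, converting once and then reloading the low-bit checkpoint (which is what the save.py step above does) should reduce the loading footprint. A rough sketch of that pattern, assuming ipex-llm's save/load low-bit helpers and placeholder paths:

```python
# Sketch: convert the model to int4 once, then reload the quantized
# checkpoint directly on later runs (paths are placeholders).
from ipex_llm.transformers import AutoModelForCausalLM

# One-time conversion: load + quantize, then persist the int4 weights.
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen1.5-14B-Chat",
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model.save_low_bit("./qwen1.5-14b-int4")

# Later runs: load the already-quantized weights, which avoids
# materializing the full FP16 model in CPU memory first.
model = AutoModelForCausalLM.load_low_bit("./qwen1.5-14b-int4",
                                          trust_remote_code=True).to("xpu")
```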

Also, please note that it is recommended to run performance evaluation with the all-in-one benchmark (https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/dev/benchmark/all-in-one). Using our reference config, the demo output on our machine is below:

| metric | value |
| --- | --- |
| model | /Qwen1.5-14B-Chat |
| 1st token avg latency (ms) | 4517.94 |
| 2+ avg latency (ms/token) | 96.96 |
| encoder time (ms) | 0.0 |
| input/output tokens | 1024-128 |
| batch_size | 1 |
| actual input/output tokens | 1024-128 |
| num_beams | 1 |
| low_bit | sym_int4 |
| cpu_embedding | False |
| model loading time (s) | 16.18 |
| peak mem (GB) | 9.94921875 |
| streaming | False |
| use_fp16_torch_dtype | N/A |
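For a quick cross-check of the loading-memory numbers outside the benchmark, something like the following can approximate the process's peak resident memory around the load call (a rough sketch; psutil and the sampling approach are assumptions, not part of the benchmark itself):

```python
# Rough sketch: sample the process RSS in a background thread while the
# model loads, to approximate peak CPU memory during loading.
import threading
import time

import psutil
from ipex_llm.transformers import AutoModelForCausalLM

proc = psutil.Process()
peak_rss = 0
stop = threading.Event()

def sample():
    global peak_rss
    while not stop.is_set():
        peak_rss = max(peak_rss, proc.memory_info().rss)
        time.sleep(0.05)

t = threading.Thread(target=sample, daemon=True)
t.start()

# Placeholder path to a previously converted int4 checkpoint.
model = AutoModelForCausalLM.load_low_bit("./qwen1.5-14b-int4",
                                          trust_remote_code=True)

stop.set()
t.join()
print(f"peak RSS during load: {peak_rss / 1024**3:.2f} GB")
```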

ada-jt1725 avatar Sep 10 '24 07:09 ada-jt1725

Thanks a lot. The CPU of my computer is an Ultra 7 155H, so it should theoretically perform better. I will try again following your instructions.

WeiguangHan avatar Sep 11 '24 02:09 WeiguangHan