
Why is the inference time so long on CPU (i9-13900) under Win11?

Open zhengshuo1 opened this issue 2 years ago • 33 comments

I used the intel-extension-for-transformers INT4 LLaMA2 model. The model was produced with python scripts/convert.py and python scripts/quantize.py. The operating system is Win11, and the CPU is a 13th Gen Intel(R) Core(TM) i9-13900HX. The input is 1037 tokens long. My command is "python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 1500 -b 2048 -n 256 -t 56". The inference time is several times that of llama.cpp. Is there anything wrong with my usage?

zhengshuo1 avatar Nov 14 '23 04:11 zhengshuo1

We updated the AVX_VNNI support last week, which gives better performance on 12th and 13th Gen Intel CPUs. You can update your code, then run python scripts/quantize.py --model_name llama2 --model_file ~/ne-f32.bin --out_file ~/ne-q4_j.bin --weight_dtype int4 --group_size -1 --scale_dtype fp32 --compute_dtype int8 --alg sym --nthread 8 to quantize the model, and python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 1500 -b 2048 -n 256 -t 8 to run the INT4 model.
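Laid out as two separate steps (~/ne-f32.bin is whatever path your convert step produced):

```sh
# Re-quantize with the AVX_VNNI-friendly settings suggested above
python scripts/quantize.py --model_name llama2 --model_file ~/ne-f32.bin --out_file ~/ne-q4_j.bin \
    --weight_dtype int4 --group_size -1 --scale_dtype fp32 --compute_dtype int8 --alg sym --nthread 8

# Run the resulting INT4 model with 8 threads
python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 1500 -b 2048 -n 256 -t 8
```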

yuchengliu1 avatar Nov 14 '23 06:11 yuchengliu1

We updated the AVX_VNNI support last week, which gives better performance on 12th and 13th Gen Intel CPUs. You can update your code, then run python scripts/quantize.py --model_name llama2 --model_file ~/ne-f32.bin --out_file ~/ne-q4_j.bin --weight_dtype int4 --group_size -1 --scale_dtype fp32 --compute_dtype int8 --alg sym --nthread 8 to quantize the model, and python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 1500 -b 2048 -n 256 -t 8 to run the INT4 model.

I'm sorry, but I have checked the code: the version I downloaded already contains the AVX_VNNI support. However, the speed is still very slow. Do you have any reference data for the expected speed?

zhengshuo1 avatar Nov 16 '23 02:11 zhengshuo1

We tested our AVX_VNNI code on a 12900 with 4800 DDR5. The first-token latency is 39682 ms for a 1024-token input, and the latency for each remaining token is 189 ms.

yuchengliu1 avatar Nov 16 '23 03:11 yuchengliu1

Here are some logs of my usage. Have I used the supported AVX_VNNI?

Welcome to use the llama on the ITREX!
main: seed = 1700113228
AVX:1 AVX2:1 AVX512F:0 AVX_VNNI:1 AVX512_VNNI:0 AMX_INT8:0 AMX_BF16:0 AVX512_BF16:0 AVX512_FP16:0
model.cpp: loading model from ne-q4_j.bin
init: n_vocab = 32000
init: n_embd = 4096
init: n_mult = 256
init: n_head = 32
init: n_head_kv = 32
init: n_layer = 32
init: n_rot = 128
init: n_ff = 11008
init: n_parts = 1
load: ne ctx size = 3635.30 MB
load: mem required = 5685.30 MB (+ memory per state)
...................................................................................................
model_init_from_file: support_jblas_kv = 0
model_init_from_file: kv self size = 375.00 MB

system_info: n_threads = 8 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | F16C = 1 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 1500, n_batch = 2048, n_predict = 256, n_keep = 0

model_print_timings: load time = 1484607.07 ms
model_print_timings: sample time = 397.51 ms / 170 runs ( 2.34 ms per token)
model_print_timings: prompt eval time = 1483003.51 ms / 1085 tokens ( 1366.82 ms per token)
model_print_timings: eval time = 502225.24 ms / 169 runs ( 2971.75 ms per token)
model_print_timings: total time = 1987475.40 ms

zhengshuo1 avatar Nov 16 '23 08:11 zhengshuo1

Thank you for your feedback. We have partially reproduced your performance issue on Win11 and will fix it in a few days. Furthermore, we discovered a large performance gap between WSL and Windows. If you want our expected performance right now, you can try our code in WSL.

yuchengliu1 avatar Nov 16 '23 15:11 yuchengliu1

We tested our AVX_VNNI code on a 12900 with 4800 DDR5. The first-token latency is 39682 ms for a 1024-token input, and the latency for each remaining token is 189 ms.

I once tested llama.cpp's performance on the same machine. The first-token latency was 43050 ms for a 1063-token input, and the latency for each remaining token was 119 ms. So perhaps intel-extension-for-transformers now has performance equivalent to llama.cpp on Intel Core CPUs. Is my understanding right?

zhengshuo1 avatar Nov 17 '23 02:11 zhengshuo1

Please rebuild your project with the commands in the new README. We expect a large advantage on the first token and equivalent performance on the remaining tokens compared to llama.cpp.

yuchengliu1 avatar Nov 17 '23 09:11 yuchengliu1

Please rebuild your project with the commands in the new README. We expect a large advantage on the first token and equivalent performance on the remaining tokens compared to llama.cpp.

Hello. I read the new README. Were the old executables built in Debug mode and the new ones in Release mode, so that the inference time on Windows becomes shorter? I compiled in Release mode and ran scripts/inference.py (the actual executable was run_llama.exe); however, the output was empty. If I want to add some logs to see what happened, which files should I focus on?

zhengshuo1 avatar Nov 20 '23 06:11 zhengshuo1

The path in the Python script is wrong, so it cannot find the .exe; we will fix it. You can try executing run_llama.exe directly in cmd to get logs. The parameters are -m ne-q4_j.bin -c 1500 -b 2048 -n 256 -t 8 -p ....
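For example (the prompt text after -p is a placeholder; substitute your own):

```bat
run_llama.exe -m ne-q4_j.bin -c 1500 -b 2048 -n 256 -t 8 -p "<your prompt here>"
```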

yuchengliu1 avatar Nov 21 '23 07:11 yuchengliu1

The path in the Python script is wrong, so it cannot find the .exe; we will fix it. You can try executing run_llama.exe directly in cmd to get logs. The parameters are -m ne-q4_j.bin -c 1500 -b 2048 -n 256 -t 8 -p ....

I think I put run_llama.exe right under the "build" path rather than "build/bin/Release". So I will run run_llama.exe in cmd to see what happens.

zhengshuo1 avatar Nov 21 '23 09:11 zhengshuo1

[Screenshot 2023-11-21 172719] There is no difference between using inference.py and run_llama.exe: the output of the Release executable is empty, and the log doesn't show any exceptions.

zhengshuo1 avatar Nov 21 '23 09:11 zhengshuo1

You should add -p "..." to give the beginning of the text, for example run_llama.exe -m ne-q4_j.bin -c 1500 -b 2048 -n 256 -t 8 -p "She opened the door and saw" or python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 1500 -b 2048 -n 256 -t 8 -p "She opened the door and saw"

yuchengliu1 avatar Nov 21 '23 15:11 yuchengliu1

You should add -p "..." to give the beginning of the text, for example run_llama.exe -m ne-q4_j.bin -c 1500 -b 2048 -n 256 -t 8 -p "She opened the door and saw" or python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 1500 -b 2048 -n 256 -t 8 -p "She opened the door and saw"

When I used inference.py, I wrote my own prompt in the Python file, so I didn't need to set the -p parameter. When I used run_llama.exe, I forgot to set it. However, I pulled the newest version and set -p when I ran the Release run_llama.exe just now, and the behavior was the same as before. See the two screenshots: the first is the log from the Debug run_llama.exe, and the second is the Release version. The Release version's output is empty, so I think there is still something wrong. [Screenshot 2023-11-22 092653] [Screenshot 2023-11-22 092723]

zhengshuo1 avatar Nov 22 '23 01:11 zhengshuo1

Sorry, I cannot reproduce the behavior of your Release version. You could delete the build directory and rebuild our code from scratch. If run_llama.exe still outputs nothing, please provide your build commands to help us debug it. Thanks.

yuchengliu1 avatar Nov 22 '23 06:11 yuchengliu1

Sorry, I cannot reproduce the behavior of your Release version. You could delete the build directory and rebuild our code from scratch. If run_llama.exe still outputs nothing, please provide your build commands to help us debug it. Thanks.

I already deleted the build directory and rebuilt it. The build commands are: (1) mkdir build (2) cd build (3) cmake .. (4) cmake --build . -j --config Release. I think they are the same as in README.md.
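For reference, that is the standard out-of-source CMake sequence from the README, run from the project root:

```sh
mkdir build
cd build
cmake ..
# build in Release configuration, using all available cores
cmake --build . -j --config Release
```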

zhengshuo1 avatar Nov 22 '23 07:11 zhengshuo1

Sorry, I cannot reproduce the behavior of your Release version. You could delete the build directory and rebuild our code from scratch. If run_llama.exe still outputs nothing, please provide your build commands to help us debug it. Thanks.

One more question: should I regenerate the model file with the Release executable?

zhengshuo1 avatar Nov 22 '23 08:11 zhengshuo1

Does your system also have Visual Studio 2022, and did you run these commands in the Developer PowerShell for VS 2022?

yuchengliu1 avatar Nov 22 '23 08:11 yuchengliu1

Sorry, I cannot reproduce the behavior of your Release version. You could delete the build directory and rebuild our code from scratch. If run_llama.exe still outputs nothing, please provide your build commands to help us debug it. Thanks.

One more question: should I regenerate the model file with the Release executable?

No need. The model file is specific to the CPU, not to the Debug/Release build.

yuchengliu1 avatar Nov 22 '23 08:11 yuchengliu1

Does your system also have Visual Studio 2022, and did you run these commands in the Developer PowerShell for VS 2022?

It still doesn't work. After adding some logs, I found that the crash location is in the function "ne_graph_compute". I didn't search further for the exact line inside ne_graph_compute where it crashes. Maybe this gives you some information? Perhaps the error doesn't happen on your machine but happens on mine because of some particular hardware configuration.

zhengshuo1 avatar Nov 22 '23 08:11 zhengshuo1

I recommend using Visual Studio to debug our code on Windows if it crashes somewhere; Visual Studio will take you to the crashing line if you run the code inside it. Does the code crash on your machine, or does it produce no output after a long time with low CPU usage?

yuchengliu1 avatar Nov 22 '23 08:11 yuchengliu1

I recommend using Visual Studio to debug our code on Windows if it crashes somewhere; Visual Studio will take you to the crashing line if you run the code inside it. Does the code crash on your machine, or does it produce no output after a long time with low CPU usage?

I think it crashes. I looked in the Event Viewer and saw the following error information. [Screenshot 2023-11-22 171601]

zhengshuo1 avatar Nov 22 '23 09:11 zhengshuo1

I recommend using Visual Studio to debug our code on Windows if it crashes somewhere; Visual Studio will take you to the crashing line if you run the code inside it. Does the code crash on your machine, or does it produce no output after a long time with low CPU usage?

I used Visual Studio, but it didn't take me to the crash line. After adding logs, I found the crashing line is in jblas_common.hpp; the problem seems to happen at "mActLauncher.launch". Is this information enough? The error shown in Visual Studio is in the picture. [Screenshot 2023-11-23 113028]

zhengshuo1 avatar Nov 23 '23 03:11 zhengshuo1

Could you modify https://github.com/intel/intel-extension-for-transformers/blob/4101a8050e465481e7413e812b9d6bdcde64c090/intel_extension_for_transformers/llm/library/jblas/CMakeLists.txt#L26 to target_link_options(${PROJECT_NAME} INTERFACE /STACK:5242880) # Stack requires up to L2 cache size and rebuild the code? Thank you.
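In other words, the linker option on that line would look like this (5242880 bytes reserves a 5 MB stack, since the kernels need stack space up to the L2 cache size):

```cmake
# jblas/CMakeLists.txt: enlarge the MSVC stack reserve for the interface target
target_link_options(${PROJECT_NAME} INTERFACE /STACK:5242880) # Stack requires up to L2 cache size
```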

yuchengliu1 avatar Nov 23 '23 06:11 yuchengliu1

Could you modify

https://github.com/intel/intel-extension-for-transformers/blob/4101a8050e465481e7413e812b9d6bdcde64c090/intel_extension_for_transformers/llm/library/jblas/CMakeLists.txt#L26

to target_link_options(${PROJECT_NAME} INTERFACE /STACK:5242880) # Stack requires up to L2 cache size and rebuild the code? Thank you.

Your modification is right: the Release version doesn't crash now. I will test my original prompt and compare it with llama.cpp on the Intel Core i9-13900. Thank you for your patience.

zhengshuo1 avatar Nov 23 '23 06:11 zhengshuo1

Thank you for your feedback. We have partially reproduced your performance issue on Win11 and will fix it in a few days. Furthermore, we discovered a large performance gap between WSL and Windows. If you want our expected performance right now, you can try our code in WSL.

Just as you mentioned, there is a performance gap between WSL and Windows. I wonder what the reason is, and whether it is possible to make the performance on Windows match that on WSL.

zhengshuo1 avatar Nov 23 '23 07:11 zhengshuo1

The performance gap comes from background processes and the system's Thread Director scheduling. Binding cores to the program improves performance by about 10%. On Linux or WSL you can use numactl -C 0,2,4,6,8,10,12,14 ./run_llama ... for your CPU. On Windows you can use start /high /affinity 5555 /b run_llama.exe ... (note: this command is only supported in cmd.exe; PowerShell does not support it). Thank you for your feedback.
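For example, with the arguments used earlier in this thread (the prompt text is a placeholder):

```sh
# Linux / WSL: bind to logical CPUs 0,2,4,...,14 (typically one thread per P-core)
numactl -C 0,2,4,6,8,10,12,14 ./run_llama -m ne-q4_j.bin -c 1500 -b 2048 -n 256 -t 8 -p "<your prompt>"
```

```bat
:: Windows (cmd.exe only): high priority, affinity mask 0x5555 selects logical CPUs 0,2,4,...,14
start /high /affinity 5555 /b run_llama.exe -m ne-q4_j.bin -c 1500 -b 2048 -n 256 -t 8 -p "<your prompt>"
```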

yuchengliu1 avatar Nov 23 '23 07:11 yuchengliu1

The performance gap comes from background processes and the system's Thread Director scheduling. Binding cores to the program improves performance by about 10%. On Linux or WSL you can use numactl -C 0,2,4,6,8,10,12,14 ./run_llama ... for your CPU. On Windows you can use start /high /affinity 5555 /b run_llama.exe ... (note: this command is only supported in cmd.exe; PowerShell does not support it). Thank you for your feedback.

Now the prompt eval speed is about 94 ms per token and the eval speed is about 216 ms per token on the Intel Core i9-13900 under Windows. According to your data there is still a big gap between Windows and WSL, and it is slower than llama.cpp. What else could I try?

zhengshuo1 avatar Nov 23 '23 09:11 zhengshuo1

You may check the thread number you set with -t; you can also see it in the "system_info" line of the output. 8 threads are suitable for your CPU. Can you provide the command and performance numbers of llama.cpp on your machine? We confirm the gap between Windows and WSL and will fix it later.

yuchengliu1 avatar Nov 24 '23 07:11 yuchengliu1

You may check the thread number you set with -t; you can also see it in the "system_info" line of the output. 8 threads are suitable for your CPU. Can you provide the command and performance numbers of llama.cpp on your machine? We confirm the gap between Windows and WSL and will fix it later.

Intel Core 13900, intel-extension-for-transformers on Windows:
model_print_timings: load time = 101892.48 ms
model_print_timings: sample time = 35.70 ms / 137 runs ( 0.26 ms per token)
model_print_timings: prompt eval time = 100773.12 ms / 1085 tokens ( 92.88 ms per token)
model_print_timings: eval time = 32618.47 ms / 136 runs ( 239.84 ms per token)
model_print_timings: total time = 134602.71 ms

llama.cpp on Windows:
llama_print_timings: load time = 1168.37 ms
llama_print_timings: sample time = 27.40 ms / 171 runs ( 0.16 ms per token)
llama_print_timings: prompt eval time = 58358.42 ms / 1063 tokens ( 54.90 ms per token)
llama_print_timings: eval time = 23040.85 ms / 170 runs ( 135.53 ms per token)
llama_print_timings: total time = 81520.31 ms

intel-extension-for-transformers on WSL:
model_print_timings: load time = 12506.71 ms
model_print_timings: sample time = 58.97 ms / 185 runs ( 0.32 ms per token)
model_print_timings: prompt eval time = 11292.21 ms / 1069 tokens ( 10.56 ms per token)
model_print_timings: eval time = 15795.92 ms / 184 runs ( 85.85 ms per token)
model_print_timings: total time = 28393.51 ms

llama.cpp on WSL:
llama_print_timings: load time = 253950.05 ms
llama_print_timings: sample time = 51.52 ms / 221 runs ( 0.23 ms per token)
llama_print_timings: prompt eval time = 35017.26 ms / 1068 tokens ( 32.79 ms per token)
llama_print_timings: eval time = 21137.37 ms / 220 runs ( 96.08 ms per token)
llama_print_timings: total time = 56286.25 ms

zhengshuo1 avatar Nov 24 '23 12:11 zhengshuo1

Thank you for supplying the data; it will help us improve ITREX. We also ran a test like this and saw some differences: ITREX has the gap mentioned above, while llama.cpp does not show a gap in our test. Did you set -t 8 for llama.cpp? It may behave differently because of the hybrid architecture.
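For a like-for-like comparison, llama.cpp would need the same thread count, e.g. something like (model file name and prompt are placeholders):

```sh
# llama.cpp main with 8 threads and the same context/batch/prediction settings
./main -m <llama-2-q4-model>.bin -c 1500 -b 2048 -n 256 -t 8 -p "<your prompt>"
```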

yuchengliu1 avatar Nov 26 '23 15:11 yuchengliu1