ENV setting issue for Arc A770 when running chatglm3-6b inference
The configuration below does not improve performance; it actually makes inference slower.

```bash
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```
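For reference, these variables must be in the process environment before the SYCL/Level Zero runtime initializes. If you launch the benchmark from Python rather than a shell, a minimal sketch (assuming a plain torch + IPEX setup) is to set them before any GPU-related import:

```python
import os

# Must be set before torch / intel_extension_for_pytorch are imported,
# since the Level Zero runtime reads them at startup.
os.environ["USE_XETLA"] = "OFF"
os.environ["SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS"] = "1"

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401
```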
With these environment variables set, the performance is as follows:
```
2024-02-08 13:46:16,388 - INFO - Converting the current model to sym_int4 format......
<class 'transformers_modules.modeling_chatglm.ChatGLMForConditionalGeneration'>
=========First token cost 4.7585 s and 5.5625 GB=========
=========Rest tokens cost average 0.0341 s (159 tokens in all) and 5.5625 GB=========
=========First token cost 0.3584 s and 5.5625 GB=========
=========Rest tokens cost average 0.0326 s (159 tokens in all) and 5.5625 GB=========
Inference time: 5.5508105754852295 s
```
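For context, the "Converting the current model to sym_int4 format" line is printed when the model is loaded through BigDL-LLM's 4-bit path. A minimal sketch of such a load (the model path and device placement here are illustrative, not taken from the benchmark script):

```python
from bigdl.llm.transformers import AutoModel
from transformers import AutoTokenizer

# load_in_4bit=True quantizes the weights to sym_int4 on load.
model = AutoModel.from_pretrained("THUDM/chatglm3-6b",
                                  load_in_4bit=True,
                                  trust_remote_code=True)
model = model.to("xpu")

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm3-6b",
                                          trust_remote_code=True)
```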
With these environment variables removed, the performance is as follows:
```
2024-02-08 13:47:39,833 - INFO - Converting the current model to sym_int4 format......
<class 'transformers_modules.modeling_chatglm.ChatGLMForConditionalGeneration'>
=========First token cost 4.1724 s and 5.5625 GB=========
=========Rest tokens cost average 0.0205 s (159 tokens in all) and 5.5625 GB=========
=========First token cost 0.3476 s and 5.5625 GB=========
=========Rest tokens cost average 0.0202 s (159 tokens in all) and 5.5625 GB=========
Inference time: 3.5677762031555176 s
```
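The doubled First/Rest lines in each log (a slow pair followed by a fast pair) are consistent with a warm-up run followed by the measured run, since the first run includes one-time kernel compilation. Below is a generic sketch of how first-token and rest-token latencies can be measured; it is not the all-in-one benchmark's actual code, and `time_first_and_rest` is a hypothetical helper:

```python
import time

import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  # enables torch.xpu

def time_first_and_rest(model, tokenizer, prompt, max_new_tokens=160):
    """Hypothetical helper: returns (first-token latency, avg rest-token latency)."""
    inputs = tokenizer(prompt, return_tensors="pt").to("xpu")

    # First token: time a single decoding step.
    torch.xpu.synchronize()
    t0 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1)
    torch.xpu.synchronize()
    first = time.perf_counter() - t0

    # Full generation; attribute everything beyond the first step to the rest
    # tokens, assuming the first step costs about the same as the run above.
    t0 = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.xpu.synchronize()
    total = time.perf_counter() - t0

    n_rest = out.shape[1] - inputs["input_ids"].shape[1] - 1
    return first, (total - first) / max(n_rest, 1)
```

Calling the helper once untimed before recording numbers reproduces the warm-up pattern seen in the logs.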
Hi @Fred-cell
It seems that I can't reproduce your issue on our machine (arc09:/home/arda/kai/BigDL/python/llm/dev/benchmark/all-in-one, kernel 6.2, ipex 2.1, bigdl-llm 2.5.0b20240207).
With these two environment variables, the rest-token latency on our machine improves by 1~2 ms. We will verify this issue on your machine after the holiday.
Fixed in https://github.com/intel-analytics/ipex-llm/pull/10566