ipex-llm Arc A series configuration issue about export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1

After setting this config, the next-token latency of ChatGLM3-6B got slower, going from 20.4 ms to 34.5 ms with a 1k input prompt and 512 token size.

Result before setting it:

<class 'transformers_modules.modeling_chatglm.ChatGLMForConditionalGeneration'>
=========First token cost 4.1437 s and 5.5625 GB=========
=========Rest tokens cost average 0.0208 s (159 tokens in all) and 5.5625 GB=========
=========First token cost 0.3410 s and 5.5625 GB=========
=========Rest tokens cost average 0.0204 s (159 tokens in all) and 5.5625 GB=========

After export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1, the result is as below:

<class 'transformers_modules.modeling_chatglm.ChatGLMForConditionalGeneration'>
=========First token cost 4.1536 s and 5.5625 GB=========
=========Rest tokens cost average 0.0350 s (159 tokens in all) and 5.5625 GB=========
=========First token cost 0.3509 s and 5.5625 GB=========
=========Rest tokens cost average 0.0347 s (159 tokens in all) and 5.5625 GB=========
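For anyone comparing their own runs, the slowdown can be read straight off the "Rest tokens cost average" lines. Below is a minimal sketch that parses them and compares the last (warm) run of each log. The helper names are made up for illustration and are not part of ipex-llm; the sample strings are the log lines quoted in this post.

```python
import re

# Matches lines such as "Rest tokens cost average 0.0208 s (159 tokens in all)".
REST_RE = re.compile(r"Rest tokens cost average ([0-9.]+) s")


def rest_token_latencies_ms(log_text):
    """Return every reported rest-token average latency, in milliseconds."""
    return [float(value) * 1000.0 for value in REST_RE.findall(log_text)]


def slowdown(baseline_log, changed_log):
    """Compare the last reported run of each log; returns (base_ms, changed_ms, ratio)."""
    base = rest_token_latencies_ms(baseline_log)[-1]
    changed = rest_token_latencies_ms(changed_log)[-1]
    return base, changed, changed / base


# Log lines copied from the report above.
baseline_log = """
=========Rest tokens cost average 0.0208 s (159 tokens in all) and 5.5625 GB=========
=========Rest tokens cost average 0.0204 s (159 tokens in all) and 5.5625 GB=========
"""
immediate_log = """
=========Rest tokens cost average 0.0350 s (159 tokens in all) and 5.5625 GB=========
=========Rest tokens cost average 0.0347 s (159 tokens in all) and 5.5625 GB=========
"""

base_ms, changed_ms, ratio = slowdown(baseline_log, immediate_log)
print(f"{base_ms:.1f} ms -> {changed_ms:.1f} ms ({ratio:.2f}x)")
```

Using the last run rather than the first is a choice made here to skip warm-up effects; the first-token numbers are nearly identical in both logs, so only the rest-token latency is compared.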

Feb 18 '24 06:02 Fred-cell

Same issue as https://github.com/intel-analytics/BigDL/issues/10133. This report further narrows it down: the slowdown is caused by SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS alone. We will look into this in your environment.

Update: https://github.com/analytics-zoo/nano/issues/802#issuecomment-1837953169 It seems this issue was previously observed in the user's environment, and it may not be specific to chatglm3.
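To check on another machine that this one variable is what matters, the same benchmark can be run twice with an otherwise identical environment. A rough sketch follows; `build_env` and `run_ab` are hypothetical helpers, the benchmark command is whatever you normally run, and treating "unset" as the baseline is an assumption, since the report does not say whether the baseline had the variable unset or set to 0.

```python
import os
import subprocess

VAR = "SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS"


def build_env(value, base_env=None):
    """Copy the environment and set VAR to `value`, or remove it when value is None."""
    env = dict(os.environ if base_env is None else base_env)
    if value is None:
        env.pop(VAR, None)
    else:
        env[VAR] = str(value)
    return env


def run_ab(cmd):
    """Run `cmd` (a list of arguments) once per setting and return the captured stdout."""
    results = {}
    for label, value in (("unset", None), ("1", "1")):
        proc = subprocess.run(
            cmd, env=build_env(value), capture_output=True, text=True
        )
        results[label] = proc.stdout
    return results
```

Calling `run_ab` with your usual benchmark command returns the two logs keyed by "unset" and "1", which can then be compared line by line.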

Feb 19 '24 01:02 hkvision

I can reproduce this issue on this machine. Comparison with our machine (arc08):

driver versions are the same
kernel 6.5 vs 6.2 (ours)
CPU: Intel(R) Xeon(R) w5-3435X vs 13th Gen Intel(R) Core(TM) i9-13900K (ours)

Most likely this is because the current driver has not been adapted to kernel versions higher than 6.2.
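A quick way to see which side of that line a machine falls on is to compare its kernel release with 6.2. The sketch below does only that. The 6.2 threshold is just the hypothesis from this comment, not a documented driver limit, and the function names are illustrative.

```python
import platform
import re

# Kernel version of the machine where the slowdown was NOT seen (per this thread).
KNOWN_GOOD = (6, 2)


def kernel_major_minor(release):
    """Parse a release string such as '6.5.0-17-generic' into (6, 5)."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def newer_than_known_good(release, known_good=KNOWN_GOOD):
    """True if the kernel is newer than the version where no slowdown was observed."""
    return kernel_major_minor(release) > known_good


if __name__ == "__main__":
    release = platform.release()
    try:
        flag = newer_than_known_good(release)
    except ValueError:
        print(f"Could not parse kernel release {release!r}")
    else:
        print(f"kernel {release}: newer than 6.2 = {flag}")
```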

Feb 19 '24 08:02 hkvision

Fixed in: https://github.com/intel-analytics/ipex-llm/pull/10566

Apr 02 '24 11:04 hkvision
