IPEX takes 10min+ for "warmup" on MTL iGPU
Describe the issue
On MTL iGPU, for any model (llama-7b, phi-2, etc.), IPEX needs long time latency (10min for f32 and longer for INT8). That means before you perform real inference, you need to run the model one time for “warmup” first. This will be a one-time overhead if you plan to maintain the model in your memory all the time.
Here is the latency information recorded for my GNN model:

| Precision | IPEX optimization time (ms) | INT8 JIT converting time (ms) | IPEX warmup time (ms) |
| --- | --- | --- | --- |
| FP32 | 11.531352996826172 | n/a | 595983.88409614563 |
| FP16 | 161458.64391326904 | n/a | 537853.673673458932 |
| INT8 | 17.039060592651367 | 557900.848865509 | 881187.5286102295 |
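For context, the timings above were collected roughly as in the sketch below. The model here is a stand-in for my actual GNN, and the input shape is a placeholder:

```python
import time
import torch
import intel_extension_for_pytorch as ipex

# Placeholder model and input; my real GNN is used in practice.
model = torch.nn.Sequential(torch.nn.Linear(128, 128), torch.nn.ReLU()).eval().to("xpu")
example_input = torch.randn(1, 128).to("xpu")

# "IPEX optimization time"
t0 = time.time()
model = ipex.optimize(model, dtype=torch.float32)
print(f"IPEX optimization time (ms): {(time.time() - t0) * 1000}")

# "IPEX warmup time": the first forward pass triggers GPU kernel compilation.
t0 = time.time()
with torch.no_grad():
    model(example_input)
torch.xpu.synchronize()
print(f"IPEX warmup time (ms): {(time.time() - t0) * 1000}")
```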
The same problem can also be reproduced with any of the GPU LLM examples in IPEX-LLM, for example: https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-2/generate.py
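A minimal sketch along the lines of that phi-2 example is shown below; the exact arguments are assumptions based on the ipex-llm documentation, not copied from the linked script:

```python
import torch
from transformers import AutoTokenizer
from ipex_llm.transformers import AutoModelForCausalLM

model_path = "microsoft/phi-2"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_path,
                                             load_in_4bit=True,
                                             trust_remote_code=True)
model = model.to("xpu")

input_ids = tokenizer("What is AI?", return_tensors="pt").input_ids.to("xpu")
with torch.inference_mode():
    # This first generate() call is the "warmup" that takes 10+ minutes on the MTL iGPU.
    output = model.generate(input_ids, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```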
I think this is not acceptable for real-world use cases.
Solved by setting SYCL_CACHE_PERSISTENT=1. With the persistent SYCL cache enabled, only the very first inference run takes a long time; subsequent runs are much faster.
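For anyone else hitting this, a sketch of the workaround from Python; setting the variable before IPEX is imported is an assumption on my part, and exporting it in the shell before launching the script works as well:

```python
import os

# Enable the persistent SYCL kernel cache so JIT-compiled GPU kernels are
# written to disk on the first run and reused on later runs.
os.environ["SYCL_CACHE_PERSISTENT"] = "1"

import torch
import intel_extension_for_pytorch as ipex  # imported after setting the variable
```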