
IPEX takes 10min+ for "warmup" on MTL iGPU

Open · cold-blue opened this issue 10 months ago · 1 comment

Describe the issue

On an MTL iGPU, for any model (llama-7b, phi-2, etc.), IPEX has a very long warmup latency (about 10 minutes for FP32 and even longer for INT8). That means before you can do real inference, you have to run the model once for "warmup". This is only a one-time overhead if you keep the model resident in memory the whole time.

This is the latency information recorded for my GNN model:

F32: IPEX optimization time (ms): 11.531352996826172, IPEX warmup time (ms): 595983.88409614563

F16: IPEX optimization time (ms): 161458.64391326904, IPEX warmup time (ms): 537853.673673458932

INT8: IPEX optimization time (ms): 17.039060592651367, INT8 JIT converting time (ms): 557900.848865509, IPEX warmup time (ms): 881187.5286102295
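For context, this is roughly the measurement pattern I use (a minimal sketch; the small `Sequential` model, the input shape, and the `"xpu"` device string below are placeholders standing in for my actual GNN setup):

```python
import time
import torch
import intel_extension_for_pytorch as ipex

# Placeholder model standing in for the GNN; any nn.Module shows the same pattern.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 16)
).eval().to("xpu")
example_input = torch.randn(32, 128).to("xpu")

t0 = time.time()
model = ipex.optimize(model, dtype=torch.float32)
print(f"IPEX optimization time (ms): {(time.time() - t0) * 1000:.3f}")

t0 = time.time()
with torch.no_grad():
    model(example_input)   # first inference triggers kernel/JIT compilation
torch.xpu.synchronize()    # wait for device work so the timing is meaningful
print(f"IPEX warmup time (ms): {(time.time() - t0) * 1000:.3f}")
```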

The same problem can be reproduced with any of the GPU LLM examples in IPEX-LLM, for example: https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/HF-Transformers-AutoModels/Model/phi-2/generate.py

I think this is not acceptable for real user scenarios.

cold-blue · Apr 24 '24 02:04

Solved by setting `SYCL_CACHE_PERSISTENT=1`. With the persistent cache enabled, only the first inference run takes a long time; subsequent runs are much faster.
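In a Python script, the variable can also be set at the top, before any GPU work (a minimal sketch, assuming the SYCL runtime has not been initialized yet when the variable is set):

```python
import os

# Enable the persistent SYCL kernel cache before any XPU work, so kernels
# compiled during the first run are written to disk and reused on later runs.
os.environ["SYCL_CACHE_PERSISTENT"] = "1"

import torch
import intel_extension_for_pytorch as ipex  # imported after setting the env var on purpose
```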

cold-blue · Apr 26 '24 06:04