
intel-extension-for-pytorch 2.1.0.dev+cpu.llm experiment reproduction

Open · duyanyao opened this issue 1 year ago · 1 comment

Describe the issue

Hello, I have recently been running the LLaMA inference example (https://intel.github.io/intel-extension-for-pytorch/llm/cpu/) and am hoping that with INT8 quantization I can reach the 35 ms/token mentioned in this article: https://www.intel.com/content/www/us/en/developer/articles/technical/accelerate-llama2-ai-hardware-sw-optimizations.html. At the moment I am testing the example on AWS on an m7i.metal-24xl instance and measuring a latency of 45 ms/token. How can I reproduce the 35 ms/token result on AWS?

The environment in the article is: 4th Gen Intel Xeon 8480, 2 sockets, 112 cores, 224 threads.

My experimental environment is: AWS instance (m7i.metal-24xl), 4th Gen Intel Xeon 8488C, 1 socket, 48 cores, 96 threads.
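For reference, a minimal sketch of the BF16 optimization path that the linked example exercises (assumptions: ipex >= 2.2 and the Hugging Face transformers package are installed, and Llama-2 7B is the model under test; the 2.1.0.dev build in the issue title may expose this as ipex.optimize_transformers instead, and the INT8 smooth-quant path additionally requires a quantization recipe, as covered in the example repository):

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: Llama-2 7B is the checkpoint being benchmarked; swap in
# the model actually under test.
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# Apply the LLM-specific graph and kernel optimizations (ipex >= 2.2 API).
model = ipex.llm.optimize(model, dtype=torch.bfloat16)

inputs = tokenizer("What is AI?", return_tensors="pt")
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```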

duyanyao avatar Feb 26 '24 02:02 duyanyao

Hi, I suggest taking a look at https://github.com/intel/intel-extension-for-pytorch/tree/main/examples/cpu/inference/python/llm for the best-known methods (BKMs) for our LLM optimizations, as well as https://intel.github.io/intel-extension-for-pytorch/cpu/2.2.0+cpu/tutorials/performance_tuning/tuning_guide.html for performance tuning.

In general, performance depends on many factors, and it may be challenging to replicate published results exactly unless you have an identical or very similar setup. Note in particular that the published 35 ms/token was measured on a 2-socket, 112-core system, while the m7i.metal-24xl exposes a single socket with 48 cores.
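As an illustration of what the tuning guide covers, here is a minimal sketch of pinning PyTorch's intra-op threads to the physical cores of the instance described above (the 48-core count is taken from the m7i.metal-24xl description in this thread; the KMP_* variables only take effect when Intel OpenMP is the active runtime, typically preloaded via LD_PRELOAD of libiomp5):

```python
import os

# Threading knobs must be set before torch is imported so the OpenMP
# runtime picks them up. Values assume the 48 physical cores / 96
# threads reported for the m7i.metal-24xl above.
os.environ.setdefault("OMP_NUM_THREADS", "48")
os.environ.setdefault("KMP_AFFINITY", "granularity=fine,compact,1,0")
os.environ.setdefault("KMP_BLOCKTIME", "1")

import torch  # noqa: E402  (imported after the env vars on purpose)

# Keep intra-op parallelism on physical cores only; hyper-threads
# usually hurt throughput for compute-bound LLM inference.
torch.set_num_threads(48)
print(torch.__config__.parallel_info())  # verify the settings took effect
```

The tuning guide also recommends a high-performance memory allocator (tcmalloc or jemalloc) and NUMA binding via numactl; on a single-socket instance like this one, the core pinning is likely to matter more than the NUMA binding.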

kminhta avatar Feb 27 '24 22:02 kminhta

Hi @duyanyao, were you able to achieve the reported performance numbers using the suggested BKMs?

srinarayan-srikanthan avatar Aug 26 '24 03:08 srinarayan-srikanthan