[Question] About CPU performance
Hi, I am an engineer from Intel, and I work mostly on performance optimization of PyTorch on Intel Xeon CPUs (I am also the PyTorch module maintainer for CPU performance). I just came across this amazing project, and the chart in the blog post fast-llama-2-on-cpus-with-sparse-fine-tuning-and-deepsparse shows that DeepSparse accelerates the sparse-quantized Llama models to 6-8x faster than the dense FP32 baseline.
The 6-8x speedup of the sparse model over the dense model is a fascinating result. My goal is to check whether there is a chance to further improve the performance by applying our previous work on LLM optimizations.
I ran the script from https://github.com/neuralmagic/deepsparse?tab=readme-ov-file#try-it-now (a sketch of my repro is included after the questions below). However, from the hardware profiler I can tell that the hardware efficiency is still not very high: only ~12 cores are in use on average on a 40-core machine, leading to significant synchronization overhead and a very high CPI (cycles per instruction). Maybe I can do something to improve this, but I am not very familiar with this codebase, so I need some guidance here:
- How can I reproduce the above results?
- How is the model deployed? With ONNX Runtime?
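
For context, this is roughly what I ran, following the "Try It Now" section of the README. The SparseZoo stub is my guess at the model used in the blog, so please correct me if the 6-8x numbers were produced differently:

```python
# Minimal repro sketch based on the README's "Try It Now" section.
# The SparseZoo stub below is my assumption of the blog's model.
from deepsparse import TextGeneration

MODEL_STUB = "zoo:llama2-7b-gsm8k_llama2_pretrain-pruned60_quantized"

pipeline = TextGeneration(model=MODEL_STUB)

output = pipeline(prompt="How to make banana bread?", max_new_tokens=100)
print(output.generations[0].text)

# While this runs, the profiler shows only ~12 of 40 cores in use on average.
# Is there a supported way to tell the engine to use all cores, e.g. a
# num_cores argument or an environment variable for thread binding?
```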
Additionally, do you plan to continue this sparse fine-tuning work on other models, for example Llama 3? And how about INT4 quantization?