
No performance difference observed using HF Transformers vs intel-extension-for-transformers Python API/library

Open · amir1m opened this issue 1 year ago · 3 comments

Hello, I am running a PEFT + quantized (BitsAndBytes) Falcon model.

I am trying to follow the instructions for using the Python API.

While I am able to load the model on my Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz dockerized server, I don't see any inference speed difference between using Hugging Face Transformers directly and using the Intel extension Python library.
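For reference, a minimal sketch of the two loading paths I am comparing (the model name and options are illustrative; the second import follows the intel-extension-for-transformers documented Python API, but exact arguments may differ by version):

```python
from transformers import AutoTokenizer

model_name = "tiiuae/falcon-7b"  # illustrative; my actual checkpoint is a PEFT + BitsAndBytes Falcon model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids

# Path A: stock Hugging Face Transformers
from transformers import AutoModelForCausalLM as HFModel
hf_model = HFModel.from_pretrained(model_name, trust_remote_code=True)
hf_out = hf_model.generate(inputs, max_new_tokens=32)

# Path B: intel-extension-for-transformers Python API (weight-only 4-bit path)
from intel_extension_for_transformers.transformers import AutoModelForCausalLM as ITREXModel
itrex_model = ITREXModel.from_pretrained(model_name, load_in_4bit=True, trust_remote_code=True)
itrex_out = itrex_model.generate(inputs, max_new_tokens=32)
```

Timing the two generate calls shows no meaningful difference on this machine.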

Is the main purpose of this library to quantize models and generate model binaries to be run on Intel CPUs, or is it supposed to improve inference speed as well? If inference speed is not a focus of this project, is there any library or method that can help speed up inference on Intel CPUs?

Looking forward to solutions if I am missing something, or to other possible approaches.

Thanks.

amir1m avatar Nov 23 '23 10:11 amir1m

Hi, thanks for using the project. I see that your CPU is Skylake, which has relatively limited ISA support, so there is less optimization we can do on this kind of CPU. Could you try an ICX or SPR CPU? We have many advanced performance optimizations based on the VNNI/AMX ISAs. You would get MHA/FFN fusion, advanced JIT kernel dispatch, and other features that further improve performance.
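For anyone unsure which ISA extensions their host exposes, a quick check on Linux (a sketch only; the flag names are as reported in /proc/cpuinfo):

```python
# Print whether the CPU advertises the ISA extensions mentioned above (Linux only).
with open("/proc/cpuinfo") as f:
    flags = set()
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break

for isa in ("avx2", "avx512f", "avx512_vnni", "amx_tile", "amx_int8", "amx_bf16"):
    print(f"{isa:12s} {'yes' if isa in flags else 'no'}")
```

Skylake-generation Xeons like the 8167M typically report avx512f but not avx512_vnni or the amx_* flags, which is where most of the extra speedup comes from.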

Thanks

a32543254 avatar Nov 24 '23 01:11 a32543254

Does this support the Intel 10700?

PedzacyKapec avatar Nov 24 '23 09:11 PedzacyKapec

The 10700 has AVX2, and the LLM Runtime supports it. It's not just a tool for generating quantized models; we provide fusion and kernels for inference. You can check the performance data in https://medium.com/@NeuralCompressor/llm-performance-of-intel-extension-for-transformers-f7d061556176

We will check the status of the Platinum 8167M, but we need time to find the machine. Can you share your model?

kevinintel avatar Nov 24 '23 09:11 kevinintel

We have improved performance on client CPUs, but still can't find an 8167M machine. Closing this issue for now; you can try the performance again with the latest version.
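If anyone wants to re-run the comparison, a rough timing helper (a sketch only; it assumes a transformers-style generate interface returning a token tensor and measures end-to-end throughput):

```python
import time

def tokens_per_second(model, tokenizer, prompt, new_tokens=128):
    # Time greedy decoding end to end and report generated tokens per second.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    start = time.time()
    output = model.generate(input_ids, max_new_tokens=new_tokens, do_sample=False)
    elapsed = time.time() - start
    generated = output.shape[-1] - input_ids.shape[-1]
    return generated / elapsed
```

Calling this once with the stock Transformers model and once with the intel-extension-for-transformers model makes the comparison concrete.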

kevinintel avatar Jun 05 '24 07:06 kevinintel