AutoAWQ
Support Weight-Only quantization on CPU device with QBits backend
Based on the suggestion in https://github.com/casper-hansen/AutoAWQ/issues/390, we have implemented inference of AWQ models on CPU devices. This PR adds support for Weight-Only quantization on CPU devices and inference with the QBits backend. QBits provides a 'bestla' kernel for the CPU GEMM op, and it is a module of the intel-extension-for-transformers package.
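As a rough illustration of how the new CPU path is meant to be used, here is a minimal sketch. The `use_qbits` keyword shown below is an assumption based on this PR's description rather than a confirmed API, and intel-extension-for-transformers must be installed for the QBits kernels to be available.

```python
# Minimal sketch (assumes this PR's `use_qbits` flag and an installed
# intel-extension-for-transformers providing the QBits/BestLa CPU kernels).
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "casperhansen/mistral-7b-instruct-v0.1-awq"  # any AWQ checkpoint

# `use_qbits=True` is assumed to route the 4-bit GEMMs through QBits on CPU.
model = AutoAWQForCausalLM.from_quantized(model_path, use_qbits=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```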
Hi @songhan, this PR is based on the RFC; could you help review it or assign it to the appropriate person?
Hi @PenghuiCheng and @zhewang1-intc. This is incredibly exciting work. I will attempt to find time soon to properly review this PR and to see how well it works on CPU. I will also try benchmarking the speed on my ~MacBook M2 Pro~ to see how well it can work for local models. EDIT: On another thought, I am not sure it will work on a Mac - will need to test.
It should only work on x86 CPUs; we tested on the Linux (Ubuntu) platform.
@casper-hansen Hi, we are not sure if we have done everything appropriately, but we look forward to your review. Please let us know if there is anything we can do to improve it :smile:
@casper-hansen We would be glad to hear your suggestions for this PR; they will help us better understand your requirements. Looking forward to your review.
I want to request performance benchmarks using examples/benchmark.py. Can you please run the benchmark so we can assess the speed of the implementation? We can use this to set expectations with users.
Sorry for taking so long. I have been moving apartments, so I have been AFK. I will make sure to prioritize this PR.
Below is the performance benchmark with examples/benchmark.py. These results are based on the master branch of the intel-extension-for-transformers (ITREX) repo; in the latest ITREX code, QBits was updated to the latest version of the BestLa kernel. The new version of ITREX will be released soon, and we will update AutoAWQ once it is released:
| Model | Batch Size | Prefill Length | Decode Length | Prefill tokens/s | Decode tokens/s | Memory (RAM) |
|---|---|---|---|---|---|---|
| casperhansen/mistral-7b-instruct-v0.1-awq | 1 | 64 | 64 | 389.24 | 16.01 | 5.59 GB (0.02%) |
| | 1 | 2048 | 2048 | 1412 | 17.76 | 6.29 GB (0.03%) |
| TheBloke/vicuna-7B-v1.5-AWQ | 1 | 64 | 64 | 346 | 18.13 | 8.18 GB (0.03%) |
| | 1 | 2048 | 2048 | 1023.4 | 18.18 | 8.80 GB (0.04%) |
| TheBloke/LLaMA2-13B-Tiefighter-AWQ | 1 | 64 | 64 | 160.24 | 9.87 | 14.65 GB (0.06%) |
| | 1 | 2048 | 2048 | 592.35 | 9.93 | 16.87 GB (0.07%) |
| abhinavkulkarni/mosaicml-mpt-7b-chat-w4-g128-awq | 1 | 64 | 64 | 433.17 | 18.79 | 4.60 GB (0.02%) |
| | 1 | 2048 | 2048 | 404.25 | 19.91 | 4.75 GB (0.02%) |
| casperhansen/falcon-7b-awq | 1 | 64 | 64 | 303.16 | 14.41 | 5.18 GB (0.02%) |
| | 1 | 2048 | 2048 | 634.57 | 15.55 | 5.80 GB (0.02%) |
| TheBloke/CodeLlama-34B-AWQ | 1 | 64 | 64 | 153.73 | 4.23 | 29.00 GB (0.12%) |
| | 1 | 2048 | 2048 | 274.25 | 4.38 | 35.21 GB (0.15%) |
| TheBloke/deepseek-coder-33B-instruct-AWQ | 1 | 64 | 64 | 83.08 | 4.07 | 22.16 GB (0.09%) |
| | 1 | 2048 | 2048 | 296.04 | 4.33 | 37.05 GB (0.16%) |
Note: we ran this benchmark on an INTEL(R) XEON(R) PLATINUM 8592+ with 8-channel 4800 MT/s memory.
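For readers who want to reproduce numbers like these, examples/benchmark.py in this repository is the authoritative script. The snippet below is only a hedged sketch of the underlying methodology (a random prompt of fixed prefill length, a fixed number of decoded tokens, wall-clock timing); the `use_qbits` keyword is again an assumption, not a confirmed parameter.

```python
# Hedged sketch of a tokens/s measurement; examples/benchmark.py is the
# real script and additionally times prefill and decode separately.
import time
import torch
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "casperhansen/mistral-7b-instruct-v0.1-awq"
model = AutoAWQForCausalLM.from_quantized(model_path, use_qbits=True)  # flag assumed
tokenizer = AutoTokenizer.from_pretrained(model_path)

prefill_len, decode_len = 64, 64
input_ids = torch.randint(0, tokenizer.vocab_size, (1, prefill_len))

start = time.perf_counter()
model.generate(
    input_ids=input_ids,
    max_new_tokens=decode_len,
    min_new_tokens=decode_len,  # force a fixed decode length for timing
    do_sample=False,
)
elapsed = time.perf_counter() - start

# End-to-end throughput only; splitting prefill vs. decode requires timing
# the first forward pass separately, as the benchmark script does.
print(f"{(prefill_len + decode_len) / elapsed:.2f} tokens/s end-to-end")
```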
Benchmarks are looking good for CPU! Thanks for providing them. Can you address the comments I have left?
But I don’t see any new comments 🤔
@casper-hansen Hi, could you list your comments? I only noticed two: (1) whether QBits works on Mac, and (2) performance data. For (1), we only support the x86 platform and do not support the ARM-based M1 chip; for (2), we have already updated the benchmark results.
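Since QBits/BestLa only targets x86_64, one possible way to guard the new code path is an architecture check like the sketch below; this is a generic illustration, not code taken from this PR.

```python
# Generic architecture guard for the QBits CPU path (illustrative only).
import platform

def qbits_supported() -> bool:
    # QBits/BestLa targets x86_64; ARM machines such as Apple Silicon are out.
    return platform.machine().lower() in ("x86_64", "amd64")

if not qbits_supported():
    raise RuntimeError(
        "The QBits CPU backend requires an x86_64 CPU; "
        "ARM platforms (e.g. Apple M-series) are not supported."
    )
```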
Sorry for the long delay. I have been away the past month, taking time off from open-source.
I pushed a small refactor to the setup to make it easier to install. I also tested the Llama 3 8B model; however, I think the CPU I selected is not well suited for LLM inference due to its low clock speed and low memory bandwidth.
| implementation | model | prefill tokens/s | decode tokens/s |
|---|---|---|---|
| intel extension | llama 3 8b | 5.17 | 1.32 |
| native pytorch | llama 3 8b | never finished | x |
Here is the CPU I was able to rent:
```
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 52 bits physical, 57 bits virtual
CPU(s): 17
On-line CPU(s) list: 0-16
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 17
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 25
Model: 160
Model name: AMD EPYC 9754 128-Core Processor
Stepping: 2
CPU MHz: 2246.622
BogoMIPS: 4493.24
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 1.1 MiB
L1i cache: 1.1 MiB
L2 cache: 8.5 MiB
L3 cache: 272 MiB
NUMA node0 CPU(s): 0-16
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr wbnoinvd arat npt nrip_save avx512vbmi umip pku avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm arch_capabilities
```
@casper-hansen Hi, we are preparing a blog post about ITREX accelerating AutoAWQ inference. Could you please tell us which Llama 3 8B AWQ Hugging Face model you tested on the AMD platform? We also want to benchmark Llama 3 8B.
This one should work fine. You have to set the ExLlamaV2 kernels to true to use it on AMD:
https://huggingface.co/casperhansen/llama-3-8b-instruct-awq
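For completeness, loading that checkpoint with the ExLlamaV2 kernels enabled might look like the sketch below; the `use_exllama_v2` keyword is an assumption and should be checked against the current `from_quantized` signature.

```python
# Sketch of enabling the ExLlamaV2 kernels as suggested above; the keyword
# name `use_exllama_v2` is an assumption, not a confirmed parameter.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "casperhansen/llama-3-8b-instruct-awq"
model = AutoAWQForCausalLM.from_quantized(model_path, use_exllama_v2=True)
tokenizer = AutoTokenizer.from_pretrained(model_path)
```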