
IPEX.LLM performance benchmark and 3rd Gen Intel Xeon

Open mgajzler opened this issue 1 year ago • 3 comments

Describe the issue

Hi. I would like to confirm which IPEX.LLM benchmark scenarios from the 2.2 release are supposed to work properly on servers with 3rd Gen Intel Xeon (codename: Ice Lake).

Currently I get mostly failures when trying to run the scenarios in question with the Llama2 model on a 3rd Gen Intel Xeon server. I know 3rd Gen Xeon has AVX-512/VNNI available but not AMX; however, the IPEX documentation does not make clear which scenarios (including the distributed ones with DeepSpeed) are supposed to work without AMX.
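As a quick check (stdlib only, nothing IPEX-specific), this is one way to confirm which of the relevant instruction sets the guest actually exposes; on Ice Lake you should see avx512_vnni but neither avx512_bf16 nor any amx_* flag:

```python
# Read the CPU feature flags the (virtualized) guest exposes; the names
# below are the standard /proc/cpuinfo flag names on Linux.
def cpu_flags():
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
for feature in ("avx512f", "avx512_vnni", "avx512_bf16",
                "amx_tile", "amx_bf16", "amx_int8"):
    print(f"{feature:12s} {'yes' if feature in flags else 'no'}")
```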

So please confirm support for each point, from 4.1.1.2 (Run in FP32 with ipex.llm) to 4.1.1.8 (Run in weight-only quantization INT8 with ipex.llm in a distributed way). Thank you.
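For context, these scenarios all go through ipex.llm.optimize; a minimal single-instance sketch of the FP32/BF16 path looks roughly like this (the model id is illustrative only, and the actual benchmark scripts add more around it):

```python
import torch
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

# dtype selects the compute precision; torch.float32 for the FP32 scenario.
model = ipex.llm.optimize(model, dtype=torch.bfloat16, inplace=True)

inputs = tokenizer("An example prompt", return_tensors="pt")
with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    out = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```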

mgajzler avatar Feb 28 '24 16:02 mgajzler

What type of failures are you running into? IPEX should be functional on ICX; you just need to ensure the platform has the proper instruction sets for the lower-precision datatypes (i.e., AVX512-BF16/VNNI).

kminhta avatar Feb 28 '24 21:02 kminhta

AFAIK, BF16 is not supported/accelerated by AVX512_VNNI on 3rd Gen Intel Xeon (codename: Ice Lake); on 3rd Gen, VNNI accelerates INT8 only.

Please confirm: does the "--quant-with-amp" parameter require AMX (which is available only on 4th and 5th Gen Intel Xeon)?

It is probably worth mentioning that I'm running these IPEX.LLM benchmark scenarios on virtualized Linux, in VMs on top of VMware vSphere (latest version, 8.0 U2).

I will share the outcomes in my next comment...

mgajzler avatar Feb 29 '24 18:02 mgajzler

Let's skip the single-instance scenarios for now, as "distributed" inference via DeepSpeed is more interesting for dual-socket systems.
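(For reference, the distributed runs below follow the usual DeepSpeed tensor-parallel pattern; this is only a sketch, assuming deepspeed.init_inference with tp_size 2 to match the two sockets, and the actual benchmark scripts may wire it up differently:)

```python
import torch
import deepspeed
import intel_extension_for_pytorch as ipex
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16)  # illustrative checkpoint
# Shard the model across 2 ranks (one per socket / vNUMA node) ...
engine = deepspeed.init_inference(model, tensor_parallel={"tp_size": 2},
                                  dtype=torch.bfloat16)
# ... then apply IPEX's LLM optimizations to each shard.
model = ipex.llm.optimize(engine.module, dtype=torch.bfloat16, inplace=True)
```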

On a virtualized Linux VM (details below) on top of VMware vSphere (8.0 U2), IPEX.LLM 2.2 with DeepSpeed produces these results for Llama2-7B inference (input token size: 1024, output token size: 256, batch size: 1):

| Precision | Inference latency | 1st token latency | 2nd token latency |
|---|---|---|---|
| FP32 | 67.475 s | 0.432 s | 0.263 s |
| BF16 | 23.150 s | 0.215 s | 0.090 s |
| INT8 weight-only quantization, without AMP (no --quant-with-amp) | 331.895 s | 2.278 s | 1.293 s |
| INT8 weight-only quantization, with AMP (--quant-with-amp) | 328.818 s | 2.198 s | 1.281 s |

I can share more token/batch-size combinations; however, based on the above, it seems something is wrong with INT8 weight-only quantization (at least for Llama2). Please verify this on your side (on bare metal), if possible. Or at least please confirm whether INT8 weight-only quantization is supported on systems with 3rd Gen Intel Xeon (codename: Ice Lake). Thank you.
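As a sanity check on the numbers above: each total is consistent with a simple first-token + 255 × next-token decomposition, so the INT8 slowdown lives in the per-token path itself (next-token latency is ~14x BF16's) rather than in some one-off warm-up overhead:

```python
# Reported numbers from the runs above: (1st token s, next-token s, total s)
runs = {
    "FP32":           (0.432, 0.263, 67.475),
    "BF16":           (0.215, 0.090, 23.150),
    "INT8 WOQ":       (2.278, 1.293, 331.895),
    "INT8 WOQ + AMP": (2.198, 1.281, 328.818),
}
n_out = 256  # output token count used in the benchmark
for name, (first, nxt, total) in runs.items():
    est = first + (n_out - 1) * nxt  # simple decomposition of total latency
    print(f"{name:14s} estimated {est:8.3f} s vs reported {total:8.3f} s")
```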

More details: the VM has 32 vCPUs assigned: 2 vSockets (and 2 vNUMA nodes), with 16 vCores per vSocket. For each precision, IPEX.LLM 2.2 fully utilized all 32 vCores during its runs.
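For completeness, the mapping of vCPUs to vNUMA nodes inside the guest can be confirmed with a stdlib-only snippet (standard Linux sysfs paths):

```python
import glob, os

# Show how the guest's vCPUs map to vNUMA nodes and which CPUs the
# current process is allowed to run on.
for path in sorted(glob.glob("/sys/devices/system/node/node*/cpulist")):
    node = path.split("/")[-2]
    with open(path) as f:
        print(node, f.read().strip())
print("affinity:", sorted(os.sched_getaffinity(0)))
```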

collect_env.py output:

Collecting environment information...
PyTorch version: 2.2.0+cpu
PyTorch CXX11 ABI: No
IPEX version: 2.2.0+cpu
IPEX commit: 211813b
Build type: Release

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: N/A
IGC version: N/A
CMake version: version 3.26.4
Libc version: glibc-2.35

Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-100-generic-x86_64-with-glibc2.35
Is XPU available: False
DPCPP runtime version: N/A
MKL version: N/A
GPU models and configuration: 

Intel OpenCL ICD version: N/A
Level Zero version: N/A

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      45 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             32
On-line CPU(s) list:                0-31
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
CPU family:                         6
Model:                              106
Thread(s) per core:                 1
Core(s) per socket:                 16
Socket(s):                          2
Stepping:                           6
BogoMIPS:                           5786.40
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm md_clear flush_l1d arch_capabilities
Hypervisor vendor:                  VMware
Virtualization type:                full
L1d cache:                          1.5 MiB (32 instances)
L1i cache:                          1 MiB (32 instances)
L2 cache:                           40 MiB (32 instances)
L3 cache:                           48 MiB (2 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-15
NUMA node1 CPU(s):                  16-31
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit:        KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] intel-extension-for-pytorch==2.2.0+cpu
[pip3] numpy==1.26.4
[pip3] torch==2.2.0+cpu
[conda] intel-extension-for-pytorch 2.2.0+cpu                pypi_0    pypi
[conda] mkl                       2023.1.0         h213fc3f_46344  
[conda] numpy                     1.26.4                   pypi_0    pypi
[conda] torch                     2.2.0+cpu                pypi_0    pypi

mgajzler avatar Mar 20 '24 16:03 mgajzler