example: int4 weight decompression
Description
oneDNN supports the INT4 AutoGPTQ and AWQ quantization features. This oneDNN example demonstrates MatMul INT4 weight decompression support and shows how to configure the APIs for the AutoGPTQ and AWQ quantization features. The request originally came from the IPEX team: "AWQ (activation-aware quantization) is very popular in the community and we need to support it. We need the oneDNN INT4 GEMM API to support the input packing approach below. The weights are packed in the N direction, [K, N/8]; zero points are packed in both K and N, [K/G, N/8]; scales are in the K direction, [K/G, N]. The input data type of weights and zero points is int32 and scales are fp16."
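For orientation, here is a minimal sketch of how a MatMul with grouped int4 weight decompression is typically configured through the oneDNN C++ API. This is an illustration, not the shipped example's code; the helper name, shapes, and group size `G` are assumptions:

```cpp
// Minimal sketch (not the shipped example): an f16 MatMul whose s4
// weights are decompressed using grouped f16 scales and s4 zero points.
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;

matmul::primitive_desc make_int4_decompression_pd(const engine &eng,
        memory::dim M, memory::dim K, memory::dim N,
        memory::dim G /* group size along K */) {
    memory::desc src_md({M, K}, memory::data_type::f16, memory::format_tag::ab);
    memory::desc wei_md({K, N}, memory::data_type::s4, memory::format_tag::ab);
    memory::desc dst_md({M, N}, memory::data_type::f16, memory::format_tag::ab);

    primitive_attr attr;
    // One f16 scale per G x 1 block of weights (grouped along K).
    attr.set_scales(DNNL_ARG_WEIGHTS, /*mask=*/(1 << 0) + (1 << 1), {G, 1},
            memory::data_type::f16);
    // s4 zero points, also grouped along K.
    attr.set_zero_points(DNNL_ARG_WEIGHTS, /*mask=*/(1 << 0) + (1 << 1),
            {G, 1}, memory::data_type::s4);
    // Allow f16 math on the integer weights (weight decompression).
    attr.set_fpmath_mode(fpmath_mode::f16, /*apply_to_int=*/true);

    return matmul::primitive_desc(eng, src_md, wei_md, dst_md, attr);
}
```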
Checklist
General
- [x] Do all unit and benchdnn tests (`make test` and `make test_benchdnn_*`) pass locally for each commit?
- [x] Have you formatted the code using clang-format?
Performance improvements
- [ ] Have you submitted performance data that demonstrates performance improvements? Not yet
New features
- [ ] Have you published an RFC for the new feature? No
- [ ] Was the RFC approved? N/A
- [ ] Have you added relevant tests? N/A
Bug fixes
- [ ] Have you included information on how to reproduce the issue (either in a github issue or in this PR)?
- [ ] Have you added relevant regression tests?
RFC PR
- [ ] Does RFC document follow the template?
- [ ] Have you added a link to the rendered document?
The file name of the example, `int4_weight_decompression_cmnts.cpp`, doesn't seem good. What does "cmnts" stand for?
Removed `int4_weight_decompression_cmnts.cpp` and added `int4_weight_decompression.cpp`.
@rupakroyintel, please make sure commits in your branch comply with contributing guidelines and do not contain merge commits.
@theComputeKid, @mgouicem, looks like `PR Checks / title` does not catch issues with the commit history...
Let me see what goes wrong in the jobs. I checked out the branch and ran the check locally, and it properly catches the first improper message:
```
> git remote add rupakroy https://github.com/rupakroyintel/oneDNN.git
> git fetch rupakroy
> git co add_int4_decompression_example
Updating files: 100% (776/776), done.
branch 'add_int4_decompression_example' set up to track 'rupakroy/add_int4_decompression_example'.
Switched to a new branch 'add_int4_decompression_example'
> python3 ./.github/automation/commit-msg-check.py "1abe160095ef52c7ad879b75331dbe4b4e17be6d" "1fe8ee54b18c764d32932d21e776a86f46a6d0cf"
msg: Merge branch 'add_int4_decompression_example' of https://github.com/rupakroyintel/oneDNN into add_int4_decompression_example
Traceback (most recent call last):
  File "./.github/automation/commit-msg-check.py", line 82, in <module>
    main()
  File "./.github/automation/commit-msg-check.py", line 77, in main
    __numCharacterCheck(commit_msg)
  File "./.github/automation/commit-msg-check.py", line 58, in __numCharacterCheck
    raise ValueError(
ValueError: Please see contribution guidelines. Message summary must be less than 72. Got: 124
```
@vpirogov @dzarukin We tried packing 8 int4 values into a single int value. However, it looks like the zero-points attribute `wei:per_ocic:s4:32x8` is not supported. Here is the output from benchdnn:
```
./tests/benchdnn/benchdnn --matmul --engine=gpu --dt=f16:s4:f16 --stag=any --wtag=abc --dtag=acb --attr-scales=wei:per_ocic:f16:32x1 --attr-zero-points=wei:per_ocic:s4:32x8 --attr-fpmath=f16:true 7x24x32:7x32x64
Error: Function 'check_dnnl_status' at (/home/intel/rroy/int4_decompression/oneDNN/tests/benchdnn/dnnl_common.hpp:327) returned 'unimplemented'
Error: Function 'create_primitive' at (/home/intel/rroy/int4_decompression/oneDNN/tests/benchdnn/dnnl_common.hpp:401) returned '1'
Error: Function 'init_prim' at (/home/intel/rroy/int4_decompression/oneDNN/tests/benchdnn/dnnl_common.hpp:471) returned '1'
Error: Function 'createit' at (/home/intel/rroy/int4_decompression/oneDNN/tests/benchdnn/matmul/matmul.cpp:881) returned '1'
Error: Function 'create' at (/home/intel/rroy/int4_decompression/oneDNN/tests/benchdnn/utils/task.hpp:49) returned '1'
0:UNIMPLEMENTED __REPRO: --matmul --engine=gpu --dt=f16:s4:f16 --wtag=abc --dtag=acb --attr-scales=wei:per_ocic:f16:32x1 --attr-zero-points=wei:per_ocic:s4:32x8 --attr-fpmath=f16:true 7x24x32:7x32x64
tests:1 passed:0 skipped:0 mistrusted:0 unimplemented:1 invalid_arguments:0 failed:1 listed:0
total: 0.05s; fill: 0.00s (0%); compute_ref: 0.00s (0%); compare: 0.00s (0%);
```
@rupakroyintel, oneDNN has no knowledge of an 8-int4-values packing scheme that is external to the library; it is an implementation detail outside oneDNN, and the zero-point group API is not designed for it. From the oneDNN perspective, you need to think about each value independently and use a single dimension in the groups. The observed benchdnn output is expected.
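To illustrate the point (a hedged sketch, not code from this PR; `attr` and the group size 32 are assumed from the repro above), the attribute-side configuration with a single non-trivial group dimension would look roughly like:

```cpp
// Hypothetical sketch: zero-point groups span only the K dimension
// (group size 32, per the repro above); the N-direction packing of
// 8 int4 values into one int32 stays outside the library.
dnnl::primitive_attr attr;
attr.set_zero_points(DNNL_ARG_WEIGHTS, /*mask=*/(1 << 0) + (1 << 1),
        /*groups=*/{32, 1}, dnnl::memory::data_type::s4);
```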
@dzarukin @vpirogov @shu1chen I have pushed the latest changes. The example passed on GPU. However, it failed on CPU. I have added the verbose log for the failing case:
```
ONEDNN_VERBOSE=all ./tutorials-matmul-int4-weight-decompression-cpp
...
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx10_1_512_amx_fp16,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:114
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx10_1_512_amx,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:114
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx10_1_512,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:114
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx512_core_bf16,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:114
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx512_core_vnni,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:114
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx512_core,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:114
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx2_vnni_2,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported isa,src/cpu/x64/matmul/brgemm_matmul.cpp:114
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx2_vnni,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:115
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:f32,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported datatype combination,src/cpu/matmul/gemm_f32_matmul.cpp:93
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:bf16,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported datatype combination,src/cpu/matmul/gemm_bf16_matmul.cpp:63
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit:bf16,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported datatype combination,src/cpu/matmul/gemm_bf16_matmul.cpp:63
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,gemm:jit,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported datatype combination,src/cpu/matmul/gemm_x8s8s32x_matmul.cpp:110
onednn_verbose,v1,primitive,create:dispatch,matmul,cpu,matmul,brg_matmul:avx2,undef,src:f16::blocked:ab::f0 wei:s4::blocked:ab::f0 dst:f16::blocked:ab::f0,attr-fpmath:f16:true attr-scales:wei:3:f16:48x1 attr-zero-points:wei:3:s4:24x1,,100x96:96x1000,unsupported datatype combination,src/cpu/x64/matmul/brgemm_matmul.cpp:115
oneDNN error caught:
        Status: unimplemented
        Message: could not create a primitive descriptor for the matmul primitive. Run workload with environment variable ONEDNN_VERBOSE=all to get additional diagnostic information.
Example failed on CPU.
```
@rupakroyintel Decompression is not implemented for AVX2.
I suggest updating the call to the primitive descriptor constructor to use the allow_empty argument and checking whether the resulting descriptor is empty; if it is, finish the example successfully with an "unsupported" message.
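A minimal sketch of that pattern (assuming the descriptor and memory-descriptor names from the sketch above; this is illustrative, not the PR's code):

```cpp
// Hypothetical sketch of the allow_empty pattern: the constructor returns
// an empty primitive descriptor instead of throwing when the configuration
// is unsupported, so the example can exit gracefully.
auto matmul_pd = dnnl::matmul::primitive_desc(
        eng, src_md, wei_md, dst_md, attr, /*allow_empty=*/true);
if (!matmul_pd) {
    std::cout << "int4 weight decompression is unsupported on this "
                 "engine/ISA; skipping." << std::endl;
    return 0; // finish the example successfully
}
```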
Since this datatype combination is unsupported on CPU, another option is to skip the test on CPU and only enable this example on GPU, similar to what is done in weights_decompression_matmul.cpp#L193-#L195:
```cpp
// CPU is not supported
if (engine_kind != engine::kind::gpu) return 0;
```
It would also be great to add links to this new example in the documentation: matmul.md and examples.md.
@vpirogov @dzarukin @shu1chen I have made the changes based on the reviews. Can you please check and approve the changes?
Closing as stale.