Scale estimation/rectification for int4 compression
Changes
Added scale estimation for weight compression, which minimizes the L2 error between the outputs of the original MatMul and the compressed one.
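Conceptually, the algorithm tunes the quantization scale so that the compressed layer reproduces the original MatMul output as closely as possible in the L2 sense. Below is a minimal NumPy sketch of that idea, not the actual NNCF implementation: the symmetric int4 quantizer, the max-based initial scale, and the grid-search bounds are all illustrative assumptions.

```python
import numpy as np

def quantize_int4_sym(weight, scale):
    """Illustrative symmetric int4 quantization: integer levels in [-8, 7],
    returned in dequantized (float) form."""
    q = np.clip(np.round(weight / scale), -8, 7)
    return q * scale

def estimate_scale(weight, activations, candidates=100):
    """Grid-search a multiplier for a naive max-based scale that minimizes
    the L2 error between the original and compressed MatMul outputs."""
    base_scale = np.abs(weight).max() / 7  # naive scale covering the max weight
    best_scale, best_err = base_scale, np.inf
    for factor in np.linspace(0.5, 1.0, candidates):
        scale = base_scale * factor
        err = np.linalg.norm(
            activations @ quantize_int4_sym(weight, scale)
            - activations @ weight
        )
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)
X = rng.standard_normal((8, 64)).astype(np.float32)
s = estimate_scale(W, X)
naive = np.linalg.norm(X @ quantize_int4_sym(W, np.abs(W).max() / 7) - X @ W)
tuned = np.linalg.norm(X @ quantize_int4_sym(W, s) - X @ W)
```

Since the search grid includes the naive scale itself (factor 1.0), the tuned error can never be worse than the naive one on the calibration activations.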
Reason for changes
Increases accuracy of models compressed to 4 bits.
Related tickets
CVS-129177
Tests
In progress
Codecov Report
Attention: Patch coverage is 8.36653%, with 230 lines in your changes missing coverage. Please review.
Project coverage is 29.95%. Comparing base (17a5b65) to head (f06095e). Report is 5 commits behind head on develop.
Additional details and impacted files
```
@@            Coverage Diff             @@
##           develop    #2549       +/- ##
============================================
- Coverage    91.19%   29.95%    -61.24%
============================================
  Files          493      494        +1
  Lines        45468    45775      +307
============================================
- Hits         41464    13713    -27751
- Misses        4004    32062    +28058
```
| Files | Coverage Δ |
|---|---|
| nncf/quantization/advanced_parameters.py | 84.06% <100.00%> (-7.91%) :arrow_down: |
| ...ntization/algorithms/weight_compression/backend.py | 0.00% <ø> (-100.00%) :arrow_down: |
| nncf/openvino/quantization/quantize_model.py | 0.00% <0.00%> (-61.30%) :arrow_down: |
| ...ion/algorithms/weight_compression/torch_backend.py | 0.00% <0.00%> (-84.11%) :arrow_down: |
| nncf/torch/quantization/quantize_model.py | 0.00% <0.00%> (-92.50%) :arrow_down: |
| nncf/quantization/quantize_model.py | 34.78% <12.50%> (-42.67%) :arrow_down: |
| ...ization/algorithms/weight_compression/algorithm.py | 0.00% <0.00%> (-96.49%) :arrow_down: |
| .../quantization/algorithms/weight_compression/awq.py | 0.00% <0.00%> (-93.34%) :arrow_down: |
| ...n/algorithms/weight_compression/weight_lowering.py | 0.00% <0.00%> (-97.71%) :arrow_down: |
| .../algorithms/weight_compression/openvino_backend.py | 0.00% <0.00%> (-98.34%) :arrow_down: |
| ... and 1 more | |
... and 319 files with indirect coverage changes
| Flag | Coverage Δ |
|---|---|
| COMMON | ? |
| ONNX | ? |
| OPENVINO | ? |
| TENSORFLOW | 29.95% <8.36%> (-0.16%) :arrow_down: |
| TORCH | ? |

Flags with carried forward coverage won't be shown.
| Components | Coverage Δ |
|---|---|
| common | 76.35% <ø> (-17.42%) :arrow_down: |
| torch | 0.01% <0.00%> (-93.59%) :arrow_down: |
| tensorflow | 93.74% <ø> (ø) |
| onnx | 0.00% <ø> (-93.07%) :arrow_down: |
| openvino | 0.00% <0.00%> (-94.19%) :arrow_down: |
| ptq | 15.26% <8.43%> (-74.80%) :arrow_down: |
lambada-openai

| model | precision | acc | ppl |
|---|---|---|---|
| stabilityai_stablelm-2-zephyr-1_6b | fp32 | 0.5925 | 6.3024 |
| stabilityai_stablelm-2-zephyr-1_6b | CompressWeightsModeINT4_SYM_r10_gs64_SensitivityMetricMAX_ACTIVATION_VARIANCE_awq_ffn_scale | 0.5696 | 7.4355 |
| stabilityai_stablelm-2-zephyr-1_6b | CompressWeightsModeINT4_SYM_r10_gs64_SensitivityMetricMAX_ACTIVATION_VARIANCE_awq_fnn | 0.5467 | 7.9706 |
| stabilityai_stablelm-2-zephyr-1_6b | int4_sym_r10_gs64_max_activation_variance | 0.5428 | 8.5844 |
| stabilityai_stablelm-3b-4e1t | fp16 | 0.7132 | 3.8192 |
| stabilityai_stablelm-3b-4e1t | CompressWeightsModeINT4_SYM_r10_gs64_SensitivityMetricMAX_ACTIVATION_VARIANCE_awq_ffn_scale | 0.6936 | 4.0961 |
| stabilityai_stablelm-3b-4e1t | int4_sym_r10_gs64_max_activation_variance | 0.685 | 4.324 |
| stabilityai_stablelm-3b-4e1t | CompressWeightsModeINT4_SYM_r10_gs64_SensitivityMetricMAX_ACTIVATION_VARIANCE_awq_fnn | 0.6798 | 4.4316 |
| stable-zephyr-3b-dpo | fp16 | 0.6099 | 6.7151 |
| stable-zephyr-3b-dpo | CompressWeightsModeINT4_SYM_r10_gs64_SensitivityMetricMAX_ACTIVATION_VARIANCE_awq_ffn_scale | 0.5921 | 7.0513 |
| stable-zephyr-3b-dpo | CompressWeightsModeINT4_SYM_r10_gs64_SensitivityMetricMAX_ACTIVATION_VARIANCE_awq_fnn | 0.5736 | 8.3502 |
| stable-zephyr-3b-dpo | int4_sym_r10_gs64_max_activation_variance | 0.5618 | 9.3011 |
| llama-2-7b-chat | fp16 | 0.7108 | 3.262 |
| llama-2-7b-chat | CompressWeightsModeINT4_SYM_r10_gs128_SensitivityMetricMAX_ACTIVATION_VARIANCE_awq_ffn_scale | 0.6911 | 3.5074 |
| llama-2-7b-chat | int4_sym_r10_gs128_max_activation_variance | 0.6885 | 3.5719 |
| llama-2-7b-chat | CompressWeightsModeINT4_SYM_r10_gs128_SensitivityMetricMAX_ACTIVATION_VARIANCE_awq_fnn | 0.6798 | 3.6947 |
| zephyr-7b-beta | fp16 | 0.7345 | 3.1783 |
| zephyr-7b-beta | CompressWeightsModeINT4_SYM_r10_gs128_SensitivityMetricMAX_ACTIVATION_VARIANCE_awq_ffn_scale | 0.7297 | 3.2551 |
| zephyr-7b-beta | CompressWeightsModeINT4_SYM_r10_gs128_SensitivityMetricMAX_ACTIVATION_VARIANCE_awq_fnn | 0.7074 | 3.4549 |
| zephyr-7b-beta | int4_sym_r10_gs128_max_activation_variance | 0.707 | 3.5021 |
The scale estimation algorithm doesn't work for group_size=-1 and fails without a clear message: in the short term, an explicit error about the unsupported parameter combination for scale estimation would be enough.
BTW, AWQ works fine with group_size=-1.
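The short-term fix suggested above could be a simple parameter guard that fails fast with a readable message instead of erroring deep inside the algorithm. A hypothetical sketch (the helper name and error text are illustrative, not NNCF code):

```python
def check_scale_estimation_supported(group_size: int) -> None:
    """Hypothetical guard: reject the unsupported per-channel mode with a
    clear error instead of a cryptic failure later in scale estimation."""
    if group_size == -1:
        raise ValueError(
            "Scale estimation does not support group_size=-1 (per-channel "
            "quantization); use a positive group size or disable scale "
            "estimation."
        )

check_scale_estimation_supported(64)  # a positive group size passes silently
```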