[WC] Align compression subgraphs for both weight input data types
Changes
When compression is applied to a model saved with FP32 weights, the resulting graph differs from the one produced when the input model is saved with FP16 weights. This PR aligns the two cases so that the compression subgraph is identical for both; the subgraph is shown below. Weight, scale and zero point are converted to FP32, and the Convert node after Multiply, which is present in the FP16 input case, is bypassed.
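For reference, a minimal sketch of such a decompression subgraph built with the OpenVINO Python API is shown below. The shapes, dtypes, opset version and node names are illustrative assumptions; this is not the exact graph NNCF emits.

```python
# Illustrative sketch of the decompression subgraph described above.
# Assumption: u8-compressed weight with per-channel scale and zero point.
import numpy as np
import openvino.runtime.opset13 as opset
from openvino.runtime import Model, Type

activation = opset.parameter([1, 16], Type.f32, name="activation")

# Compressed weight and its compression parameters, as stored in the IR.
weight = opset.constant(np.zeros((32, 16), dtype=np.uint8), name="compressed_weight")
zero_point = opset.constant(np.zeros((32, 1), dtype=np.uint8), name="zero_point")
scale = opset.constant(np.ones((32, 1), dtype=np.float16), name="scale")

# Weight, scale and zero point are all converted to f32, so the same
# subgraph is produced whether the source model was saved with f16 or
# f32 weights, and no extra Convert is needed after the Multiply.
w_f32 = opset.convert(weight, Type.f32)
zp_f32 = opset.convert(zero_point, Type.f32)
scale_f32 = opset.convert(scale, Type.f32)
decompressed = opset.multiply(opset.subtract(w_f32, zp_f32), scale_f32)

output = opset.matmul(activation, decompressed, transpose_a=False, transpose_b=True)
model = Model([output], [activation], "decompression_sketch")
```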
Codecov Report
Attention: Patch coverage is 0% with 7 lines in your changes missing coverage. Please review.
Project coverage is 77.93%. Comparing base (573b0c3) to head (4ce510a). Report is 1 commit behind head on develop.
Additional details and impacted files
```
@@             Coverage Diff              @@
##           develop    #2537       +/-   ##
============================================
- Coverage    90.87%   77.93%    -12.94%
============================================
  Files          494      494
  Lines        45612    45416      -196
============================================
- Hits         41449    35397     -6052
- Misses        4163    10019     +5856
```
Files | Coverage Δ | |
---|---|---|
.../algorithms/weight_compression/openvino_backend.py | 0.00% <0.00%> (-98.34%) | :arrow_down: |
... and 107 files with indirect coverage changes
Flag | Coverage Δ | |
---|---|---|
COMMON | ? | |
ONNX | ? | |
OPENVINO | ? | |
TENSORFLOW | 30.10% <0.00%> (ø) | |
TORCH | 65.96% <0.00%> (-0.01%) | :arrow_down: |
Flags with carried forward coverage won't be shown.
Components | Coverage Δ | |
---|---|---|
common | 88.28% <ø> (-5.47%) | :arrow_down: |
torch | 93.49% <ø> (-0.01%) | :arrow_down: |
tensorflow | 93.74% <ø> (+1.00%) | :arrow_up: |
onnx | 0.00% <ø> (-93.09%) | :arrow_down: |
openvino | 25.70% <0.00%> (-68.47%) | :arrow_down: |
ptq | 53.06% <0.00%> (-37.03%) | :arrow_down: |
The WC manual test fails until #2569 is merged.
Post-training weight compression test build 34 is green.
@alexsu52 @nikita-savelyevv
I've measured compression time and total time for different weight compression cases:

(measurements for the develop branch and the current PR are not reproduced here)
It seems that model inference takes almost twice as long on the validation dataset. Does this mean that the compressed model should be saved differently in tests and on the customer side? https://github.com/openvinotoolkit/nncf/blob/develop/tests/post_training/pipelines/lm_weight_compression.py#L174
@ljaljushkin Thanks for highlighting this!
The reason is that during compression with a group size, there is an additional Reshape node. In this PR, a Convert f16->f32 node is added after the scale Multiply node. If the Convert is placed before the Reshape node, performance drops. To fix this, I moved the Convert node after the Reshape node.
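A minimal sketch of the two placements for group-wise compression, assuming a hypothetical layout of 32 output channels, 16 input channels and group size 2 (all shapes, names and the opset version are illustrative, not the exact graph NNCF builds):

```python
# Group-wise decompression: groups are collapsed by a Reshape, and the
# position of the f16->f32 Convert relative to that Reshape is what the
# comment above discusses.
import numpy as np
import openvino.runtime.opset13 as opset
from openvino.runtime import Type

weight = opset.constant(np.zeros((32, 8, 2), dtype=np.uint8), name="compressed_weight")
zero_point = opset.constant(np.zeros((32, 8, 1), dtype=np.uint8), name="zero_point")
scale = opset.constant(np.ones((32, 8, 1), dtype=np.float16), name="scale")
out_shape = opset.constant(np.array([32, 16], dtype=np.int64))

# Dequantize in f16: (weight - zero_point) * scale, per group.
scaled_f16 = opset.multiply(
    opset.subtract(opset.convert(weight, Type.f16), opset.convert(zero_point, Type.f16)),
    scale,
)

# Variant with the performance drop: Convert to f32 before the groups are collapsed.
convert_then_reshape = opset.reshape(opset.convert(scaled_f16, Type.f32), out_shape, special_zero=False)

# Variant used after the fix: Reshape first, then a single Convert to f32.
reshape_then_convert = opset.convert(opset.reshape(scaled_f16, out_shape, special_zero=False), Type.f32)
```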
(Before/After subgraph screenshots are not reproduced here.)
With this, performance is maintained after changes in the PR:
Test case | Total time (develop branch) | Total time (PR branch) |
---|---|---|
tinyllama_data_free | 04:18 | 04:21 |
tinyllama_data_aware | 04:06 | 04:07 |
tinyllama_data_aware_awq | 03:33 | 03:39 |
tinyllama_data_aware_awq_stateful | 03:03 | 03:03 |
The post_training_weight_compression test build 42 is green. Waiting for the results of OV validation across different hardware.