[WC] Align compression subgraphs for both weight input data types
Changes
When compression is applied to a model saved with FP32 weights, the resulting graph differs from the one produced when the input model is saved with FP16 weights. This PR aligns the two cases so that the compression subgraph is identical for both; the subgraph is shown below. Weight, scale and zero point are converted to FP32, and the Convert node after Multiply, which is present in the FP16 input case, is bypassed.
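For reference, a minimal sketch of such a decompression subgraph built with the OpenVINO Python API is shown below. The shapes, dtypes, opset version and node names are illustrative assumptions; this is not the exact graph NNCF emits.

```python
# Illustrative sketch of the decompression subgraph described above.
# Assumption: u8-compressed weight with per-channel scale and zero point.
import numpy as np
import openvino.runtime.opset13 as opset
from openvino.runtime import Model, Type

activation = opset.parameter([1, 16], Type.f32, name="activation")

# Compressed weight and its compression parameters, as stored in the IR.
weight = opset.constant(np.zeros((32, 16), dtype=np.uint8), name="compressed_weight")
zero_point = opset.constant(np.zeros((32, 1), dtype=np.uint8), name="zero_point")
scale = opset.constant(np.ones((32, 1), dtype=np.float16), name="scale")

# Weight, scale and zero point are all converted to f32, so the same
# subgraph is produced whether the source model was saved with f16 or
# f32 weights, and no extra Convert is needed after the Multiply.
w_f32 = opset.convert(weight, Type.f32)
zp_f32 = opset.convert(zero_point, Type.f32)
scale_f32 = opset.convert(scale, Type.f32)
decompressed = opset.multiply(opset.subtract(w_f32, zp_f32), scale_f32)

output = opset.matmul(activation, decompressed, transpose_a=False, transpose_b=True)
model = Model([output], [activation], "decompression_sketch")
```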
Codecov Report
Attention: Patch coverage is 0% with 7 lines in your changes missing coverage. Please review.
Project coverage is 77.93%. Comparing base (573b0c3) to head (4ce510a). Report is 1 commit behind head on develop.
Additional details and impacted files
```
@@             Coverage Diff              @@
##           develop    #2537       +/-   ##
============================================
- Coverage    90.87%   77.93%    -12.94%
============================================
  Files          494      494
  Lines        45612    45416      -196
============================================
- Hits         41449    35397     -6052
- Misses        4163    10019     +5856
```
Files | Coverage Δ | |
---|---|---|
.../algorithms/weight_compression/openvino_backend.py | 0.00% <0.00%> (-98.34%) | :arrow_down: |
... and 107 files with indirect coverage changes
Flag | Coverage Δ | |
---|---|---|
COMMON | ? | |
ONNX | ? | |
OPENVINO | ? | |
TENSORFLOW | 30.10% <0.00%> (ø) | |
TORCH | 65.96% <0.00%> (-0.01%) | :arrow_down: |
Flags with carried forward coverage won't be shown.
Components | Coverage Δ | |
---|---|---|
common | 88.28% <ø> (-5.47%) | :arrow_down: |
torch | 93.49% <ø> (-0.01%) | :arrow_down: |
tensorflow | 93.74% <ø> (+1.00%) | :arrow_up: |
onnx | 0.00% <ø> (-93.09%) | :arrow_down: |
openvino | 25.70% <0.00%> (-68.47%) | :arrow_down: |
ptq | 53.06% <0.00%> (-37.03%) | :arrow_down: |
The WC manual test fails until #2569 is merged.
Post-training weight compression test build 34 is green.
@alexsu52 @nikita-savelyevv
I've measured compression time and total time for different weight compression cases:

(measurements for the develop branch and the current PR are not reproduced here)
It seems that model inference takes almost twice as long on the validation dataset. Does this mean that the compressed model should be saved differently in tests and on the customer side? https://github.com/openvinotoolkit/nncf/blob/develop/tests/post_training/pipelines/lm_weight_compression.py#L174
@ljaljushkin Thanks for highlighting this!
The reason is that during compression with a group size, there is an additional Reshape node. In this PR, a Convert f16->f32 node is added after the scale Multiply node. If the Convert is placed before the Reshape node, performance drops. To fix this, I moved the Convert node after the Reshape node.
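A minimal sketch of the two placements for group-wise compression, assuming a hypothetical layout of 32 output channels, 16 input channels and group size 2 (all shapes, names and the opset version are illustrative, not the exact graph NNCF builds):

```python
# Group-wise decompression: groups are collapsed by a Reshape, and the
# position of the f16->f32 Convert relative to that Reshape is what the
# comment above discusses.
import numpy as np
import openvino.runtime.opset13 as opset
from openvino.runtime import Type

weight = opset.constant(np.zeros((32, 8, 2), dtype=np.uint8), name="compressed_weight")
zero_point = opset.constant(np.zeros((32, 8, 1), dtype=np.uint8), name="zero_point")
scale = opset.constant(np.ones((32, 8, 1), dtype=np.float16), name="scale")
out_shape = opset.constant(np.array([32, 16], dtype=np.int64))

# Dequantize in f16: (weight - zero_point) * scale, per group.
scaled_f16 = opset.multiply(
    opset.subtract(opset.convert(weight, Type.f16), opset.convert(zero_point, Type.f16)),
    scale,
)

# Variant with the performance drop: Convert to f32 before the groups are collapsed.
convert_then_reshape = opset.reshape(opset.convert(scaled_f16, Type.f32), out_shape, special_zero=False)

# Variant used after the fix: Reshape first, then a single Convert to f32.
reshape_then_convert = opset.convert(opset.reshape(scaled_f16, out_shape, special_zero=False), Type.f32)
```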
(Before/After subgraph screenshots are not reproduced here.)
With this, performance is maintained after changes in the PR:
Test case | Total time (develop branch) | Total time (PR branch) |
---|---|---|
tinyllama_data_free | 04:18 | 04:21 |
tinyllama_data_aware | 04:06 | 04:07 |
tinyllama_data_aware_awq | 03:33 | 03:39 |
tinyllama_data_aware_awq_stateful | 03:03 | 03:03 |
The post_training_weight_compression test build 42 is green. Waiting for the results of OV validation across different hardware.