[WeightCompression] Statistics caching
Changes
Add statistics saving and loading for the WeightCompression algorithm:
- Statistics are cached for all possible configurations, e.g. gptq=True, awq=True, scale_estimation=True, with all types of sensitivities.
- The statistics are then dumped to a file that can be reused for any weights_compression() configuration.
The example for tinyllama was updated with this functionality.
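The collect-once, reuse-everywhere workflow described above can be sketched as follows. This is an illustrative sketch only, not the actual NNCF API; the function and parameter names (`get_statistics`, `collect_fn`, `load_fn`, `dump_fn`) are hypothetical:

```python
import os

# Hypothetical caching workflow: collect statistics in one expensive pass,
# then reload them from disk for every subsequent compression configuration.
def get_statistics(model, dataset, cache_path, collect_fn, load_fn, dump_fn):
    if os.path.exists(cache_path):
        # Cache hit: skip the costly statistics collection entirely.
        return load_fn(cache_path)
    stats = collect_fn(model, dataset)  # one-time pass over the dataset
    dump_fn(stats, cache_path)
    return stats
```

Because the cached statistics cover all configurations (AWQ, GPTQ, scale estimation, all sensitivity types), trying a different compression setup only pays the load cost, not the collection cost.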
More changes:
- Align all statistics used in WeightCompression with TensorStatistics from nncf/experimental/common/tensor_statistics/statistics.py.
- Extend StatisticsAggregator with the logic for loading and saving statistics.
- Dump statistics using pickle and gzip; serialization methods were added for Tensor.
- Introduce StatisticsSerializer to handle loading/dumping statistics to a file.
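A minimal sketch of the pickle+gzip approach mentioned above. The class below mirrors the idea, not the actual NNCF StatisticsSerializer implementation; its method names and signatures are assumptions for illustration:

```python
import gzip
import pickle

# Illustrative pickle+gzip serializer (not the real NNCF class).
class StatisticsSerializer:
    @staticmethod
    def dump(statistics: dict, path: str) -> None:
        # gzip keeps the on-disk footprint of large tensor statistics
        # manageable (e.g. ~298 MB for tinyllama with subset size 128).
        with gzip.open(path, "wb") as f:
            pickle.dump(statistics, f)

    @staticmethod
    def load(path: str) -> dict:
        with gzip.open(path, "rb") as f:
            return pickle.load(f)
```

Usage is a simple round trip: dump the collected statistics once, then load them back for any later compression run.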
Statistics sizes
Model | Subset size | Statistics file size | Statistics collection time
---|---|---|---
tinyllama | 128 | 298 MB | 61 sec
Reason for changes
Speed up the search for the best compression configuration: statistics are collected once and reused across runs instead of being recollected for every configuration.
Related tickets
153129
Tests
Test coverage was extended with tests for StatisticsSerializer, StatisticsAggregator, and the WeightCompression algorithm with the proposed functionality.