
LLM weight compression tool

Open andrey-churkin opened this issue 9 months ago • 12 comments

Changes

Add a script to automate the enumeration of compression parameters.

Supported backends for compression: optimum-cli, nncf
Supported backends for evaluation: lm_eval, wwb

Reason for changes

Related tickets

Ref: 160664

Tests

andrey-churkin avatar Feb 18 '25 08:02 andrey-churkin

As far as I understand, the main purpose of this tool is to return the top k compression parameters, sorted by increasing drop between the original and compressed models. Could you explain how to get such a list of compression parameters?

Yes, the main purpose of this tool is to automate the enumeration of compression parameters. The script saves a results.xlsx file that contains the following sheet (columns are subject to discussion).

[image: results.xlsx sheet with the evaluated compression configurations]

From this table, we can easily understand which parameters are suitable for our criteria. For all configurations, we save a file called optimum_cli_params.json (for the optimum-cli backend) that contains all the compression parameters that were used. For example, for the int4_r0.2_gs128_auto_awq configuration it contains the following parameters:

{
    "task": "text-generation",
    "trust_remote_code": true,
    "weight_format": "int4",
    "ratio": 0.2,
    "sym": false,
    "group_size": 128,
    "backup_precision": null,
    "dataset": "auto",
    "all_layers": false,
    "awq": true,
    "scale_estimation": false,
    "gptq": false,
    "lora_correction": false
}
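
For illustration, here is a minimal sketch of how such a params file could be turned into an optimum-cli export openvino command. The flag mapping and the helper below are assumptions for readability, not the script's actual code:

import json

# Hypothetical helper: build an optimum-cli command from a saved optimum_cli_params.json.
def build_optimum_cli_cmd(params_path, model_id, output_dir):
    with open(params_path) as f:
        p = json.load(f)
    cmd = ["optimum-cli", "export", "openvino", "--model", model_id, "--task", p["task"]]
    cmd += ["--weight-format", p["weight_format"], "--ratio", str(p["ratio"]),
            "--group-size", str(p["group_size"]), "--dataset", p["dataset"]]
    if p["trust_remote_code"]:
        cmd.append("--trust-remote-code")
    if p["sym"]:
        cmd.append("--sym")
    for flag in ("all_layers", "awq", "scale_estimation", "gptq", "lora_correction"):
        if p[flag]:
            cmd.append("--" + flag.replace("_", "-"))
    if p["backup_precision"] is not None:
        cmd += ["--backup-precision", p["backup_precision"]]
    cmd.append(output_dir)
    return cmd  # e.g. pass to subprocess.run(cmd, check=True)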

andrey-churkin avatar Feb 19 '25 15:02 andrey-churkin

I would suggest thinking about using statistics dumping to speed up model compression under different compression parameters.

Currently, only the optimum-cli backend is supported, and as far as I know, there is no way to dump statistics via it. However, I will take this proposal into account and use statistics caching for the NNCF backend.
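
For the NNCF backend, statistics caching would look roughly like the sketch below, based on nncf.compress_weights and nncf.AdvancedCompressionParameters(statistics_path=...); the concrete parameter values and paths are placeholders:

import nncf
import openvino as ov

model = ov.Core().read_model("<model_path>/openvino_model.xml")  # placeholder path

# Statistics are computed once, dumped to statistics_path, and reused on subsequent
# compress_weights() calls that vary the other hyperparameters.
compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    ratio=0.2,
    group_size=128,
    awq=True,
    dataset=nncf.Dataset(calibration_data),  # calibration_data is assumed to be prepared elsewhere
    advanced_parameters=nncf.AdvancedCompressionParameters(statistics_path="statistics_dir"),
)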

andrey-churkin avatar Feb 20 '25 07:02 andrey-churkin

You have divided the task into several steps. The first step is to compress the model with parameter sets from the grid and save one copy of the model per parameter set. The second step is validation. How do you propose to parallelize it? Do you really need a model copy for each set of compression parameters?

I save the model for only one reason: it is needed for validation. We can combine compression and validation into a single task. In this scenario, we probably don't need to save the models, and can save only the compression parameters/metrics. I am open to any suggestions here, and we can discuss and select the best way.

Regarding parallelization, I think one way is to execute several tasks at the same time. A task can contain either only the compression step or both the compression and validation steps. Or we can trigger validation only when a compression step is finished (for some configuration). If you have any other suggestions or ideas, please let me know.
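
As a rough illustration of the task-based option, with compress_config() and validate() as hypothetical helpers:

from concurrent.futures import ProcessPoolExecutor

# Hypothetical helpers: compress_config() runs compression for one parameter set and
# returns the output directory; validate() runs lm_eval/wwb on it and returns metrics.
def run_task(config):
    model_dir = compress_config(config)    # compression step
    metrics = validate(model_dir, config)  # validation step
    return config["name"], metrics

def run_all(configs, max_workers=2):
    # Each task bundles compression and validation, so intermediate model copies
    # do not need to be kept once the metrics are collected.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_task, configs))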

andrey-churkin avatar Feb 20 '25 08:02 andrey-churkin

@nikita-malininn @nikita-savelyevv @ljaljushkin @andreyanufr Guys, please have a look. It is still a raw version, but any feedback will be valuable.

andrey-churkin avatar Feb 20 '25 08:02 andrey-churkin

@nikita-malininn @nikita-savelyevv @ljaljushkin @andreyanufr Guys, please have a look. It is still a raw version, but any feedback will be valuable.

@andrey-churkin

  1. Is tools/llm_weight_compression/config_optimum_lm_eval.json the default configuration? Maybe it is worth adding scale estimation and a ratio in a more optimal range [0.7, 0.8, 0.9, 1.0]?
  2. Do you have plans to add an example without optimum-cli?

andreyanufr avatar Feb 20 '25 09:02 andreyanufr

I would suggest thinking about using statistics dumping to speed up model compression under different compression parameters.

Currently, only the optimum-cli backend is supported, and as far as I know, there is no way to dump statistics via it. However, I will take this proposal into account and use statistics caching for the NNCF backend.

There is a way to make it work without additional changes to optimum-intel:

import nncf
from optimum.intel import OVConfig, OVModelForCausalLM, OVQuantizer, OVWeightQuantizationConfig

# Load the model without 8-bit weight compression, then compress to 4 bits,
# dumping/reusing the collected statistics via statistics_path.
model = OVModelForCausalLM.from_pretrained(<model_id or model_path>, load_in_8bit=False)
OVQuantizer(model).quantize(
    ov_config=OVConfig(quantization_config=OVWeightQuantizationConfig(bits=4, ...)),
    advanced_parameters=nncf.AdvancedCompressionParameters(statistics_path=<statistics_path>),
    save_directory=<save_directory>
)

nikita-savelyevv avatar Feb 20 '25 10:02 nikita-savelyevv

Also, it would be convenient if it was possible to run compression / evaluation steps separately if needed.

nikita-savelyevv avatar Feb 20 '25 11:02 nikita-savelyevv

Also, it would be convenient if it was possible to run compression / evaluation steps separately if needed.

In certain situations, it may be preferable not to re-compress models, but rather to use the existing ones and test them on different datasets. For local testing, it would also be helpful to have an option to view the currently compressed models with their respective parameters and versions (nncf, ov, optimum, transformers, torch...) and a way to select some of them for validation.
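
A small sketch of what such an overview could look like, assuming each compressed model directory keeps its optimum_cli_params.json next to the model files (the directory layout and the version handling are assumptions):

import json
from importlib.metadata import version
from pathlib import Path

def list_compressed_models(root):
    # Versions of the current environment; the tool could instead dump them at compression time.
    env = {pkg: version(pkg) for pkg in ("nncf", "openvino", "optimum", "transformers", "torch")}
    rows = []
    for params_file in Path(root).glob("*/optimum_cli_params.json"):
        rows.append({
            "model_dir": str(params_file.parent),
            "params": json.loads(params_file.read_text()),
            "env": env,
        })
    return rows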

ljaljushkin avatar Feb 20 '25 16:02 ljaljushkin

I would suggest thinking about using statistics dumping to speed up model compression under different compression parameters.

Currently, only the optimum-cli backend is supported, and as far as I know, there is no way to dump statistics via it. However, I will take this proposal into account and use statistics caching for the NNCF backend.

I would suggest thinking about adding statistics caching to optimum-cli. If you come up with a proposal on this matter, it would be a good first input. cc @nikita-savelyevv, @AlexKoff88

alexsu52 avatar Feb 21 '25 06:02 alexsu52

Great tool!!

One idea for improvement I have is to also take the resulting model's inference speed into consideration. There is always a trade-off between accuracy and performance, so for every model the task is to find the fastest configuration that stays within the acceptable accuracy drop.

As a first step we could add a performance measurement step after compression with llm_bench and add first- and second-token latency measurements as columns in the resulting table.

Further down the line, I can imagine some kind of "smart" parameter search based on this, because formally speaking what we have here is a min-max optimization task: we minimize latency while maximizing accuracy. The problem is that different parameters have a different impact on the target metrics, so it is not straightforward to me how exactly this could be done. Maybe some heuristics will have to be added. In any case, this is just an idea for future improvements, not for right now.
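
To sketch that first step, latency numbers could simply be joined onto the results table as extra columns. measure_latency below stands in for a wrapper around llm_bench, and the column names are assumptions:

import pandas as pd

def add_latency_columns(results_xlsx, model_dirs, measure_latency):
    # measure_latency(model_dir) -> (first_token_ms, second_token_ms)
    df = pd.read_excel(results_xlsx)
    latencies = {d: measure_latency(d) for d in model_dirs}
    df["first_token_ms"] = df["model_dir"].map(lambda d: latencies[d][0])
    df["second_token_ms"] = df["model_dir"].map(lambda d: latencies[d][1])
    df.to_excel(results_xlsx, index=False)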

I agree that trying all parameter combinations without applying any performance heuristics at the initial stage will significantly increase the time spent searching for compression parameters.

@andrey-churkin , just to clarify, is the idea to specify the order of experiments in the config?

alexsu52 avatar Feb 21 '25 07:02 alexsu52


As far as I understand, the main purpose of this tool is to return the top k compression parameters, sorted by increasing drop between the original and compressed models. Could you explain how to get such a list of compression parameters?

Yes, the main purpose of this tool is to automate the enumeration of compression parameters. The script saves a results.xlsx file that contains the following sheet (columns are subject to discussion).

Thanks for the explanation. If you want to find compression parameters that satisfy a given accuracy drop, you do not need to check all sets of compression parameters; you should select the compression parameters with the best performance. Thus I would suggest introducing "max_accuracy_drop" as an early stopper for the search process.
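
A minimal sketch of that idea, with compress_and_evaluate() as a hypothetical helper and the configurations assumed to be ordered from fastest to slowest:

def search(configs, baseline_accuracy, max_accuracy_drop):
    # Walk the configurations in order of expected performance (fastest first) and
    # stop at the first one whose accuracy drop fits within the allowed budget.
    for config in configs:
        accuracy = compress_and_evaluate(config)  # hypothetical helper
        drop = baseline_accuracy - accuracy
        if drop <= max_accuracy_drop:
            return config, drop
    return None, None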

alexsu52 avatar Feb 21 '25 11:02 alexsu52

You have divided the task into several steps. The first step is to compress the model with parameter sets from the grid and save one copy of the model per parameter set. The second step is validation. How do you propose to parallelize it? Do you really need a model copy for each set of compression parameters?

I save the model for only one reason: it is needed for validation. We can combine compression and validation into a single task. In this scenario, we probably don't need to save the models, and can save only the compression parameters/metrics. I am open to any suggestions here, and we can discuss and select the best way.

Regarding parallelization, I think one way is to execute several tasks at the same time. A task can contain either only the compression step or both the compression and validation steps. Or we can trigger validation only when a compression step is finished (for some configuration). If you have any other suggestions or ideas, please let me know.

I didn't understand: are you suggesting that parallelization be controlled at the config level, or will it be implemented inside the tool?

alexsu52 avatar Feb 21 '25 11:02 alexsu52