HFQuantizer implementation for compressed-tensors library
This PR adds an HFQuantizer for the compressed-tensors library.
Supported quantization features include:
- FP8, INT4, INT8 (arbitrary precision is allowed for INT in the Q/DQ format)
- Activation quantization (static)
- Dynamic per-token activation quantization
- Quantization of arbitrary layer types
compressed-tensors supports running transformer models in a Q/DQ format and running compressed models within vLLM (running compressed within transformers is on the roadmap).
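For reference, a sketch of what the serialized quantization_config in a checkpoint's config.json could look like for a W4A16 group-128 scheme, shown here as a Python dict (key names follow the compressed-tensors format, but the exact fields and values are illustrative assumptions and may vary by version):

quantization_config = {
    "quant_method": "compressed-tensors",
    "format": "pack-quantized",  # on-disk format; assumed value for packed INT4 weights
    "config_groups": {
        "group_0": {
            "targets": ["Linear"],  # layer types this scheme applies to
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": True,
                "strategy": "group",
                "group_size": 128,
            },
            "input_activations": None,  # no activation quantization in this scheme
        }
    },
}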
This initial PR includes HFQuantizer and QuantizationConfig implementations as well as a simple test. Documentation is being added.
To run with this branch, compressed-tensors currently needs to be installed from a development branch (it can be released to PyPI prior to landing this PR, pending support from the transformers team): pip install git+https://github.com/neuralmagic/compressed-tensors.git@rename_config.
Happy to provide any further information needed.
Sample model load:
from transformers import AutoModelForCausalLM
compressed_tensors_model = AutoModelForCausalLM.from_pretrained("nm-testing/tinyllama-oneshot-w4a16-group128-v3")
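For completeness, a minimal sketch of running the loaded model in its Q/DQ form, using only standard transformers generation APIs (the prompt is arbitrary and the model variable comes from the sample above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nm-testing/tinyllama-oneshot-w4a16-group128-v3")
inputs = tokenizer("Compressed tensors let you", return_tensors="pt")
# Weights are decompressed to a Q/DQ representation at load time, so generation
# runs through the standard transformers forward pass.
output_ids = compressed_tensors_model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))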
Suggested Reviewers: @SunMarc @younesbelkada
Hi @bfineran, thanks for contributing and sorry for the delay! From the PR, I see that we are decompressing the model after loading the quantized model in order to run it. What would it take to run the compressed model on transformers? I don't mind merging this first, but I want to make sure that we enable users to quantize models using the compressed-tensors library in the end. Happy to discuss more on how to collaborate together over Slack if you want!
Hi @SunMarc, right now running compressed is WIP - we've prioritized a very flexible Q/DQ environment to enable a wide range of quantization settings and will likely roll out running compressed scenario by scenario. I'll reach out over Slack to discuss more about the project and will also update with additional documentation soon.
Hi @SunMarc, I've updated to address your comments, specifically around the state dict load warnings and expanding the tests (note the second test case covers a sharded state dict).
Can you add the following to CompressedTensorsConfig? This way we can print the quantization config:
def to_diff_dict(self) -> Dict[str, Any]:
    """
    Removes all attributes from config which correspond to the default config attributes for better readability and
    serializes to a Python dictionary.

    Returns:
        `Dict[str, Any]`: Dictionary of all the attributes that make up this configuration instance,
    """
    config_dict = self.to_dict()

    # get the default config dict
    default_config_dict = CompressedTensorsConfig().to_dict()

    serializable_config_dict = {}

    # only serialize values that differ from the default config
    for key, value in config_dict.items():
        if value != default_config_dict[key]:
            serializable_config_dict[key] = value

    return serializable_config_dict
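This mirrors the to_diff_dict pattern on PretrainedConfig, so printing the config only surfaces non-default fields. A quick usage sketch, assuming CompressedTensorsConfig is constructible with defaults as in the snippet above:

config = CompressedTensorsConfig()
print(config.to_diff_dict())  # only the attributes that differ from the defaults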
Gentle ping @ArthurZucker
Can you rebase on main @Satrat? The failing tests are probably due to that.
Eagerly awaiting this! Great work @neuralmagic team ;)
@SunMarc done! Looks like the tests are passing now after rebasing
Very eagerly awaiting this merge. Thanks to everyone involved!
@ArthurZucker Thank you for your feedback! We've updated the compressed_tensors.md addressing the aforementioned points.