HFQuantizer implementation for compressed-tensors library
This PR adds an HFQuantizer for the compressed-tensors library.
Supported quantization features include:
- FP8, INT4, INT8 (arbitrary precision is allowed for INT in the Q/DQ format)
- Activation quantization (static)
- Dynamic per-token activation quantization
- Quantization of arbitrary layer types
compressed-tensors supports running transformer models in a Q/DQ format and running compressed models within vLLM (running compressed within transformers is on the roadmap).
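For reference, a sketch of what the serialized quantization_config in a checkpoint's config.json could look like for a W4A16 group-128 scheme, shown here as a Python dict (key names follow the compressed-tensors format, but the exact fields and values are illustrative assumptions and may vary by version):

quantization_config = {
    "quant_method": "compressed-tensors",
    "format": "pack-quantized",  # on-disk format; assumed value for packed INT4 weights
    "config_groups": {
        "group_0": {
            "targets": ["Linear"],  # layer types this scheme applies to
            "weights": {
                "num_bits": 4,
                "type": "int",
                "symmetric": True,
                "strategy": "group",
                "group_size": 128,
            },
            "input_activations": None,  # no activation quantization in this scheme
        }
    },
}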
This initial PR includes HFQuantizer and QuantizationConfig implementations as well as a simple test. Documentation is being added.
To run with this branch, compressed-tensors currently needs to be installed from a development branch (it can be released to PyPI prior to landing this PR, pending support from the transformers team): pip install git+https://github.com/neuralmagic/compressed-tensors.git@rename_config.
Happy to provide any further information needed.
Sample model load:
from transformers import AutoModelForCausalLM
compressed_tensors_model = AutoModelForCausalLM.from_pretrained("nm-testing/tinyllama-oneshot-w4a16-group128-v3")
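For completeness, a minimal sketch of running the loaded model in its Q/DQ form, using only standard transformers generation APIs (the prompt is arbitrary and the model variable comes from the sample above):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("nm-testing/tinyllama-oneshot-w4a16-group128-v3")
inputs = tokenizer("Compressed tensors let you", return_tensors="pt")
# Weights are decompressed to a Q/DQ representation at load time, so generation
# runs through the standard transformers forward pass.
output_ids = compressed_tensors_model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))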
Suggested Reviewers: @SunMarc @younesbelkada
Hi @bfineran, thanks for contributing and sorry for the delay! From the PR, I see that we are decompressing the model after loading the quantized model in order to run it. What would it take to run the compressed model on transformers? I don't mind merging this first, but I want to make sure that we enable users to quantize models using the compressed-tensors library in the end. Happy to discuss more on how to collaborate together over Slack if you want!
Hi @SunMarc, right now running compressed is WIP - we've prioritized a very flexible Q/DQ environment to enable a wide range of quantization settings and will likely roll out running compressed scenario by scenario. I'll reach out over Slack to discuss more about the project and will also update with additional documentation soon.
Hi @SunMarc, I've updated to address your comments, specifically around the state dict load warnings and expanding the tests (note the second test case covers a sharded state dict).
Can you add the following to CompressedTensorsConfig? This way we can print the quantization config:
def to_diff_dict(self) -> Dict[str, Any]:
    """
    Removes all attributes from config which correspond to the default config attributes for better readability and
    serializes to a Python dictionary.

    Returns:
        `Dict[str, Any]`: Dictionary of all the attributes that make up this configuration instance,
    """
    config_dict = self.to_dict()

    # get the default config dict
    default_config_dict = CompressedTensorsConfig().to_dict()

    serializable_config_dict = {}

    # only serialize values that differ from the default config
    for key, value in config_dict.items():
        if value != default_config_dict[key]:
            serializable_config_dict[key] = value

    return serializable_config_dict
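This mirrors the to_diff_dict pattern on PretrainedConfig, so printing the config only surfaces non-default fields. A quick usage sketch, assuming CompressedTensorsConfig is constructible with defaults as in the snippet above:

config = CompressedTensorsConfig()
print(config.to_diff_dict())  # only the attributes that differ from the defaults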
Gentle ping @ArthurZucker
Can you rebase on main @Satrat? The failing tests are probably due to that.
Eagerly awaiting this! Great work @neuralmagic team ;)
@SunMarc done! Looks like the tests are passing now after rebasing
Very eagerly awaiting this merge. Thanks to everyone involved!
@ArthurZucker Thank you for your feedback! We've updated the compressed_tensors.md addressing the aforementioned points.