[Activation Quantization] Dynamic Per Token Support
Summary
- Add a `CompressedTensorsW8A8DynamicToken` scheme to support dynamic per-token activation quantization (see the sketch after this list)
- Update config parsing to support updates made to the `config.json` / quantization config provided with the model
- Update config parsing logic to pull in functionality from `compressed_tensors`; add `compressed_tensors` as a requirement
- Update/add logic for llama layer mappings when dealing with the `ignore` list (see the second sketch after this list)
- Update to use `QuantizationArgs` directly from `compressed_tensors`
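
For context, a minimal sketch of what dynamic per-token activation quantization does: int8 scales are computed from each token's activations at runtime rather than calibrated offline. The function names below are illustrative only, not the new scheme's actual API.

```python
# Illustrative sketch of dynamic per-token int8 activation quantization.
# Names are hypothetical; this is not the CompressedTensorsW8A8DynamicToken code.
import torch


def quantize_per_token(x: torch.Tensor):
    """Quantize activations to int8 with one scale per token (row).

    x: activations of shape [num_tokens, hidden_size].
    Returns the int8 tensor and per-token scales of shape [num_tokens, 1].
    """
    # Per-token dynamic range, taken from the activations at runtime.
    absmax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8)
    scale = absmax / 127.0
    x_q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return x_q, scale


def dequantize_per_token(x_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover a floating-point approximation of the original activations."""
    return x_q.to(scale.dtype) * scale


if __name__ == "__main__":
    x = torch.randn(4, 16)
    x_q, scale = quantize_per_token(x)
    print((dequantize_per_token(x_q, scale) - x).abs().max())
```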
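
And a hedged sketch of how an `ignore` list from the model's quantization config might be matched against llama layer names so those layers keep their unquantized weights. The config excerpt and helper are assumptions for illustration, not the exact format the updated parsing logic consumes.

```python
# Hedged sketch of applying a quantization config's ignore list to layer names.
# The config shape, quant_method value, and helper are assumptions for illustration.
from typing import Iterable

# Hypothetical excerpt of the quantization config shipped in config.json.
quantization_config = {
    "quant_method": "compressed-tensors",
    "ignore": ["lm_head"],
}


def is_ignored(layer_name: str, ignore: Iterable[str]) -> bool:
    """Return True if a layer should be skipped (kept unquantized)."""
    return any(layer_name == entry or layer_name.endswith(entry) for entry in ignore)


# Example: lm_head stays unquantized, attention projections get quantized.
print(is_ignored("lm_head", quantization_config["ignore"]))                           # True
print(is_ignored("model.layers.0.self_attn.q_proj", quantization_config["ignore"]))  # False
```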
TODO:
- Config naming mismatch between SparseML and vLLM