[Activation Quantization] Dynamic Per Token Support
Summary
- Add a `CompressedTensorsW8A8DynamicToken` scheme to support dynamic per-token activation quantization (see the sketch after this list)
- Update config parsing to support updates made to the `config.json` / quantization config provided with the model
- Update config parsing logic to pull in functionality from `compressed_tensors`; add `compressed_tensors` as a requirement
- Update/add logic for llama layer mappings when dealing with the `ignore` list (see the second sketch after this list)
- Update to use `QuantizationArgs` directly from `compressed_tensors`
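
For context, a minimal sketch of what dynamic per-token activation quantization does: int8 scales are computed from each token's activations at runtime rather than calibrated offline. The function names below are illustrative only, not the new scheme's actual API.

```python
# Illustrative sketch of dynamic per-token int8 activation quantization.
# Names are hypothetical; this is not the CompressedTensorsW8A8DynamicToken code.
import torch


def quantize_per_token(x: torch.Tensor):
    """Quantize activations to int8 with one scale per token (row).

    x: activations of shape [num_tokens, hidden_size].
    Returns the int8 tensor and per-token scales of shape [num_tokens, 1].
    """
    # Per-token dynamic range, taken from the activations at runtime.
    absmax = x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8)
    scale = absmax / 127.0
    x_q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return x_q, scale


def dequantize_per_token(x_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover a floating-point approximation of the original activations."""
    return x_q.to(scale.dtype) * scale


if __name__ == "__main__":
    x = torch.randn(4, 16)
    x_q, scale = quantize_per_token(x)
    print((dequantize_per_token(x_q, scale) - x).abs().max())
```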
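
And a hedged sketch of how an `ignore` list from the model's quantization config might be matched against llama layer names so those layers keep their unquantized weights. The config excerpt and helper are assumptions for illustration, not the exact format the updated parsing logic consumes.

```python
# Hedged sketch of applying a quantization config's ignore list to layer names.
# The config shape, quant_method value, and helper are assumptions for illustration.
from typing import Iterable

# Hypothetical excerpt of the quantization config shipped in config.json.
quantization_config = {
    "quant_method": "compressed-tensors",
    "ignore": ["lm_head"],
}


def is_ignored(layer_name: str, ignore: Iterable[str]) -> bool:
    """Return True if a layer should be skipped (kept unquantized)."""
    return any(layer_name == entry or layer_name.endswith(entry) for entry in ignore)


# Example: lm_head stays unquantized, attention projections get quantized.
print(is_ignored("lm_head", quantization_config["ignore"]))                           # True
print(is_ignored("model.layers.0.self_attn.q_proj", quantization_config["ignore"]))  # False
```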
TODO:
- Config naming mismatch between SparseML and vLLM