[Activation Quantization] Dynamic Per Token Support

Open dsikka opened this issue 1 year ago • 0 comments

Summary

  • Add a CompressedTensorsW8A8DynamicToken scheme to support dynamic per-token activation quantization
  • Update config parsing to support updates made to the config.json / quantization config provided with the model
  • Update config parsing logic to pull in functionality from compressed_tensors; add in compressed_tensors as a requirement
  • Update/add in logic for llama layer mappings when dealing with the ignore list
  • Update to use QuantizationArgs directly from `compressed_tensors`
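
For readers unfamiliar with the term, dynamic per-token activation quantization computes a separate scale for each token (each row of the activation matrix) at runtime, rather than using a static scale calibrated ahead of time. The following is a minimal pure-Python sketch of that idea, assuming symmetric int8 quantization. It is not the code from this PR: the function names `quantize_per_token` and `dequantize_per_token` are hypothetical and are not part of vllm or compressed_tensors.

```python
from typing import List, Tuple

INT8_MAX = 127  # symmetric int8 range is [-127, 127]


def quantize_per_token(
    activations: List[List[float]],
) -> Tuple[List[List[int]], List[float]]:
    """Quantize each token (row) to int8 with its own runtime scale.

    Hypothetical helper for illustration only.
    """
    quantized: List[List[int]] = []
    scales: List[float] = []
    for token in activations:
        # The scale is derived from this token's values alone ("dynamic").
        absmax = max((abs(v) for v in token), default=0.0)
        scale = absmax / INT8_MAX if absmax > 0 else 1.0
        # Round to the nearest integer and clamp to the int8 range.
        row = [max(-INT8_MAX, min(INT8_MAX, round(v / scale))) for v in token]
        quantized.append(row)
        scales.append(scale)
    return quantized, scales


def dequantize_per_token(
    quantized: List[List[int]], scales: List[float]
) -> List[List[float]]:
    """Map int8 values back to floats using the per-token scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Two tokens with very different magnitudes each get their own scale.
acts = [[0.5, -1.0, 0.25], [10.0, 20.0, -40.0]]
q, s = quantize_per_token(acts)
restored = dequantize_per_token(q, s)
```

Because each token has its own scale, a token with large outliers does not reduce the precision available to the other tokens in the batch.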

TODO:

  • config naming issue between sparseml and vllm

dsikka, May 06 '24 18:05