[Kernel] Initial Activation Quantization Support
Summary
- Initial support for Activation Quantization (specifically static per-tensor for W8A8)
- Adds `CompressedTensorsConfig` and `CompressedTensorsLinearMethod` to support models quantized through sparseml and saved through compressed-tensors
- Adds a new optional `layer_name` parameter to `create_weights`. The `layer_name` can be used to match the appropriate quantization scheme from the `CompressedTensorsConfig` for a given layer
- Adds a static per-tensor quant kernel (inspired by and refactored from https://github.com/vllm-project/vllm/pull/1508); see the sketch after this list
- Uses the nvidia-cutlass Python interface to invoke a fused GEMM+dequant kernel
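For reference, here is a minimal PyTorch sketch of the static per-tensor W8A8 math that the quant kernel and the fused GEMM+dequant implement. The function names, scale values, and shapes below are illustrative only (not the actual vLLM API), and the real kernel accumulates in int32 inside CUTLASS rather than emulating the integer matmul:

```python
import torch

def static_per_tensor_quantize(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Static quantization: the per-tensor scale is fixed ahead of time
    # (calibrated offline), not recomputed from the runtime activation range.
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

def w8a8_linear_reference(x: torch.Tensor,
                          weight_int8: torch.Tensor,
                          act_scale: float,
                          weight_scale: float) -> torch.Tensor:
    # Quantize activations with the static per-tensor scale.
    x_q = static_per_tensor_quantize(x, act_scale)
    # Integer GEMM, emulated in a wide integer type here; the fused CUTLASS
    # kernel does the int8 matmul with int32 accumulation and the dequant
    # in a single pass.
    acc = x_q.long() @ weight_int8.long().t()
    # Dequantize: a single multiply by the product of the two per-tensor scales.
    return acc.float() * (act_scale * weight_scale)

# Example: a 4x16 activation block against a 32x16 int8 weight.
x = torch.randn(4, 16)
w_q = torch.randint(-128, 128, (32, 16), dtype=torch.int8)
out = w8a8_linear_reference(x, w_q, act_scale=0.02, weight_scale=0.01)
print(out.shape)  # torch.Size([4, 32])
```

The key property of the static scheme is that the activation scale is calibrated offline (e.g. via sparseml) and stored in the checkpoint, so the kernel never has to scan the activation tensor for its range at runtime.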
From Neural Magic, Co-authored by @varun-sundar-rabindranath @robertgshaw2-neuralmagic
IMHO the `layer_name` approach is simple and effective, but it also adds complexity for model implementers. Ideally we should match the scheme automatically after the model is initialized (but before weight loading). In that case we need to keep all parameters as meta tensors (i.e. placeholders) until the weights are actually loaded; that way we can change the data type without worrying about memory footprint.
Per our slack discussion:
The plan is to refactor the `weight_loading` logic generically (separate from this PR) with a flow that looks like this:
```python
model = init_model(...)                    # parameters are meta tensors (no memory allocated)
for key, val in scheme:
    mod = find_module_by_name(model, key)  # locate the layer this scheme entry targets
    config_module(mod, val)                # re-type / replace its placeholder parameters
...
weight_loading(model, ckpt)                # materialize and load the real weights
```
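As a rough illustration only (the helpers in the plan above are placeholders, not existing vLLM functions), this is roughly what that flow could look like with torch's meta device:

```python
import torch
import torch.nn as nn

# Build the model on the meta device: parameters are shape/dtype placeholders
# with no memory allocated, so they can be re-typed per the quantization
# scheme before any real weights exist. (Requires a recent PyTorch that
# supports torch.device as a context manager.)
with torch.device("meta"):
    model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))

def find_module_by_name(model: nn.Module, name: str) -> nn.Module:
    # Placeholder for the lookup step in the plan above.
    return dict(model.named_modules())[name]

def config_module(mod: nn.Module, scheme: dict) -> None:
    # Swap the placeholder parameter for one matching the scheme's dtype,
    # still on the meta device, so there is no memory cost.
    w = mod.weight
    mod.weight = nn.Parameter(
        torch.empty(w.shape, dtype=scheme["weight_dtype"], device="meta"),
        requires_grad=False,
    )

scheme = {"0": {"weight_dtype": torch.int8}, "1": {"weight_dtype": torch.int8}}
for key, val in scheme.items():
    config_module(find_module_by_name(model, key), val)

# weight_loading(model, ckpt) would then materialize real tensors (e.g. via
# nn.Module.to_empty followed by copying checkpoint values), which is where
# the actual memory gets allocated.
print(model[0].weight.dtype, model[0].weight.is_meta)  # torch.int8 True
```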
This is similar to how we do things in SparseML / HF. It would also address the current lack of memory savings for fp8.
@dsikka can you add some tests for the new functionality? Can any of the tests from #1508 be reused/adapted?
@bnellnm I added some quant kernel tests. We should definitely add some model tests.
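For anyone adapting tests from #1508, here is a hedged sketch of what a static per-tensor quant kernel test could look like. The kernel binding itself is left as a placeholder since its exact name is not assumed here; only the reference implementation and the comparison pattern are shown:

```python
import pytest
import torch

def ref_static_scaled_int8_quant(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Reference static per-tensor quantization to check the CUDA kernel against.
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

@pytest.mark.parametrize("num_tokens,hidden_size", [(7, 512), (128, 4096)])
@pytest.mark.parametrize("scale", [0.01, 0.1])
def test_static_per_tensor_quant(num_tokens, hidden_size, scale):
    torch.manual_seed(0)
    x = torch.randn(num_tokens, hidden_size)

    expected = ref_static_scaled_int8_quant(x, scale)
    # Replace this with a call to the custom quant op added in this PR,
    # e.g. out = quant_op(x.cuda(), scale); the exact binding is intentionally
    # not assumed here.
    out = ref_static_scaled_int8_quant(x, scale)

    # Allow off-by-one differences from round-to-nearest-even vs.
    # round-half-away-from-zero in the kernel.
    assert torch.all((out.int() - expected.int()).abs() <= 1)
```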
Hi guys. Our team member who was previously responsible for the W8A8 PR https://github.com/vllm-project/vllm/pull/1508 has left the company, and since our development focus has shifted to LMDeploy, that PR will not be continued for the time being. We have recently developed support for SleekQuant on vLLM, a W4A8 scheme proposed by our team. Its precision is comparable to SmoothQuant, and its performance is better than both Marlin and SmoothQuant. We expect to release the source code and a technical report by the end of this month. Stay tuned. Cheers.
Thanks for removing the `layer_name` parameter until the weight refactor is ready, @dsikka