[Kernel] Initial Activation Quantization Support
Summary
- Initial support for Activation Quantization (specifically static per-tensor for W8A8)
- Adds `CompressedTensorsConfig` and `CompressedTensorsLinearMethod` to support models quantized through sparseml and saved through compressed-tensors
- Adds a new optional `layer_name` parameter to `create_weights`. The `layer_name` can be used to match the appropriate quantization scheme from the `CompressedTensorsConfig` for a given layer
- Adds a static per-tensor quant kernel (inspired by and refactored from https://github.com/vllm-project/vllm/pull/1508); see the sketch after this list
- Uses the nvidia-cutlass Python interface to invoke a fused GEMM+dequant kernel
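For reference, here is a minimal PyTorch sketch of the static per-tensor W8A8 math that the quant kernel and the fused GEMM+dequant implement. The function names, scale values, and shapes below are illustrative only (not the actual vLLM API), and the real kernel accumulates in int32 inside CUTLASS rather than emulating the integer matmul:

```python
import torch

def static_per_tensor_quantize(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Static quantization: the per-tensor scale is fixed ahead of time
    # (calibrated offline), not recomputed from the runtime activation range.
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

def w8a8_linear_reference(x: torch.Tensor,
                          weight_int8: torch.Tensor,
                          act_scale: float,
                          weight_scale: float) -> torch.Tensor:
    # Quantize activations with the static per-tensor scale.
    x_q = static_per_tensor_quantize(x, act_scale)
    # Integer GEMM, emulated in a wide integer type here; the fused CUTLASS
    # kernel does the int8 matmul with int32 accumulation and the dequant
    # in a single pass.
    acc = x_q.long() @ weight_int8.long().t()
    # Dequantize: a single multiply by the product of the two per-tensor scales.
    return acc.float() * (act_scale * weight_scale)

# Example: a 4x16 activation block against a 32x16 int8 weight.
x = torch.randn(4, 16)
w_q = torch.randint(-128, 128, (32, 16), dtype=torch.int8)
out = w8a8_linear_reference(x, w_q, act_scale=0.02, weight_scale=0.01)
print(out.shape)  # torch.Size([4, 32])
```

The key property of the static scheme is that the activation scale is calibrated offline (e.g. via sparseml) and stored in the checkpoint, so the kernel never has to scan the activation tensor for its range at runtime.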
From Neural Magic, Co-authored by @varun-sundar-rabindranath @robertgshaw2-neuralmagic
IMHO the `layer_name` approach is simple and effective, but it also adds complexity for model implementers. Ideally we should match the scheme automatically after the model is initialized (but before weight loading). In that case we need to keep all parameters as meta tensors (i.e. placeholders) until the weights are actually loaded; that way we can change the data type without worrying about memory footprint.
Per our slack discussion:
The plan is to refactor the `weight_loading` logic generically (separate from this PR) with a flow that looks like this:
```python
model = init_model(...)                    # parameters are meta tensors (no memory allocated)
for key, val in scheme:
    mod = find_module_by_name(model, key)  # locate the layer this scheme entry targets
    config_module(mod, val)                # re-type / replace its placeholder parameters
...
weight_loading(model, ckpt)                # materialize and load the real weights
```
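As a rough illustration only (the helpers in the plan above are placeholders, not existing vLLM functions), this is roughly what that flow could look like with torch's meta device:

```python
import torch
import torch.nn as nn

# Build the model on the meta device: parameters are shape/dtype placeholders
# with no memory allocated, so they can be re-typed per the quantization
# scheme before any real weights exist. (Requires a recent PyTorch that
# supports torch.device as a context manager.)
with torch.device("meta"):
    model = nn.Sequential(nn.Linear(4096, 4096), nn.Linear(4096, 4096))

def find_module_by_name(model: nn.Module, name: str) -> nn.Module:
    # Placeholder for the lookup step in the plan above.
    return dict(model.named_modules())[name]

def config_module(mod: nn.Module, scheme: dict) -> None:
    # Swap the placeholder parameter for one matching the scheme's dtype,
    # still on the meta device, so there is no memory cost.
    w = mod.weight
    mod.weight = nn.Parameter(
        torch.empty(w.shape, dtype=scheme["weight_dtype"], device="meta"),
        requires_grad=False,
    )

scheme = {"0": {"weight_dtype": torch.int8}, "1": {"weight_dtype": torch.int8}}
for key, val in scheme.items():
    config_module(find_module_by_name(model, key), val)

# weight_loading(model, ckpt) would then materialize real tensors (e.g. via
# nn.Module.to_empty followed by copying checkpoint values), which is where
# the actual memory gets allocated.
print(model[0].weight.dtype, model[0].weight.is_meta)  # torch.int8 True
```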
This is similar to how we do things in SparseML / HF. It would also address the current lack of memory savings for fp8.
@dsikka can you add some tests for the new functionality? Can any of the tests from #1508 be reused/adapted?
@bnellnm I added some quant kernel tests. We should definitely add some model tests.
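For anyone adapting tests from #1508, here is a hedged sketch of what a static per-tensor quant kernel test could look like. The kernel binding itself is left as a placeholder since its exact name is not assumed here; only the reference implementation and the comparison pattern are shown:

```python
import pytest
import torch

def ref_static_scaled_int8_quant(x: torch.Tensor, scale: float) -> torch.Tensor:
    # Reference static per-tensor quantization to check the CUDA kernel against.
    return torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)

@pytest.mark.parametrize("num_tokens,hidden_size", [(7, 512), (128, 4096)])
@pytest.mark.parametrize("scale", [0.01, 0.1])
def test_static_per_tensor_quant(num_tokens, hidden_size, scale):
    torch.manual_seed(0)
    x = torch.randn(num_tokens, hidden_size)

    expected = ref_static_scaled_int8_quant(x, scale)
    # Replace this with a call to the custom quant op added in this PR,
    # e.g. out = quant_op(x.cuda(), scale); the exact binding is intentionally
    # not assumed here.
    out = ref_static_scaled_int8_quant(x, scale)

    # Allow off-by-one differences from round-to-nearest-even vs.
    # round-half-away-from-zero in the kernel.
    assert torch.all((out.int() - expected.int()).abs() <= 1)
```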
Hi guys. Our team member who was previously responsible for the W8A8 PR https://github.com/vllm-project/vllm/pull/1508 has left the company, and since our development focus has shifted to LMDeploy, that PR will not be continued for the time being. We have recently developed support for SleekQuant on vLLM, a W4A8 scheme proposed by our team. Its precision is comparable to SmoothQuant, and its performance is better than both Marlin and SmoothQuant. We expect to release the source code and a technical report by the end of this month. Stay tuned. Cheers.
Thanks for removing the `layer_name` parameter until the weight refactor is ready, @dsikka