[WIP] SmoothQuant using tensor subclassing
Still WIP
The implementation of SmoothQuant with tensor subclassing (AffineQuantizedTensor) is similar to that of AWQ, with the following differences:
- SmoothQuant supports both static and dynamic quantization of activations, while AWQ only uses dynamic quantization
- The smoothing factor is calculated differently from AWQ's equalization scales (see the sketch below)
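For reference, a minimal sketch of how the smoothing factor can be computed, following the formula from the SmoothQuant paper (the helper name and the observer interface here are illustrative, not torchao's actual API):

import torch

def compute_smoothing_factor(act_abs_max: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    # act_abs_max: per-input-channel max(|X|) collected by an observer, shape (in_features,)
    # weight: linear weight, shape (out_features, in_features)
    w_abs_max = weight.abs().amax(dim=0)  # per-input-channel max(|W|)
    eps = torch.finfo(torch.float32).eps
    # SmoothQuant: s_j = max(|X_j|)^alpha / max(|W_j|)^(1 - alpha)
    return act_abs_max.clamp(min=eps).pow(alpha) / w_abs_max.clamp(min=eps).pow(1 - alpha)

# At conversion time the weight columns are multiplied by s and the activation is
# divided by s, so linear(x / s, w * s) == linear(x, w) in exact arithmetic.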
Hi @jerryzh168 I added a new tensor subclass LinearActivationScaleQuantizedTensor to support x -> x/scale -> quantize x for torch.compile.
If I use LinearActivationQuantizedTensor, the x/scale is done outside the class (by input_quant_func) and there is a dynamo error about scale during torch.compile. I guess it's because the scale tensors are not in the graph in this case. Putting the scale in the weight tensor solves the problem. And WeightTensorWithLinearActivationScaleMetadata does not quantize the activation.
Do you have any concerns about adding this new class? Thanks.
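Roughly, the idea looks like this hypothetical sketch (class, attribute, and method names are illustrative only, not the actual LinearActivationScaleQuantizedTensor in this PR): keeping the scale inside the weight wrapper makes the division part of the linear dispatch, so the scale tensor is captured in the compiled graph.

import torch
import torch.nn.functional as F

class ScaledActivationWeight(torch.Tensor):
    # hypothetical wrapper that stores the smoothing scale next to the
    # (possibly already-quantized) weight
    @staticmethod
    def __new__(cls, weight, act_scale):
        return torch.Tensor._make_wrapper_subclass(cls, weight.shape, dtype=weight.dtype, device=weight.device)

    def __init__(self, weight, act_scale):
        self.weight = weight
        self.act_scale = act_scale

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is F.linear:
            x, w = args[0], args[1]
            bias = args[2] if len(args) > 2 else kwargs.get("bias")
            # x / scale happens inside dispatch, so it is traced together with the linear op
            return F.linear(x / w.act_scale, w.weight, bias)
        with torch._C.DisableTorchFunctionSubclass():
            return func(*args, **kwargs)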
I think in this case we should be composing WeightTensorWithLinearActivationScaleMetadata and LinearActivationQuantizedTensor together, i.e.
weight = to_affine_quantized(float_weight, ...)
# this will quantize input
# use https://github.com/pytorch/ao/blob/c87cc9b7286a46e9dfc076fa2417eb9b64ccc807/torchao/quantization/weight_tensor_linear_activation_quantization.py#L13 for static quantization
weight = to_linear_activation_quantized_tensor(weight) # dynamic quant
# this will do x / scale
weight = to_weight_tensor_with_linear_activation_scale_metadata(weight)
at dispatch time, we first unwrap the outermost tensor subclass, which will be WeightTensorWithLinearActivationScaleMetadata, so we'll apply the scale to the activation, then LinearActivationQuantizedTensor, which will quantize the activation, and then AffineQuantizedTensor
would this work?
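in other words, the nested dispatch would effectively compute something like this (a rough sketch only; the attribute names are illustrative, not the exact torchao internals):

import torch.nn.functional as F

def _linear_with_composed_subclasses(x, weight, bias=None):
    # outermost wrapper: apply the SmoothQuant scale to the activation
    x = x / weight.scale                  # WeightTensorWithLinearActivationScaleMetadata
    inner = weight.original_weight_tensor
    # next wrapper: quantize the activation (dynamically here)
    x_q = inner.input_quant_func(x)       # LinearActivationQuantizedTensor
    w_q = inner.original_weight_tensor    # AffineQuantizedTensor
    # innermost: AffineQuantizedTensor dispatches to the quantized matmul kernel
    return F.linear(x_q, w_q, bias)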
the naming for the different tensor subclasses is a bit confusing right now I think, we should clean it up a bit later
It works. Thanks
Hi @jerryzh168 It's weird that if I add these lines https://github.com/pytorch/ao/blob/f595ed41b99685cc16fc480ca2218965bb812bed/torchao/kernel/intmm.py#L142C1-L146C1 to avoid overflow in float16, there is a failure in test_spinquant.py, even though that test does not use float16 at all. And there is another failure with CUDA nightly, but its log cannot be loaded. These failures cannot be reproduced in my local environment or on an AWS instance.
So I had to remove these lines and also remove the tests for float16. If users try to run in fp16, they will hit the overflow as well. Do you have any suggestions? Thanks.
what is the test failure? is it possible to do the dtype conversion before calling int_scaled_matmul, e.g. before
y_dot_scaled = int_scaled_matmul(tmp, w_vals_int8_t, x_scales.reshape(-1, 1)) in affine_quantized_tensor.py?
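e.g. something along these lines (just a sketch of the suggestion, not the exact change):

# upcast the fp16 scales before the scaled int matmul and cast the result back
orig_dtype = x_scales.dtype
if orig_dtype == torch.float16:
    x_scales = x_scales.to(torch.float32)  # avoid float16 overflow in the rescaling
y_dot_scaled = int_scaled_matmul(tmp, w_vals_int8_t, x_scales.reshape(-1, 1))
y_dot_scaled = y_dot_scaled.to(orig_dtype)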
I just saw the error, it's a triton error:
E1018 18:45:40.023529 436 site-packages/torch/_inductor/runtime/triton_heuristics.py:475] assert lhs.shape[1].value >= 32, "small blocks not supported!"
E1018 18:45:40.023529 436 site-packages/torch/_inductor/runtime/triton_heuristics.py:475] AssertionError: small blocks not supported!
The error is that the results are not all close; one element exceeds the tolerance by a small amount. As for the dtype conversion, I didn't make such changes in affine_quantized_tensor.py 🤔
Thanks for the info. Did you see which test case failed?
@jerryzh168 I tried to move the dtype conversion to affine_quantized_tensor.py and it worked. Now all checks pass. Thanks.
Hi @jerryzh168 I have updated this PR. Please take a look again. Thanks
BTW, I found that torchao's observer behaves differently from pytorch's observer when running on cuda: torchao's observer keeps its self.min_val and self.max_val on the same device as the input tensor, but pytorch's observer always keeps them on cpu. Is that something that needs a fix? Thanks.
I see, I feel min_val/max_val being on the same device as the input makes more sense? or are you saying we should add an option here?
Oh, I thought you might want them to have the same behavior. It's alright if that is not an issue.