
[WIP] SmoothQuant using tensor subclassing

Xia-Weiwen opened this pull request

Still WIP

The implementation of SmoothQuant with tensor subclassing (AffineQuantizedTensor) is similar to that of AWQ, with the following differences:

  • SmoothQuant supports both static and dynamic quantization of activations, while AWQ only uses dynamic quantization
  • The smoothing factor is calculated differently from AWQ's equalization scales (see the sketch below)
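
For reference, a minimal sketch of how the SmoothQuant smoothing factor is typically computed (following the SmoothQuant paper; the helper below is illustrative and its name is made up, it is not the code in this PR):

```python
import torch

def smoothquant_scale(act_absmax: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Per-input-channel smoothing factor s_j = max|X_j|^alpha / max|W_j|^(1 - alpha).

    act_absmax: per-channel max(|X|) collected by an observer, shape [in_features]
    weight: linear weight, shape [out_features, in_features]
    """
    w_absmax = weight.abs().amax(dim=0)                        # max|W_j| per input channel
    scale = act_absmax.pow(alpha) / w_absmax.pow(1.0 - alpha)
    return scale.clamp(min=1e-5)                               # guard against division by zero

# The activation is then divided by the scale and the weight is multiplied by it
# (per input channel) before quantization, e.g.:
#   x_smoothed = x / scale
#   w_smoothed = weight * scale
```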

Xia-Weiwen avatar Oct 08 '24 00:10 Xia-Weiwen

:link: Helpful Links

:test_tube: See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1030

Note: Links to docs will display an error until the docs builds have been completed.

:white_check_mark: No Failures

As of commit cb9167a880286c8f07c2771fc1d3109ea5988ee8 with merge base d4b2f334ee4e8b3b1ba75569b80ddba8bdf8fd6a: :green_heart: Looks good so far! There are no failures yet. :green_heart:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

pytorch-bot[bot] avatar Oct 08 '24 00:10 pytorch-bot[bot]

Hi @jerryzh168 I added a new tensor subclass LinearActivationScaleQuantizedTensor to support x -> x/scale -> quantize x for torch.compile.

If I use LinearActivationQuantizedTensor, the x/scale is done outside the class (by input_quant_func), and there is a dynamo error about scale during torch.compile. I guess this is because the scale tensors are not in the graph in that case. Putting the scale in the weight tensor solves the problem, and WeightTensorWithLinearActivationScaleMetadata alone does not quantize the activation.
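
A rough sketch of the idea (hypothetical and simplified, not the actual code in this PR; the class and attribute names below are made up): the weight subclass carries the smoothing scale, so x / scale and the activation quantization both happen inside the F.linear override and stay visible to dynamo.

```python
import torch
import torch.nn.functional as F

class ScaleThenQuantizeWeight(torch.Tensor):
    """Hypothetical sketch of a weight wrapper that carries the smoothing scale."""

    @staticmethod
    def __new__(cls, weight: torch.Tensor, scale: torch.Tensor, act_quant_func):
        instance = weight.as_subclass(cls)
        instance.scale = scale                    # smoothing factor, shape [in_features]
        instance.act_quant_func = act_quant_func  # activation (fake-)quantization function
        return instance

    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is F.linear:
            x, w = args[0], args[1]
            bias = args[2] if len(args) > 2 else kwargs.get("bias")
            x = x / w.scale               # x -> x / scale
            x = w.act_quant_func(x)       # -> quantize x, captured in the graph
            return F.linear(x, w.as_subclass(torch.Tensor), bias)
        with torch._C.DisableTorchFunctionSubclass():
            return func(*args, **kwargs)
```

The module's float weight would be replaced by such a wrapper before torch.compile, so both the division and the quantization are traced.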

Do you have any concerns about adding this new class? Thanks.

Xia-Weiwen avatar Oct 14 '24 03:10 Xia-Weiwen

> Hi @jerryzh168 I added a new tensor subclass LinearActivationScaleQuantizedTensor to support x -> x/scale -> quantize x for torch.compile.
>
> If I use LinearActivationQuantizedTensor, the x/scale is done outside the class (by input_quant_func), and there is a dynamo error about scale during torch.compile. I guess this is because the scale tensors are not in the graph in that case. Putting the scale in the weight tensor solves the problem, and WeightTensorWithLinearActivationScaleMetadata alone does not quantize the activation.
>
> Do you have any concerns about adding this new class? Thanks.

I think in this case we should be composing WeightTensorWithLinearActivationScaleMetadata and LinearActivationQuantizedTensor together, i.e.:

```python
weight = to_affine_quantized(float_weight, ...)
# this will quantize input
# use https://github.com/pytorch/ao/blob/c87cc9b7286a46e9dfc076fa2417eb9b64ccc807/torchao/quantization/weight_tensor_linear_activation_quantization.py#L13 for static quantization
weight = to_linear_activation_quantized_tensor(weight)  # dynamic quant
# this will do x / scale
weight = to_weight_tensor_with_linear_activation_scale_metadata(weight)
```

At dispatch time, we first unwrap the outermost tensor subclass, which will be WeightTensorWithLinearActivationScaleMetadata, so we'll apply the scale to the activation; then LinearActivationQuantizedTensor, which will quantize the activation; and then AffineQuantizedTensor.

Would this work?

The naming of the different tensor subclasses is a bit confusing right now, I think; we should clean it up a bit later.
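
A rough sketch of that unwrapping order (except for input_quant_func, which is mentioned above, the attribute names here are made up for illustration and are not torchao's actual fields):

```python
import torch.nn.functional as F

def sketch_dispatch_linear(x, weight, bias=None):
    # 1. Outermost: WeightTensorWithLinearActivationScaleMetadata -> apply the
    #    smoothing scale to the activation, then peel off one layer.
    x = x / weight.scale                 # made-up attribute name
    inner = weight.inner                 # made-up attribute name

    # 2. Next: LinearActivationQuantizedTensor -> quantize the activation
    #    (dynamically here; the static variant carries precomputed scales).
    x_q = inner.input_quant_func(x)
    aqt_weight = inner.inner             # innermost AffineQuantizedTensor

    # 3. Innermost: AffineQuantizedTensor -> dispatch to the quantized kernel.
    return F.linear(x_q, aqt_weight, bias)
```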

jerryzh168 avatar Oct 16 '24 21:10 jerryzh168

> I think in this case we should be composing WeightTensorWithLinearActivationScaleMetadata and LinearActivationQuantizedTensor together, i.e.:
>
> ```python
> weight = to_affine_quantized(float_weight, ...)
> # this will quantize input
> # use https://github.com/pytorch/ao/blob/c87cc9b7286a46e9dfc076fa2417eb9b64ccc807/torchao/quantization/weight_tensor_linear_activation_quantization.py#L13 for static quantization
> weight = to_linear_activation_quantized_tensor(weight)  # dynamic quant
> # this will do x / scale
> weight = to_weight_tensor_with_linear_activation_scale_metadata(weight)
> ```
>
> At dispatch time, we first unwrap the outermost tensor subclass, which will be WeightTensorWithLinearActivationScaleMetadata, so we'll apply the scale to the activation; then LinearActivationQuantizedTensor, which will quantize the activation; and then AffineQuantizedTensor.
>
> Would this work?
>
> The naming of the different tensor subclasses is a bit confusing right now, I think; we should clean it up a bit later.

It works. Thanks

Xia-Weiwen avatar Oct 17 '24 13:10 Xia-Weiwen

Hi @jerryzh168 It's weird: if I add these lines https://github.com/pytorch/ao/blob/f595ed41b99685cc16fc480ca2218965bb812bed/torchao/kernel/intmm.py#L142C1-L146C1 to avoid float16 overflow, there is a failure in test_spinquant.py, even though that test does not use float16 at all. There is also another failure with CUDA nightly, but its log cannot be loaded. Neither failure can be reproduced in my local environment or on an AWS instance. So I had to remove these lines and also remove the tests for float16. If users run in fp16, they will hit the overflow as well. Do you have any suggestions? Thanks.

Xia-Weiwen avatar Oct 18 '24 06:10 Xia-Weiwen

> Hi @jerryzh168 It's weird: if I add these lines f595ed4/torchao/kernel/intmm.py#L142C1-L146C1 to avoid float16 overflow, there is a failure in test_spinquant.py, even though that test does not use float16 at all.

What is the test failure? Is it possible to do the dtype conversion before calling int_scaled_matmul, e.g. before y_dot_scaled = int_scaled_matmul(tmp, w_vals_int8_t, x_scales.reshape(-1, 1)) in affine_quantized_tensor.py?
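
If that's the direction, a hedged sketch of what the call site could look like (reusing the names from the line quoted above and assuming int_scaled_matmul is importable from torchao.kernel.intmm; this is not the exact change that landed):

```python
import torch
from torchao.kernel.intmm import int_scaled_matmul

def int_scaled_matmul_fp16_safe(tmp: torch.Tensor,
                                w_vals_int8_t: torch.Tensor,
                                x_scales: torch.Tensor) -> torch.Tensor:
    # Upcast fp16 scales to fp32 before the scaled int matmul so the per-row
    # rescaling does not overflow in half precision, then cast back.
    intermediate_dtype = torch.float32 if x_scales.dtype == torch.float16 else x_scales.dtype
    y = int_scaled_matmul(tmp, w_vals_int8_t, x_scales.reshape(-1, 1).to(intermediate_dtype))
    return y.to(x_scales.dtype)
```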

> There is also another failure with CUDA nightly, but its log cannot be loaded. Neither failure can be reproduced in my local environment or on an AWS instance. So I had to remove these lines and also remove the tests for float16. If users run in fp16, they will hit the overflow as well. Do you have any suggestions? Thanks.

I just saw the error; it is a Triton assertion:


```
E1018 18:45:40.023529 436 site-packages/torch/_inductor/runtime/triton_heuristics.py:475]     assert lhs.shape[1].value >= 32, "small blocks not supported!"
E1018 18:45:40.023529 436 site-packages/torch/_inductor/runtime/triton_heuristics.py:475] AssertionError: small blocks not supported!
```

jerryzh168 avatar Oct 18 '24 17:10 jerryzh168

> What is the test failure? Is it possible to do the dtype conversion before calling int_scaled_matmul, e.g. before y_dot_scaled = int_scaled_matmul(tmp, w_vals_int8_t, x_scales.reshape(-1, 1)) in affine_quantized_tensor.py?

The error is that the results are not all close: one element exceeds the tolerance by a small amount. As for the dtype conversion, I didn't make any such changes in affine_quantized_tensor.py 🤔


> ```
> E1018 18:45:40.023529 436 site-packages/torch/_inductor/runtime/triton_heuristics.py:475]     assert lhs.shape[1].value >= 32, "small blocks not supported!"
> E1018 18:45:40.023529 436 site-packages/torch/_inductor/runtime/triton_heuristics.py:475] AssertionError: small blocks not supported!
> ```

Thanks for the info. Did you see which test case failed?

Xia-Weiwen avatar Oct 19 '24 07:10 Xia-Weiwen

@jerryzh168 I tried to move the dtype conversion to affine_quantized_tensor.py and it worked. Now all checks pass. Thanks.

Xia-Weiwen avatar Oct 21 '24 10:10 Xia-Weiwen

Hi @jerryzh168 I have updated this PR. Please take a look again. Thanks

Xia-Weiwen avatar Oct 23 '24 07:10 Xia-Weiwen

BTW, I found that torchao's observer behaves differently from PyTorch's observer when running on CUDA: torchao's observer keeps its self.min_val and self.max_val on the same device as the input tensor, but PyTorch's observer always keeps them on CPU. Is that something that needs a fix? Thanks.

Xia-Weiwen avatar Oct 23 '24 08:10 Xia-Weiwen

> BTW, I found that torchao's observer behaves differently from PyTorch's observer when running on CUDA: torchao's observer keeps its self.min_val and self.max_val on the same device as the input tensor, but PyTorch's observer always keeps them on CPU. Is that something that needs a fix? Thanks.

I see. I feel min_val/max_val being on the same device as the input makes more sense? Or are you saying we should add an option here?
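
A minimal sketch of the behavior described here, i.e. keeping the running min/max on the input's device so updates stay on that device (the class name and details are made up, not torchao's observer implementation):

```python
import torch

class MinMaxTracker(torch.nn.Module):
    """Toy observer that keeps its running stats on the input's device."""

    def __init__(self):
        super().__init__()
        self.register_buffer("min_val", torch.tensor(float("inf")))
        self.register_buffer("max_val", torch.tensor(float("-inf")))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Move the stats to the input's device once, then update them there,
        # avoiding a host/device transfer on every forward call.
        if self.min_val.device != x.device:
            self.min_val = self.min_val.to(x.device)
            self.max_val = self.max_val.to(x.device)
        self.min_val = torch.minimum(self.min_val, x.detach().amin())
        self.max_val = torch.maximum(self.max_val, x.detach().amax())
        return x
```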

jerryzh168 avatar Oct 23 '24 21:10 jerryzh168

> BTW, I found that torchao's observer behaves differently from PyTorch's observer when running on CUDA: torchao's observer keeps its self.min_val and self.max_val on the same device as the input tensor, but PyTorch's observer always keeps them on CPU. Is that something that needs a fix? Thanks.
>
> I see. I feel min_val/max_val being on the same device as the input makes more sense? Or are you saying we should add an option here?

Oh, I thought you might want them to have the same behavior. It's alright if that is not an issue.

Xia-Weiwen avatar Oct 24 '24 01:10 Xia-Weiwen