neural-compressor
AWQ quantization is very slow for ONNX LLMs
I'm not sure if I'm missing an option somewhere, but AWQ quantization for large ONNX models is very slow. When quantizing a 7B LLaMA model, the following four np.matmul calls take forever to execute, and at the current pace I estimate it would take days to quantize the model:
https://github.com/intel/neural-compressor/blob/26b260e174cac13b023a11caab372b2dcdc593e0/neural_compressor/adaptor/ox_utils/weight_only.py#L466
https://github.com/intel/neural-compressor/blob/26b260e174cac13b023a11caab372b2dcdc593e0/neural_compressor/adaptor/ox_utils/weight_only.py#L488
https://github.com/intel/neural-compressor/blob/26b260e174cac13b023a11caab372b2dcdc593e0/neural_compressor/adaptor/ox_utils/weight_only.py#L615
https://github.com/intel/neural-compressor/blob/26b260e174cac13b023a11caab372b2dcdc593e0/neural_compressor/adaptor/ox_utils/weight_only.py#L636
Would it make sense to allow the user to pass either a torch module or an ONNX model/session to compute the loss (or, at the very least, to perform the matmul computation)? Even replacing the np.matmul calls with plain torch.matmul calls on a CUDA device makes it orders of magnitude faster.
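For illustration, this is a rough sketch of the kind of drop-in change I mean (the helper name and fallback logic are mine, not the actual code in weight_only.py):

```python
import numpy as np
import torch

def cuda_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Illustrative replacement for np.matmul that offloads to a CUDA device when available."""
    if torch.cuda.is_available():
        # Copy the operands to the GPU, multiply there, and bring the result back as a numpy array.
        ta = torch.from_numpy(np.ascontiguousarray(a)).cuda()
        tb = torch.from_numpy(np.ascontiguousarray(b)).cuda()
        return torch.matmul(ta, tb).cpu().numpy()
    # Fall back to numpy when no GPU is present.
    return np.matmul(a, b)
```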
Otherwise, is there an existing workaround or option I'm unaware of that would speed this up? I feel like I might be missing something.
It takes about 1 hour to run AWQ quantization on the Llama-2-7b model on our test device using the scripts in our llama weight-only quantization example. You can refer to the AWQ options in main.py#L325-L336.
We currently have no plans to support torch tensor computation in our ONNX weight-only quantization. However, as an alternative, we recommend considering CuPy in place of NumPy for GPU-accelerated computation; you can try implementing this yourself.
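As a minimal sketch of that idea (assuming CuPy is installed with a matching CUDA toolkit; the helper name is illustrative, not part of neural-compressor):

```python
import numpy as np
import cupy as cp

def gpu_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Illustrative helper: run the matmul on the GPU with CuPy and return a numpy array."""
    # cp.asarray copies the host arrays into device memory.
    result = cp.matmul(cp.asarray(a), cp.asarray(b))
    # cp.asnumpy copies the result back to the host.
    return cp.asnumpy(result)
```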