neural-compressor
AWQ quantization is very slow for ONNX LLMs
I'm not sure if I'm missing an option somewhere, but AWQ quantization for large ONNX models is very slow. When quantizing a 7B LLaMA model, the following four np.matmul calls take forever to execute, and at the current pace I estimate it would take days to quantize the model:
https://github.com/intel/neural-compressor/blob/26b260e174cac13b023a11caab372b2dcdc593e0/neural_compressor/adaptor/ox_utils/weight_only.py#L466
https://github.com/intel/neural-compressor/blob/26b260e174cac13b023a11caab372b2dcdc593e0/neural_compressor/adaptor/ox_utils/weight_only.py#L488
https://github.com/intel/neural-compressor/blob/26b260e174cac13b023a11caab372b2dcdc593e0/neural_compressor/adaptor/ox_utils/weight_only.py#L615
https://github.com/intel/neural-compressor/blob/26b260e174cac13b023a11caab372b2dcdc593e0/neural_compressor/adaptor/ox_utils/weight_only.py#L636
Would it make sense to allow the user to pass either a torch module or an ONNX model/session to compute the loss (or, at the very least, to perform the matmul computation)? Even replacing the np.matmul calls with plain torch.matmul calls on a CUDA device makes it orders of magnitude faster.
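For illustration, this is a rough sketch of the kind of drop-in change I mean (the helper name and fallback logic are mine, not the actual code in weight_only.py):

```python
import numpy as np
import torch

def cuda_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Illustrative replacement for np.matmul that offloads to a CUDA device when available."""
    if torch.cuda.is_available():
        # Copy the operands to the GPU, multiply there, and bring the result back as a numpy array.
        ta = torch.from_numpy(np.ascontiguousarray(a)).cuda()
        tb = torch.from_numpy(np.ascontiguousarray(b)).cuda()
        return torch.matmul(ta, tb).cpu().numpy()
    # Fall back to numpy when no GPU is present.
    return np.matmul(a, b)
```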
Otherwise, is there an existing workaround or option I'm unaware of that would speed this up? I feel like I might be missing something.
It takes about 1 hour to run AWQ quantization on the Llama-2-7b model on our test device using the scripts in our llama weight-only quantization example. You can refer to the AWQ options in main.py#L325-L336.
We currently have no plans to support torch tensor computation in our ONNX weight-only quantization. However, as an alternative, we recommend considering CuPy in place of NumPy for GPU-accelerated computation; you can try implementing this yourself.
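As a minimal sketch of that idea (assuming CuPy is installed with a matching CUDA toolkit; the helper name is illustrative, not part of neural-compressor):

```python
import numpy as np
import cupy as cp

def gpu_matmul(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Illustrative helper: run the matmul on the GPU with CuPy and return a numpy array."""
    # cp.asarray copies the host arrays into device memory.
    result = cp.matmul(cp.asarray(a), cp.asarray(b))
    # cp.asnumpy copies the result back to the host.
    return cp.asnumpy(result)
```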