
AWQ fails on ONNX model when a MatMul node's input is a model input/initializer


Hello,

The awq_quantize function collects the names of input tensors to each MatMul node, and later looks up the parent node that produces the named tensor. This assumes the tensors are outputs of nodes in the model, which won't be the case for model inputs or initializers. I noticed this when experimenting with a toy model:

[screenshot: toy model in which a MatMul node consumes the graph input "input" directly]
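
For context, here is a minimal sketch (using `onnx.helper`; the tensor names and shapes are illustrative, not the exact toy model) of the kind of graph that hits this, where the MatMul's first input is the graph input itself:

```python
import numpy as np
import onnx
from onnx import TensorProto, helper, numpy_helper

# The graph input "input" feeds the MatMul directly, so no node produces it.
inp = helper.make_tensor_value_info("input", TensorProto.FLOAT, [1, 4])
out = helper.make_tensor_value_info("output", TensorProto.FLOAT, [1, 8])
weight = numpy_helper.from_array(
    np.random.rand(4, 8).astype(np.float32), name="weight"
)
matmul = helper.make_node("MatMul", ["input", "weight"], ["output"], name="matmul_0")

graph = helper.make_graph([matmul], "toy", [inp], [out], initializer=[weight])
model = helper.make_model(graph)
onnx.checker.check_model(model)
onnx.save(model, "toy_matmul.onnx")
```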

Error message:

2024-01-24 17:25:43 [ERROR] Unexpected exception KeyError('input') happened during tuning.
Traceback (most recent call last):
  File "C:\Users\justoeck\Miniconda3\envs\pytorch2\Lib\site-packages\neural_compressor\quantization.py", line 234, in fit
    strategy.traverse()
  File "C:\Users\justoeck\Miniconda3\envs\pytorch2\Lib\site-packages\neural_compressor\strategy\auto.py", line 140, in traverse
    super().traverse()
  File "C:\Users\justoeck\Miniconda3\envs\pytorch2\Lib\site-packages\neural_compressor\strategy\strategy.py", line 505, in traverse
    q_model = self.adaptor.quantize(copy.deepcopy(tune_cfg), self.model, self.calib_dataloader, self.q_func)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\justoeck\Miniconda3\envs\pytorch2\Lib\site-packages\neural_compressor\utils\utility.py", line 304, in fi
    res = func(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\justoeck\Miniconda3\envs\pytorch2\Lib\site-packages\neural_compressor\adaptor\onnxrt.py", line 1925, in quantize
    tmp_model = awq_quantize(
                ^^^^^^^^^^^^^
  File "C:\Users\justoeck\Miniconda3\envs\pytorch2\Lib\site-packages\neural_compressor\adaptor\ox_utils\weight_only.py", line 783, in awq_quantize
    parent = model.output_name_to_node[input_name]
             ~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^
KeyError: 'input'
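
The failure makes sense given how the lookup works: `output_name_to_node` maps, roughly, each node output name to the node that produces it, so a tensor that is a graph input (or initializer) never appears as a key. A rough paraphrase of the failing lookup (not the adaptor's exact code), using the toy ModelProto above:

```python
# Roughly how an output-name-to-node map is built: only node outputs are keys.
output_name_to_node = {out: node for node in model.graph.node for out in node.output}

# "input" is a graph input, not produced by any node, so this raises KeyError.
parent = output_name_to_node["input"]
```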

I can work around this by inserting an Identity node between the model input and any MatMul node that consumes that input tensor directly:

[screenshot: the same toy model with an Identity node inserted between the graph input and the MatMul]
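
A sketch of that workaround applied to the toy model above (plain onnx API; the node and tensor names are just illustrative):

```python
import onnx
from onnx import helper

model = onnx.load("toy_matmul.onnx")
graph = model.graph

# Route the graph input through an Identity so the MatMul's first input
# is produced by a node that awq_quantize can look up.
identity = helper.make_node("Identity", ["input"], ["input_identity"], name="identity_0")
for node in graph.node:
    if node.op_type == "MatMul" and node.input[0] == "input":
        node.input[0] = "input_identity"

# Keep the graph topologically sorted by placing the Identity first.
nodes = [identity] + list(graph.node)
del graph.node[:]
graph.node.extend(nodes)

onnx.save(model, "toy_matmul_identity.onnx")
```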

I suspect the following edit to awq_quantize would also work (for model inputs at least, but probably not initializers):

for node in model.nodes():
    if (
        node.op_type in ["MatMul"]
        and weight_config.get(node.name, {}) != "fp32"
        and weight_config.get(node.name, {}).get("algorithm", "AWQ") == "AWQ"
+       and node.input[0] not in model.input()
    ):
        output_names.append(node.input[0])
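
For initializers, the same idea could be extended by also checking the graph's initializer names. Here is a standalone sketch of that check on a plain onnx.ModelProto (this is not the adaptor's code, and the ONNXModel wrapper may expose different helpers):

```python
import onnx

model = onnx.load("toy_matmul.onnx")
graph = model.graph

graph_inputs = {i.name for i in graph.input}
initializer_names = {init.name for init in graph.initializer}

output_names = []
for node in graph.node:
    if (
        node.op_type == "MatMul"
        # Skip MatMuls whose activation input has no producer node,
        # i.e. it is a graph input or an initializer.
        and node.input[0] not in graph_inputs
        and node.input[0] not in initializer_names
    ):
        output_names.append(node.input[0])
```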

I considered opening a PR, but I'm not sure what the preferred solution is, plus I see some refactoring for AWQ/GPTQ in the new 3.x API. I'm also unfamiliar with the tests. :)

jstoecker avatar Jan 25 '24 01:01 jstoecker

Hi @jstoecker, thanks for raising this issue, and your contribution is very welcome! As the 3.x API is still under development and subject to change, I suggest you fix it based on the master branch and ask the ORT owners (@mengniwang95, @yuwenzho) to review the PR.

yiliu30 avatar Jan 27 '24 15:01 yiliu30