TorchFX GPTQ pattern replacer
Required for GPTQ int4 LLMs (ChatGLM2, ChatGLM3, Llama2). Includes the transformations below:
- GPTQ Decompression pattern replacer
- GPTQ Multiplication pattern replacer
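For context, here is a minimal pure-Python sketch of the arithmetic that a GPTQ int4 decompression pattern represents: each 32-bit word packs eight 4-bit quantized values, which are unpacked, shifted by the zero point, and multiplied by the scale to recover float weights. The function names, the nibble ordering, and the scalar (non-tensor) form are illustrative assumptions, not the actual OpenVINO transformation:

```python
# Hypothetical sketch of GPTQ int4 decompression arithmetic.
# Real GPTQ kernels operate on packed weight tensors with per-group
# scales and zero points; this shows only the per-word math.

def unpack_int4(packed_word: int) -> list[int]:
    """Extract eight unsigned 4-bit values from one 32-bit integer
    (low nibble first; the actual packing order may differ)."""
    return [(packed_word >> (4 * i)) & 0xF for i in range(8)]

def dequantize(packed_word: int, scale: float, zero_point: int) -> list[float]:
    """Reconstruct float weights: w = (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in unpack_int4(packed_word)]

# Example: every nibble of 0x55555555 is 5, so each q = 5.
weights = dequantize(0x55555555, scale=0.1, zero_point=8)
```

The decompression pattern replacer's job is to recognize this unpack/subtract/multiply subgraph in the traced FX graph and map it to OpenVINO's compressed-weight representation instead of executing it eagerly.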
Do we know why the patterns look significantly different from what we have on the TS side when the same model is used as input? @mvafin, do we have tests for GPTQ models enabled for the TS path, and do they still pass with the recent versions of everything?
We are patching the model in the TS path with our own implementation of MatMul with quantization, which may be why we see different patterns. But we patch only because the source model cannot be traced otherwise. @cavusmustafa, since you are not patching the model on the FX path yet still get patterns that entirely represent the area of interest in the graph, the initial model can evidently be traced successfully in FX, and patching is not required for tracing. @cavusmustafa, have you tried the TS path for the same model with patching disabled inside the TS decoder, as we discussed some time ago? If the model can be traced in TS without patching, then we can apply the same approach implemented in this PR to the TS path as well and remove GPTQ patching from the TS decoder. @mvafin, could you help @cavusmustafa check that?
@slyalin Yes, we have a GPTQ model in tests for the TS path: https://github.com/openvinotoolkit/openvino/blob/a1de1908bfd5f597d48422f21796daa3bfd08120/tests/model_hub_tests/pytorch/test_hf_transformers.py#L566 It is in precommit, so it definitely still works.
I didn't need to patch the model because the decompression pattern was already part of the TorchFX graph, so the pattern shown in the first drawing above is already provided by the graph itself. For the TS backend, I did a quick test by removing patching from the TS decoder and comparing performance. I did not observe any performance difference, but a deeper analysis should certainly be done. Should the TS side of the issue be clarified in this PR, or can it be handled in a separate PR if a fix turns out to be needed?
build_jenkins
build_jenkins
build_jenkins
Please add a label for code freeze
build_jenkins
build_jenkins
@cavusmustafa Could you add a test for a GPTQ model? It can be a GHA test, and it doesn't have to be in the scope of this PR.
Sure, I will add tests in a later PR.