TorchFX GPTQ pattern replacer
Required for GPTQ int4 LLMs (ChatGLM2, ChatGLM3, Llama2). Includes the transformations below:
- GPTQ Decompression pattern replacer
- GPTQ Multiplication pattern replacer
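For context, here is a minimal pure-Python sketch of the arithmetic that a GPTQ int4 decompression pattern represents: each 32-bit word packs eight 4-bit quantized values, which are unpacked, shifted by the zero point, and multiplied by the scale to recover float weights. The function names, the nibble ordering, and the scalar (non-tensor) form are illustrative assumptions, not the actual OpenVINO transformation:

```python
# Hypothetical sketch of GPTQ int4 decompression arithmetic.
# Real GPTQ kernels operate on packed weight tensors with per-group
# scales and zero points; this shows only the per-word math.

def unpack_int4(packed_word: int) -> list[int]:
    """Extract eight unsigned 4-bit values from one 32-bit integer
    (low nibble first; the actual packing order may differ)."""
    return [(packed_word >> (4 * i)) & 0xF for i in range(8)]

def dequantize(packed_word: int, scale: float, zero_point: int) -> list[float]:
    """Reconstruct float weights: w = (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in unpack_int4(packed_word)]

# Example: every nibble of 0x55555555 is 5, so each q = 5.
weights = dequantize(0x55555555, scale=0.1, zero_point=8)
```

The decompression pattern replacer's job is to recognize this unpack/subtract/multiply subgraph in the traced FX graph and map it to OpenVINO's compressed-weight representation instead of executing it eagerly.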
Do we know why the patterns look significantly different from what we have on the TS side when the same model is used as input? @mvafin, do we have tests for GPTQ models enabled for the TS path, and do they still pass with the recent versions of everything?
We are patching the model in the TS path with our own implementation of MatMul with quantization, which may be why we see different patterns. But we patch only because the source model cannot be traced otherwise. @cavusmustafa, since you are not patching the model on the FX path yet still get patterns that entirely represent the area of interest in the graph, the initial model can evidently be traced successfully in FX, and patching is not required for tracing. @cavusmustafa, have you tried the TS path for the same model with patching disabled inside the TS decoder, as we discussed some time ago? If the model can be traced in TS without patching, then we can apply the same approach implemented in this PR to the TS path as well and remove GPTQ patching from the TS decoder. @mvafin, could you help @cavusmustafa check that?
@slyalin Yes, we have a GPTQ model in tests for the TS path: https://github.com/openvinotoolkit/openvino/blob/a1de1908bfd5f597d48422f21796daa3bfd08120/tests/model_hub_tests/pytorch/test_hf_transformers.py#L566 It is in precommit, so it definitely still works.
I didn't need to patch the model because the decompression pattern was already part of the TorchFX graph, so the pattern shown in the first drawing above is already provided by the graph itself. For the TS backend, I did a quick test by removing patching from the TS decoder and comparing performance. I did not observe any performance difference, but a deeper analysis should certainly be done. Should the TS side of the issue be clarified in this PR, or can it be handled in a separate PR if a fix turns out to be needed?
build_jenkins
build_jenkins
build_jenkins
Please add a label for code freeze
build_jenkins
build_jenkins
@cavusmustafa Could you add a test for a GPTQ model? It can be a GHA test, and it doesn't have to be in the scope of this PR.
Sure, I will add tests in a later PR.