torchchat
linear:int4 issues - RuntimeError: Missing out variants: {'aten::_weight_int4pack_mm'}
```
(py311) mikekg@mikekg-mbp torchchat % python export.py --checkpoint-path ${MODEL_PATH} --temperature 0 --quantize '{"linear:int4": {"groupsize": 128}}' --output-pte mode.pte
[...]
Traceback (most recent call last):
  File "/Users/mikekg/qops/torchchat/export.py", line 111, in <module>
    main(args)
  File "/Users/mikekg/qops/torchchat/export.py", line 91, in main
    export_model_et(
  File "/Users/mikekg/qops/torchchat/export_et.py", line 98, in export_model
    export_program = edge_manager.to_executorch(
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mikekg/miniconda3/envs/py311/lib/python3.11/site-packages/executorch/exir/program/_program.py", line 899, in to_executorch
    new_gm_res = p(new_gm)
                 ^^^^^^^^^
  File "/Users/mikekg/miniconda3/envs/py311/lib/python3.11/site-packages/torch/fx/passes/infra/pass_base.py", line 40, in __call__
    res = self.call(graph_module)
          ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mikekg/miniconda3/envs/py311/lib/python3.11/site-packages/executorch/exir/passes/__init__.py", line 423, in call
    raise RuntimeError(f"Missing out variants: {missing_out_vars}")
RuntimeError: Missing out variants: {'aten::_weight_int4pack_mm'}
```
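For context on what the error means: before memory planning, `to_executorch` runs a pass that requires every functional op in the exported graph to have a registered out variant, and raises exactly this `RuntimeError` when one is missing. Here is a minimal pure-Python sketch of that check — the registry and function names are hypothetical illustrations, not the real ExecuTorch code:

```python
# Hypothetical registry mapping op name -> names of its registered variants.
# In reality this information lives in the PyTorch/ExecuTorch op registries.
OP_VARIANTS = {
    "aten::add": {"default", "out"},
    "aten::mm": {"default", "out"},
    "aten::_weight_int4pack_mm": {"default"},  # no "out" variant registered
}

def find_missing_out_variants(graph_ops):
    """Return the set of ops in the graph that lack an out variant."""
    return {op for op in graph_ops if "out" not in OP_VARIANTS.get(op, set())}

# The int4 linear path lowers matmuls to aten::_weight_int4pack_mm,
# which trips the check; the pass then raises
# RuntimeError(f"Missing out variants: {missing}").
missing = find_missing_out_variants(["aten::mm", "aten::_weight_int4pack_mm"])
print(f"Missing out variants: {missing}")
```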
The current failure is expected, somewhat anyway, after adding the packed call to `_weight_int4pack_mm`, but it is documented incorrectly in docs/quantization.md. I think @lucylq most recently updated the specs to streamline them, but that glossed over the reality that we have a bit of a Swiss-cheese situation. It's not pretty to show, but it is, sadly, our current reality.
I'll try to patch up most execution modes, but we really do need tests. As for performance, maybe the plan should be to hook up `_weight_int4pack_mm` to an asymmetric version of a8w4dq (as per https://github.com/pytorch/torchchat/issues/541). Of course, that's also not quite "correct", but how many modes and operators can we support, and with how much documentation? FP operators already show a bit of spread in accuracy due to rounding effects, so maybe that's justifiable...
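In the meantime, one possible interim path for ExecuTorch export — assuming the a8w4dq path already has its out variants wired up, which is an assumption based on the discussion above and not verified here — would be to use a `linear:a8w4dq` quantize spec rather than `linear:int4`. Illustrative config fragment only; the groupsize value is a placeholder:

```json
{"linear:a8w4dq": {"groupsize": 256}}
```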