Saving/Loading Fake Quant weights
I'm encountering some issues when saving/loading fake quant weights.
I'm trying to save the model with the save_fake option and then load it again to check.
How can I load a fake quant model in llmc?
It looks like setting model_path in the config loads the model with AutoModelForCausalLM.from_pretrained and doesn't restore the EffcientFakeQuantLinear (simulated) layers.
[Disclaimer: I'm not an author of this repo, just an active user]
First, mind the difference between quantizing the weights and quantizing the activations.
If you're interested in quantizing only the model weights (keeping the activations in full precision), you can use the save_trans option. In this case you won't need the fake-quantization wrappers anyway, as the model will be saved/loaded with the modified (quantized) weights.
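For that weight-only case, loading can look roughly like this. A minimal sketch, assuming save_trans writes a standard Hugging Face checkpoint directory; the path below is hypothetical:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical path to the directory written by save_trans.
ckpt_dir = "./save/transformed_model"

# The architecture is unchanged (only the weight values were modified), so the
# standard from_pretrained call is sufficient -- no fake-quant wrappers needed.
model = AutoModelForCausalLM.from_pretrained(ckpt_dir, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(ckpt_dir)
```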
If you want to quantize the activations as well, the EffcientFakeQuantLinear class is a required "fake quantization" wrapper. Adding the wrapper to the model's linear layers effectively changes its architecture, so it can no longer be represented by the pretrained Hugging Face classes, e.g., LlamaForCausalLM.
Therefore, to the best of my understanding, saving it via save_fake does not really preserve its quantization parameters.
I currently see two workarounds:
- Save and load the full model checkpoint with `torch.save()` and `torch.load()`. This is not an elegant option, but an easy one, and it's what I'm using (see the sketch right after this list).
- Save a state dict of the quantized model (which includes the quantization parameters determined during the llmc run), and after loading it, "re-deploy" the quantization wrappers. This may be a more complicated option, but perhaps the existing `deploy()` function can assist.
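The first option can look roughly like this. A minimal sketch, assuming `model` is the fake-quantized model object produced by the llmc run; the file name is arbitrary:

```python
import torch

# Workaround 1: pickle the whole fake-quantized module. This keeps the
# EffcientFakeQuantLinear wrappers, but the llmc code that defines them must be
# importable (on PYTHONPATH) when the checkpoint is loaded back.
torch.save(model, "fake_quant_model.pt")

# Later / in another process:
model = torch.load(
    "fake_quant_model.pt",
    map_location="cpu",
    weights_only=False,  # recent PyTorch versions need this to unpickle full modules
)
model.eval()
```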
Regarding the latter, I'm not 100% sure whether the existing code already supports this, and I'd be glad if the repo authors could comment; a rough sketch of what I have in mind follows.
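This is only a sketch under my own assumptions: `model` is again the fake-quantized model from the llmc run, the paths are hypothetical, and the re-wrapping step is a placeholder for whatever llmc's `deploy()`/module-replacement logic actually does (I'm not claiming this exact flow is supported).

```python
import torch
from transformers import AutoModelForCausalLM

# Step 1, right after the llmc run: save only tensors (quantized weights plus
# any quantization buffers registered by the fake-quant wrappers).
torch.save(model.state_dict(), "fake_quant_state_dict.pt")

# Step 2, when loading: rebuild the original model and re-apply the same
# fake-quant wrappers so that module names and buffer shapes match the saved
# state dict ...
base = AutoModelForCausalLM.from_pretrained("path/to/original_model")
# ... re-wrap base's nn.Linear layers here, mirroring the original llmc config
# (this is where deploy(), or an equivalent helper, would be needed) ...

# ... and finally load the saved tensors into the wrapped model.
state = torch.load("fake_quant_state_dict.pt", map_location="cpu")
base.load_state_dict(state, strict=True)  # strict=True so any mismatch surfaces
```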