Quantization-aware models increase in size
Hi, I am training a model using quantization-aware training, and I have a couple of questions:
It seems that quantization-aware training actually increases the size of the model (from ~87 MB to ~137 MB). I came across this explanation at https://nervanasystems.github.io/distiller/quantization.html:
A full precision copy of the weights is maintained throughout the training process ("weights_fp" in the diagram). Its purpose is to accumulate the small changes from the gradients without loss of precision (Note that the quantization of the weights is an integral part of the training graph, meaning that we back-propagate through it as well). Once the model is trained, only the quantized weights are used for inference.
This led me to think that the checkpoint actually contains two sets of weights, which is why it is heavier. Is there any way to discard the full-precision weights once training is finished? Or am I not understanding something?
I understand that Distiller does not support real quantization and instead uses a wrapper around the quantized parts. However, if I export the model to ONNX, will it be executed as a real quantized model?
Thanks in advance!
Apologies for the very late response. To your questions:
- You're correct, both sets of weights are kept. I don't think deleting the set of FP32 weights would work, because when you try to load the checkpoint, even if only for evaluation, loading the state_dict will still look for the FP32 weights. That's because they're defined as a `buffer`, which means PyTorch tries to load them. You could probably override that manually in the code if you really wanted to (see the first sketch after this list for one way to do it).
- I'm not sure if you're referring here to models quantized with quant-aware training, or using post-training quantization:
  - For quant-aware training, the short answer is no. The long answer is that only "fake quantization" is done: we perform quantization + de-quantization on the data, and then the layers execute in their vanilla FP32 form (see the second sketch after this list). Even if those fake-quantization operations can be exported to ONNX (I haven't tried), you'll still end up with a full FP32 model.
  - For post-training quantization, we don't support ONNX export. However, we added the ability to convert a model post-train quantized with Distiller to a "native" PyTorch model, using the native PyTorch quantization APIs. That will execute as a real quantized model, as you put it (only on CPU for the moment, which is what PyTorch supports); see here for details. The third sketch after this list shows roughly what that native flow looks like.
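For the first point, here is a minimal sketch of manually discarding the full-precision copies from a saved checkpoint. The file name, the 'state_dict' entry, and the 'float_weight' key suffix are assumptions for illustration only; print the keys of your own checkpoint to find the exact names your Distiller version uses:

```python
import torch

# Assumption for illustration: the full-precision copies end with this suffix.
# Inspect state_dict.keys() on your own checkpoint to find the real names.
FP32_SUFFIX = 'float_weight'

ckpt = torch.load('qat_checkpoint.pth.tar', map_location='cpu')
state_dict = ckpt['state_dict']

# Keep everything except the full-precision copies (the quantized weights
# and the quantizer's scale/zero-point buffers stay in place).
slim_sd = {k: v for k, v in state_dict.items() if not k.endswith(FP32_SUFFIX)}
ckpt['state_dict'] = slim_sd
torch.save(ckpt, 'qat_checkpoint_slim.pth.tar')

# At evaluation time, load with strict=False so PyTorch doesn't error out
# on the buffers that are no longer in the file:
#   model.load_state_dict(slim_sd, strict=False)
```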
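To illustrate what "fake quantization" means in the quant-aware training case, here is a generic symmetric sketch (not Distiller's exact implementation): the tensor is quantized to an int8 grid and immediately de-quantized, so the layer that consumes it still sees an FP32 tensor.

```python
import torch

def fake_quantize(x, num_bits=8):
    # Symmetric linear quantize + de-quantize. The result is an FP32 tensor
    # again, just snapped onto the int grid, so whatever layer consumes it
    # still runs as a plain FP32 layer.
    qmax = 2 ** (num_bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    x_int = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return x_int * scale

w = torch.randn(64, 128)
w_fq = fake_quantize(w)
print(w_fq.dtype)              # torch.float32 -- still stored in full precision
print(w_fq.unique().numel())   # but only ~256 distinct values
```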
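And for the post-training case, this is roughly what the "native" PyTorch side looks like. The sketch below uses PyTorch's eager-mode quantization API directly on a toy model (it is not the Distiller conversion helper itself), just to show that the converted result really executes as an int8 model on CPU:

```python
import torch
import torch.nn as nn

# Toy FP32 model; QuantStub/DeQuantStub mark where int8 execution starts and
# ends in PyTorch's eager-mode post-training quantization flow.
class TinyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(16, 4)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyNet().eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')  # x86 CPU backend
torch.quantization.prepare(model, inplace=True)

# Calibration: run a few representative batches so the observers record ranges.
with torch.no_grad():
    for _ in range(10):
        model(torch.randn(8, 16))

torch.quantization.convert(model, inplace=True)  # now a real int8 model (CPU only)
print(model.fc)  # -> QuantizedLinear(..., scale=..., zero_point=...)
```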
Thanks for your answer. Looking forward to ONNX export support!
I'm sorry, I don't understand. Since the state_dict is now modified by the addition of various parameters, I cannot load it into my model using the .pth file that is generated. In such a scenario, how do I evaluate the results of my model after QAT?
@xserraalza, how did you evaluate your model after QAT?