cuDLA-samples
Why should we use apply_custom_rules_to_quantizer?
In quantize.py I found the following function, and it is used in qat.py. Why do we need to find the quantizer pairs? And why do we set `major = bottleneck.cv1.conv._input_quantizer`, then `bottleneck.addop._input0_quantizer = major` and `bottleneck.addop._input1_quantizer = major`?
```python
def apply_custom_rules_to_quantizer(model: torch.nn.Module, export_onnx: Callable):
    # Apply rules to graph: export a temporary ONNX to discover the quantizer pairs.
    export_onnx(model, "quantization-custom-rules-temp.onnx")
    pairs = find_quantizer_pairs("quantization-custom-rules-temp.onnx")
    print(pairs)
    for major, sub in pairs:
        print(f"Rules: {sub} match to {major}")
        get_attr_with_path(model, sub)._input_quantizer = get_attr_with_path(model, major)._input_quantizer  # why use the same input_quantizer??
    os.remove("quantization-custom-rules-temp.onnx")

    for name, bottleneck in model.named_modules():
        if bottleneck.__class__.__name__ == "Bottleneck":
            if bottleneck.add:
                print(f"Rules: {name}.add match to {name}.cv1")
                major = bottleneck.cv1.conv._input_quantizer
                bottleneck.addop._input0_quantizer = major
                bottleneck.addop._input1_quantizer = major
```
Thanks.
If we use https://github.com/NVIDIA-AI-IOT/cuDLA-samples/tree/main/export#option1, the generated model can also run on the GPU. However, if the Q&DQ nodes of these tensors are inconsistent, the QAT model ends up with many useless int8->fp16 and fp16->int8 data conversions, which slows down model inference.
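To make this concrete, here is a minimal numerical sketch (not taken from the repo, plain PyTorch only) of what happens at the elementwise add in a Bottleneck. When both inputs share one input quantizer they share one scale, so the add can stay in the integer domain; when each input has its own scale, one side must be dequantized and re-quantized first, which is the extra reformat work mentioned above.

```python
# Hypothetical illustration: symmetric per-tensor int8 quantization of two
# tensors that feed an elementwise add.
import torch

def quantize(x: torch.Tensor, amax: float):
    """Symmetric int8 quantization: map [-amax, amax] to [-127, 127]."""
    scale = amax / 127.0
    q = torch.clamp(torch.round(x / scale), -127, 127).to(torch.int8)
    return q, scale

x = torch.randn(4)
y = torch.randn(4)

# Case 1: both inputs use the SAME scale (shared input quantizer).
amax_shared = float(torch.max(x.abs().max(), y.abs().max()))
qx, s = quantize(x, amax_shared)
qy, _ = quantize(y, amax_shared)
# Pure integer add followed by a single rescale -- no fp conversion needed.
int8_sum = (qx.to(torch.int32) + qy.to(torch.int32)) * s
print("shared scale error    :", (int8_sum - (x + y)).abs().max().item())

# Case 2: each input has its OWN scale (independent quantizers).
qx2, sx = quantize(x, float(x.abs().max()))
qy2, sy = quantize(y, float(y.abs().max()))
# The int8 codes cannot be added directly; each side is first dequantized to
# floating point -- this is the int8->fp16 / fp16->int8 round trip in the engine.
fp_sum = qx2.float() * sx + qy2.float() * sy
print("per-tensor scale error:", (fp_sum - (x + y)).abs().max().item())
```

That is why apply_custom_rules_to_quantizer assigns the same input quantizer object to both inputs of addop: after calibration they carry identical amax/scale values, the exported Q&DQ nodes match, and the add can be fused in int8 without reformat layers.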