PostTrainingQuantConfig(quant_level='auto', device='npu', backend="onnxrt_dml_ep") produces fp32 ops.
The PostTrainingQuantConfig below produces fp32 ops for the NPU with neural-compressor 2.4.1. Models with int8 and fp16 ops would be preferred for the NPU.
from neural_compressor import PostTrainingQuantConfig
conf = PostTrainingQuantConfig(quant_level='auto', device='npu', backend="onnxrt_dml_ep", quant_format="QOperator", approach="static", excluded_precisions=['bf16'])
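For reference, a minimal sketch of how such a config is typically passed to the quantization entry point; the model path "model.onnx" and calib_dataloader are placeholders for whatever model and calibration data the original run used.

```python
from neural_compressor import quantization

# "model.onnx" and calib_dataloader are placeholders for the actual model
# and calibration data; supply your own before running.
q_model = quantization.fit("model.onnx", conf, calib_dataloader=calib_dataloader)
q_model.save("model_int8.onnx")
```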
Hi @kleiti, the onnxrt_dml_ep backend is experimental, and we currently only support int8 MatMul. We will enhance its functionality later.
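One way to confirm this on your side is to count the op types in the saved model; the sketch below assumes the quantized model was written to "model_int8.onnx" as in the snippet above.

```python
from collections import Counter

import onnx

# Path is a placeholder for wherever the quantized model was saved.
model = onnx.load("model_int8.onnx")
op_counts = Counter(node.op_type for node in model.graph.node)

# With quant_format="QOperator", quantized MatMuls appear as QLinearMatMul
# (or MatMulInteger), while ops left in fp32 keep their original op types.
for op_type, count in sorted(op_counts.items()):
    print(f"{op_type}: {count}")
```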