PostTrainingQuantConfig(quant_level='auto', device='npu', backend="onnxrt_dml_ep") produces fp32 ops.
The PostTrainingQuantConfig below produces fp32 ops for the NPU with neural-compressor 2.4.1. Models with int8 and fp16 ops would be preferred for the NPU.
from neural_compressor import PostTrainingQuantConfig
conf = PostTrainingQuantConfig(quant_level='auto', device='npu', backend="onnxrt_dml_ep", quant_format="QOperator", approach="static", excluded_precisions=['bf16'])
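For reference, a minimal sketch of how such a config is typically passed to the quantization entry point; the model path "model.onnx" and calib_dataloader are placeholders for whatever model and calibration data the original run used.

```python
from neural_compressor import quantization

# "model.onnx" and calib_dataloader are placeholders for the actual model
# and calibration data; supply your own before running.
q_model = quantization.fit("model.onnx", conf, calib_dataloader=calib_dataloader)
q_model.save("model_int8.onnx")
```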
Hi @kleiti, the onnxrt_dml_ep backend is experimental, and we currently only support int8 MatMul. We will enhance its functionality later.
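One way to confirm this on your side is to count the op types in the saved model; the sketch below assumes the quantized model was written to "model_int8.onnx" as in the snippet above.

```python
from collections import Counter

import onnx

# Path is a placeholder for wherever the quantized model was saved.
model = onnx.load("model_int8.onnx")
op_counts = Counter(node.op_type for node in model.graph.node)

# With quant_format="QOperator", quantized MatMuls appear as QLinearMatMul
# (or MatMulInteger), while ops left in fp32 keep their original op types.
for op_type, count in sorted(op_counts.items()):
    print(f"{op_type}: {count}")
```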