neural-compressor
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
The PostTrainingQuantConfig below produces fp32 ops for the NPU with 2.4.1. Models with int8 and fp16 ops would be preferred for the NPU. `conf = PostTrainingQuantConfig(quant_level='auto', device='npu', backend="onnxrt_dml_ep", quant_format="QOperator", approach="static", excluded_precisions=['bf16'])`
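For reference, a minimal sketch of how that config is usually driven end to end through INC's post-training entry point; the model path and calibration dataloader are placeholders, not taken from the report:

```python
from neural_compressor import PostTrainingQuantConfig, quantization

# Config from the report above: static QOperator quantization targeting the
# NPU through the ONNX Runtime DirectML EP, with bf16 excluded.
conf = PostTrainingQuantConfig(
    quant_level="auto",
    device="npu",
    backend="onnxrt_dml_ep",
    quant_format="QOperator",
    approach="static",
    excluded_precisions=["bf16"],
)

# "model.onnx" and calib_dataloader are placeholders; static quantization
# needs a calibration dataloader to collect activation ranges.
q_model = quantization.fit(
    model="model.onnx",
    conf=conf,
    calib_dataloader=calib_dataloader,
)
q_model.save("model-int8.onnx")
```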
I'm not sure if I'm missing an option somewhere, but AWQ quantization of large ONNX models is very slow. When quantizing a 7B LLaMA model, the following four `np.matmul` calls...
Hi team, I am having an issue quantizing a network consisting of Conv and Linear layers with **int8** weights and activations in ONNX. I have tried setting this via op_type_dict, however...
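For context, op_type_dict entries typically take the dtype-constraint form below; this is a minimal sketch, not the reporter's actual config:

```python
from neural_compressor import PostTrainingQuantConfig

# Constrain Conv and MatMul to int8 weights and int8 activations.
op_type_dict = {
    "Conv": {
        "weight": {"dtype": ["int8"]},
        "activation": {"dtype": ["int8"]},
    },
    "MatMul": {
        "weight": {"dtype": ["int8"]},
        "activation": {"dtype": ["int8"]},
    },
}

conf = PostTrainingQuantConfig(approach="static", op_type_dict=op_type_dict)
```

One frequent pitfall worth noting: PyTorch Linear layers usually export to ONNX as MatMul (+ Add) or Gemm nodes, so op_type_dict keys must name the ONNX op types rather than "Linear".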
Hi all, I'm attempting to follow the SmoothQuant tutorial for the LLaMA2-7b model: https://github.com/intel/neural-compressor/tree/master/examples/onnxrt/nlp/huggingface_model/text_generation/llama/quantization/ptq_static System configuration: OS: Windows 11, Python: 3.10.11. My steps: 1. Create project folder: neural-compressor-tutorial...
Hi, I want to write a script to print layer_mappings for distillation. My script looks like this: `for name, module in model.named_modules(): print(name)` but the printed names are far from the default layer_mapping....
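As a point of comparison, a minimal runnable version of that script; the model here is a placeholder, and the relevant point is only that layer_mapping entries must match these `named_modules` names exactly:

```python
import torchvision.models as models

# Placeholder model; substitute the actual student/teacher model.
model = models.resnet18()

# Distillation layer_mapping entries reference modules by these dotted
# names, so printing them is the usual way to discover valid keys.
for name, module in model.named_modules():
    print(name, "->", type(module).__name__)
```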
Hello, the [awq_quantize](https://github.com/intel/neural-compressor/blob/42c2def02e128818f19d8342052ab0544e9623f7/neural_compressor/adaptor/ox_utils/weight_only.py#L703) function [collects the names of input tensors to each MatMul node](https://github.com/intel/neural-compressor/blob/42c2def02e128818f19d8342052ab0544e9623f7/neural_compressor/adaptor/ox_utils/weight_only.py#L758-L764), and later [looks up the parent node that produces the named tensor](https://github.com/intel/neural-compressor/blob/42c2def02e128818f19d8342052ab0544e9623f7/neural_compressor/adaptor/ox_utils/weight_only.py#L783). This assumes the tensors...
Dear all, to make Intel Neural Compressor easy to use in our team, and because we use PyTorch Lightning, I am building Lightning Callbacks that call your hooks...
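A minimal sketch of what such a bridge might look like, assuming INC 2.x's CompressionManager hook names and Lightning 2.x's import path; the reporter's actual wiring is not shown in the issue:

```python
import lightning.pytorch as pl

class INCCompressionCallback(pl.Callback):
    """Bridges Lightning training events to INC compression hooks.

    `compression_manager` is assumed to come from
    neural_compressor.training.prepare_compression(model, confs).
    """

    def __init__(self, compression_manager):
        self.cb = compression_manager.callbacks

    def on_train_start(self, trainer, pl_module):
        self.cb.on_train_begin()

    def on_train_epoch_start(self, trainer, pl_module):
        self.cb.on_epoch_begin(trainer.current_epoch)

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        self.cb.on_step_begin(batch_idx)

    def on_train_batch_end(self, trainer, pl_module, outputs, batch, batch_idx):
        self.cb.on_step_end()

    def on_train_epoch_end(self, trainer, pl_module):
        self.cb.on_epoch_end()

    def on_train_end(self, trainer, pl_module):
        self.cb.on_train_end()
```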
Hi team, I converted a standard T5-small model to ONNX using onnxruntime 1.15.1 and Python 3.10.12, and got different responses on an Intel processor and an AMD processor! Please let me...
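Small numeric drift between CPUs is common with floating-point kernels, so one useful first check is whether the two machines' outputs differ only within tolerance; a minimal sketch, with the model path and saved tensors as placeholders:

```python
import numpy as np
import onnxruntime as ort

# Placeholders: the exported T5 model, one saved input batch, and the
# output produced by the same inputs on the other machine.
sess = ort.InferenceSession("t5-small.onnx", providers=["CPUExecutionProvider"])
inputs = np.load("inputs.npz")
reference = np.load("outputs_other_machine.npy")

outputs = sess.run(None, {k: inputs[k] for k in inputs.files})[0]

# Different SIMD code paths on Intel vs AMD can produce tiny fp32 drift;
# a tolerance check separates that from a real correctness bug.
print("max abs diff:", np.abs(outputs - reference).max())
print("allclose:", np.allclose(outputs, reference, rtol=1e-4, atol=1e-5))
```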
See https://github.com/pytorch/tutorials/issues/2690: it looks like there's a problem with the tutorial at https://pytorch.org/tutorials/recipes/intel_neural_compressor_for_pytorch.html, where Neural Compressor is causing a seg fault. It looks like contributor @ftian1...