neural-compressor
SOTA low-bit LLM quantization (INT8/FP8/INT4/FP4/NF4) & sparsity; leading model compression techniques on TensorFlow, PyTorch, and ONNX Runtime
## Type of Change
feature

## Description
- [x] Support converting an unquantized `linear` into `fp16`
- [ ] Extend the fp16 ops list to align with...
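The entry above describes converting unquantized `linear` modules to fp16. A minimal PyTorch sketch of such a pass might look like the following; the helper name is hypothetical and this is not the actual neural-compressor implementation, which operates on its own quantized-model representation:

```python
import torch
from torch import nn

def convert_unquantized_linear_to_fp16(model: nn.Module) -> nn.Module:
    """Cast every nn.Linear left unquantized to float16 (hypothetical sketch).

    Here any plain nn.Linear is treated as "unquantized"; the real pass
    would only touch layers the quantizer skipped.
    """
    for module in model.modules():
        if isinstance(module, nn.Linear):
            module.half()  # weight and bias become torch.float16 in place
    return model

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model = convert_unquantized_linear_to_fp16(model)
print(model[0].weight.dtype)  # torch.float16
```

Casting in place via `Module.half()` keeps the module graph intact, so only the parameter dtypes change.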
## Type of Change
feature

## Description
- [x] Update config params
- [x] Update `get_autoround_default_run_fn`
- [x] Update prepare/convert
- [x] Return packing model
- [x] Enhance UT
- ...
## Type of Change
bug fix; API changed: no

## Description
Update the lm-eval evaluation in the ONNX Runtime LLM example.

## How has this PR been tested?
Extension test

##...
## Type of Change
SmoothQuant supports `calib_func` for auto-tune, so no dataloader is needed.

## Description
- Enable layer-wise & block-wise
- Add UT to check auto-tune
- Check LLM examples

## Expected Behavior &...
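The entry above replaces the calibration dataloader with a user-supplied `calib_func`. A minimal sketch of that contract, using stand-in names rather than the actual neural-compressor API: the tuner is handed a callable that runs the model over representative inputs so activation statistics can be collected, and the callable's return value is ignored.

```python
def make_calib_func(samples):
    """Build a calibration callable from representative samples (hypothetical)."""
    def calib_func(model):
        for x in samples:
            model(x)  # forward pass only; outputs are discarded
    return calib_func

class StatCollectingModel:
    """Stand-in model that records the activations it sees (hypothetical)."""
    def __init__(self):
        self.seen = []

    def __call__(self, x):
        self.seen.append(x)
        return x * 2

model = StatCollectingModel()
calib = make_calib_func([1.0, 2.0, 3.0])
calib(model)  # the tuner would invoke this during auto-tune
print(model.seen)  # [1.0, 2.0, 3.0]
```

This is why no dataloader is required: the callable fully encapsulates how calibration data reaches the model.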
## Type of Change
3.x example bug fix

## Description

## Expected Behavior & Potential Risk
Pass the extension test.

## How has this PR been tested?

## Dependency Change?