[RFC] Porting INC SmoothQuant recipes to IPEX autotune API
https://github.com/intel-innersource/frameworks.ai.pytorch.ipex-cpu/issues/2404
Motivation
SmoothQuant is a popular method for improving the accuracy of INT8 quantization. Intel Extension for PyTorch (IPEX) already supports SmoothQuant and delivers strong performance optimizations. Intel Neural Compressor (INC) adds finer-grained alpha tuning for the SmoothQuant algorithm, which yields better accuracy for LLMs such as Llama2. Integrating this feature into IPEX combines INC's accuracy gains with IPEX's performance, a win-win.
Design
Original Interface
```python
import torch
import intel_extension_for_pytorch as ipex

# Calibrate the model
qconfig = ipex.quantization.default_static_qconfig
calibrated_model = ipex.quantization.prepare(model_to_be_calibrated, qconfig, example_inputs=example_inputs)
for data in calibration_data_set:
    calibrated_model(data)

# Autotune the model
calib_dataloader = torch.utils.data.DataLoader(...)
def eval_func(model):
    # Return accuracy value
    ...
    return accuracy

tuned_model = ipex.quantization.autotune(
    calibrated_model, calib_dataloader, eval_func,
    sampling_sizes=[100], accuracy_criterion={'relative': 0.01}, tuning_time=0
)

# Convert the model to a jit model
quantized_model = ipex.quantization.convert(tuned_model)
with torch.no_grad():
    traced_model = torch.jit.trace(quantized_model, example_input)
    traced_model = torch.jit.freeze(traced_model)

# Do inference
y = traced_model(x)
```
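For reference, `calib_dataloader` and `eval_func` are user-provided. A minimal sketch of what they might look like is shown below; the `validation_dataloader`, batch structure, and batch size are hypothetical placeholders, not part of the IPEX API.

```python
import torch

# Hypothetical calibration loader wrapping the same calibration_data_set used above.
calib_dataloader = torch.utils.data.DataLoader(calibration_data_set, batch_size=1)

def eval_func(model):
    # Hypothetical top-1 accuracy over a held-out validation loader.
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in validation_dataloader:
            predictions = model(inputs).argmax(dim=-1)
            correct += (predictions == labels).sum().item()
            total += labels.numel()
    return correct / total
```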
New Interface for SmoothQuant
SmoothQuant introduces a hyperparameter, alpha, that controls how much quantization difficulty is migrated from activations to weights, reducing overall quantization error. Intel Neural Compressor inherits and enhances this functionality with automatic global alpha tuning and automatic layer-by-layer alpha tuning for the best INT8 accuracy. The tuning arguments are listed in the table below, followed by a short sketch of what alpha controls.
| Arguments | Default Value | Available Values | Comments |
|---|---|---|---|
| alpha | 'auto' | [0-1] / 'auto' | Value that balances activation and weight quantization error. |
| init_alpha | 0.5 | [0-1] / 'auto' | Initial value used to compute the baseline quantization error for auto-tuning. |
| alpha_min | 0.0 | [0-1] | Minimum value of the auto-tuning alpha search space. |
| alpha_max | 1.0 | [0-1] | Maximum value of the auto-tuning alpha search space. |
| alpha_step | 0.1 | [0-1] | Step size of the auto-tuning alpha search space. |
| shared_criterion | "mean" | ["min", "mean", "max"] | Criterion used for the input LayerNorm op of a transformer block. |
| enable_blockwise_loss | False | [True, False] | Whether to enable block-wise auto-tuning. |
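As a rough illustration of what alpha controls, the sketch below follows the SmoothQuant formulation (it is not INC or IPEX internal code): activations are divided by a per-channel scale and weights are multiplied by it before quantization, with alpha trading quantization difficulty between the two.

```python
import numpy as np

def smoothquant_scale(act_absmax, weight_absmax, alpha):
    """Per-channel smoothing scale: s_j = max|X_j|**alpha / max|W_j|**(1 - alpha).
    Activations are divided by s and weights are multiplied by s;
    alpha -> 1 shifts more quantization difficulty onto the weights,
    alpha -> 0 leaves it on the activations."""
    return act_absmax ** alpha / weight_absmax ** (1.0 - alpha)

def alpha_grid(alpha_min=0.0, alpha_max=1.0, alpha_step=0.1):
    """Search space implied by alpha_min / alpha_max / alpha_step in the table above."""
    return np.round(np.arange(alpha_min, alpha_max + alpha_step / 2, alpha_step), 4).tolist()
```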
Proposal 1
In the original flow, calibration is performed twice: once explicitly through prepare plus the calibration loop, and again inside autotune via calib_dataloader. We therefore propose to drop the explicit prepare/calibration step and let autotune handle calibration, which simplifies the code and saves compute resources and time for users, while remaining a compatible change to the original design.
Impact:
- Little development effort; can target IPEX version 2.2.
- Compatible change to the original design.
```python
import intel_extension_for_pytorch as ipex

# Set the tuning space of SmoothQuant
smoothquant_args = {
    "alpha": "auto",
    "auto_alpha_args": {
        "init_alpha": 0.5,
        "alpha_min": 0.0,
        "alpha_max": 1.0,
        "alpha_step": 0.1,
        "shared_criterion": "max",
        "enable_blockwise_loss": False,
    },
}
int8_tuned_model = ipex.quantization.autotune(
    model, calib_dataloader, eval_func, smoothquant_args=smoothquant_args,
    sampling_sizes=[100], accuracy_criterion={'relative': 0.01}, tuning_time=0
)
```
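For clarity, `accuracy_criterion={'relative': 0.01}` above means a tuned model is accepted when its accuracy drops by at most 1% relative to the FP32 baseline. A minimal sketch of that check (illustrative, not the library's code):

```python
def meets_relative_criterion(fp32_accuracy, int8_accuracy, tolerance=0.01):
    """Accept the int8 model if its relative accuracy drop stays within `tolerance`."""
    return (fp32_accuracy - int8_accuracy) / fp32_accuracy <= tolerance

# Example: 0.760 fp32 vs 0.754 int8 -> relative drop of ~0.79%, which is accepted.
print(meets_relative_criterion(0.760, 0.754))  # True
```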
Proposal 2 (Not recommended)
This proposal follows the original design, but it is not feasible today. The main blocker is that an IPEX-prepared model cannot be jit traced inside INC's internal SmoothQuant implementation ([JIRA] prepared model cannot do jit trace). INC relies on jit tracing to recover the relationships between operations and detect which operations share the same input (see the sketch after the code below), so this proposal is not recommended.
Impact:
- INC SmoothQuant is designed to accept an eager model, not an IPEX-prepared model.
- Dependency: [JIRA] prepared model cannot do jit trace
- Substantial development work is required to make this proposal work; it cannot target IPEX version 2.2.
```python
import torch
import intel_extension_for_pytorch as ipex

# Calibrate the model
qconfig = ipex.quantization.get_smooth_quant_qconfig_mapping()
calibrated_model = ipex.quantization.prepare(model_to_be_calibrated, qconfig, example_inputs=example_inputs)
for data in calibration_data_set:
    calibrated_model(data)

# Autotune the model
calib_dataloader = torch.utils.data.DataLoader(...)
def eval_func(model):
    # Return accuracy value
    ...
    return accuracy

# Set the tuning space of SmoothQuant
smoothquant_args = {
    "alpha": "auto",
    "auto_alpha_args": {
        "init_alpha": 0.5,
        "alpha_min": 0.0,
        "alpha_max": 1.0,
        "alpha_step": 0.1,
        "shared_criterion": "max",
        "enable_blockwise_loss": False,
    },
}
tuned_model = ipex.quantization.autotune(
    calibrated_model, calib_dataloader, eval_func, smoothquant_args=smoothquant_args,
    sampling_sizes=[100], accuracy_criterion={'relative': 0.01}, tuning_time=0
)

# Convert the model to a jit model
quantized_model = ipex.quantization.convert(tuned_model)
with torch.no_grad():
    traced_model = torch.jit.trace(quantized_model, example_input)
    traced_model = torch.jit.freeze(traced_model)

# Do inference
y = traced_model(x)
```
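For context, the sketch below shows one way shared-input relationships can be recovered from a jit trace. It is illustrative only, not INC's actual implementation; the op kinds that appear in the graph (here assumed to be `aten::linear`) depend on the model and PyTorch version.

```python
import torch
from collections import defaultdict

def group_ops_by_input(model, example_input, op_kinds=("aten::linear",)):
    """Group ops of the given kinds by the graph value they consume as input.
    This is roughly the information SmoothQuant needs in order to decide which
    layers must share one smoothing scale (and hence one alpha)."""
    traced = torch.jit.trace(model, example_input)
    groups = defaultdict(list)
    for node in traced.inlined_graph.nodes():
        if node.kind() in op_kinds:
            activation = list(node.inputs())[0]  # first input is the activation tensor
            groups[activation.debugName()].append(node.kind())
    return dict(groups)
```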
Proposal 3 (Combination of Proposals 1 & 2) (Final choice)
This proposal partially follows the original design but removes the explicit preparation and calibration steps. It eliminates the redundant calibration identified in Proposal 1 and avoids the jit-trace blocker on prepared models that rules out Proposal 2.
Impact:
- Little development effort; can target IPEX version 2.2.
```python
import torch
import intel_extension_for_pytorch as ipex

# Set the tuning space of SmoothQuant
smoothquant_args = {
    "alpha": "auto",
    "auto_alpha_args": {
        "init_alpha": 0.5,
        "alpha_min": 0.0,
        "alpha_max": 1.0,
        "alpha_step": 0.1,
        "shared_criterion": "max",
        "enable_blockwise_loss": False,
    },
}
tuned_model = ipex.quantization.autotune(
    model, calib_dataloader, eval_func, smoothquant_args=smoothquant_args,
    sampling_sizes=[100], accuracy_criterion={'relative': 0.01}, tuning_time=0
)

# Convert the model to a jit model
quantized_model = ipex.quantization.convert(tuned_model)
with torch.no_grad():
    traced_model = torch.jit.trace(quantized_model, example_input)
    traced_model = torch.jit.freeze(traced_model)

# Do inference
y = traced_model(x)
```
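Because autotune hands the tuned model back before conversion, the convert/trace/freeze steps above remain fully under the user's control. For example, the traced model can be saved and reloaded with the standard TorchScript APIs (usage sketch; the file name is arbitrary):

```python
# Persist the traced int8 model and reload it later for deployment.
traced_model.save("smoothquant_int8.pt")
reloaded_model = torch.jit.load("smoothquant_int8.pt")
with torch.no_grad():
    y = reloaded_model(x)
```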
Commonly used smoothquant_args settings
Auto global alpha tuning
```python
import numpy

smoothquant_args = {
    "alpha": numpy.arange(0.0, 1.0, 0.1).tolist(),  # [0.0, 0.1, ..., 0.9]; the stop value 1.0 is excluded
}
```
Auto layer-wise alpha tuning
```python
smoothquant_args = {
    "alpha": "auto",
    "auto_alpha_args": {
        "init_alpha": 0.8,
        "alpha_min": 0.8,
        "alpha_max": 0.99,
        "alpha_step": 0.01,
        "shared_criterion": "mean",
        "enable_blockwise_loss": False,
    },
}
```
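As a rough guide to what the layer-wise setting above searches (an illustrative sketch assuming each layer independently picks an alpha from the grid; INC's exact search strategy may differ):

```python
import numpy as np

# Grid implied by alpha_min=0.8, alpha_max=0.99, alpha_step=0.01:
# 20 candidate values per layer, with the search starting from init_alpha=0.8.
per_layer_grid = np.round(np.arange(0.8, 0.99 + 1e-9, 0.01), 2).tolist()
print(per_layer_grid)  # [0.8, 0.81, ..., 0.99]
```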
After a synchronization meeting, we decided to adopt Proposal 3, which keeps post-processing after automatic tuning fully flexible.