[RFC] Porting INC SmoothQuant recipes to IPEX autotune API
https://github.com/intel-innersource/frameworks.ai.pytorch.ipex-cpu/issues/2404
Motivation
SmoothQuant is a popular method for improving the accuracy of INT8 quantization. Intel Extension for PyTorch (IPEX) already supports SmoothQuant and delivers strong performance optimizations. Intel Neural Compressor (INC) adds finer-grained alpha tuning for the SmoothQuant algorithm, which yields better accuracy for LLMs such as Llama2. Integrating this feature into IPEX combines INC's accuracy gains with IPEX's performance, a win-win.
Design
Original Interface
```python
import torch
import intel_extension_for_pytorch as ipex

# Calibrate the model
qconfig = ipex.quantization.default_static_qconfig
calibrated_model = ipex.quantization.prepare(model_to_be_calibrated, qconfig, example_inputs=example_inputs)
for data in calibration_data_set:
    calibrated_model(data)

# Autotune the model
calib_dataloader = torch.utils.data.DataLoader(...)
def eval_func(model):
    # Return accuracy value
    ...
    return accuracy

tuned_model = ipex.quantization.autotune(
    calibrated_model, calib_dataloader, eval_func,
    sampling_sizes=[100], accuracy_criterion={'relative': 0.01}, tuning_time=0
)

# Convert the model to a jit model
quantized_model = ipex.quantization.convert(tuned_model)
with torch.no_grad():
    traced_model = torch.jit.trace(quantized_model, example_input)
    traced_model = torch.jit.freeze(traced_model)

# Do inference
y = traced_model(x)
```
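For reference, `calib_dataloader` and `eval_func` are user-provided. A minimal sketch of what they might look like is shown below; the `validation_dataloader`, batch structure, and batch size are hypothetical placeholders, not part of the IPEX API.

```python
import torch

# Hypothetical calibration loader wrapping the same calibration_data_set used above.
calib_dataloader = torch.utils.data.DataLoader(calibration_data_set, batch_size=1)

def eval_func(model):
    # Hypothetical top-1 accuracy over a held-out validation loader.
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in validation_dataloader:
            predictions = model(inputs).argmax(dim=-1)
            correct += (predictions == labels).sum().item()
            total += labels.numel()
    return correct / total
```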
New Interface for SmoothQuant
SmoothQuant introduces a hyperparameter, alpha, that controls how much quantization difficulty is migrated from activations to weights, reducing overall quantization error. Intel Neural Compressor inherits and enhances this functionality with automatic global alpha tuning and automatic layer-by-layer alpha tuning for the best INT8 accuracy. The tuning arguments are listed in the table below, followed by a short sketch of what alpha controls.
| Arguments | Default Value | Available Values | Comments |
|---|---|---|---|
| alpha | 'auto' | [0-1] / 'auto' | Value that balances activation and weight quantization error. |
| init_alpha | 0.5 | [0-1] / 'auto' | Initial value used to compute the baseline quantization error for auto-tuning. |
| alpha_min | 0.0 | [0-1] | Minimum value of the auto-tuning alpha search space. |
| alpha_max | 1.0 | [0-1] | Maximum value of the auto-tuning alpha search space. |
| alpha_step | 0.1 | [0-1] | Step size of the auto-tuning alpha search space. |
| shared_criterion | "mean" | ["min", "mean", "max"] | Criterion used for the input LayerNorm op of a transformer block. |
| enable_blockwise_loss | False | [True, False] | Whether to enable block-wise auto-tuning. |
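As a rough illustration of what alpha controls, the sketch below follows the SmoothQuant formulation (it is not INC or IPEX internal code): activations are divided by a per-channel scale and weights are multiplied by it before quantization, with alpha trading quantization difficulty between the two.

```python
import numpy as np

def smoothquant_scale(act_absmax, weight_absmax, alpha):
    """Per-channel smoothing scale: s_j = max|X_j|**alpha / max|W_j|**(1 - alpha).
    Activations are divided by s and weights are multiplied by s;
    alpha -> 1 shifts more quantization difficulty onto the weights,
    alpha -> 0 leaves it on the activations."""
    return act_absmax ** alpha / weight_absmax ** (1.0 - alpha)

def alpha_grid(alpha_min=0.0, alpha_max=1.0, alpha_step=0.1):
    """Search space implied by alpha_min / alpha_max / alpha_step in the table above."""
    return np.round(np.arange(alpha_min, alpha_max + alpha_step / 2, alpha_step), 4).tolist()
```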
Proposal 1
In the original flow, calibration is performed twice: once explicitly through prepare plus the calibration loop, and again inside autotune via calib_dataloader. We therefore propose to drop the explicit prepare/calibration step and let autotune handle calibration, which simplifies the code and saves compute resources and time for users, while remaining a compatible change to the original design.
Impact:
- Little development effort; can target IPEX version 2.2.
- Compatible change to the original design.
```python
import intel_extension_for_pytorch as ipex

# Set the tuning space of SmoothQuant
smoothquant_args = {
    "alpha": "auto",
    "auto_alpha_args": {
        "init_alpha": 0.5,
        "alpha_min": 0.0,
        "alpha_max": 1.0,
        "alpha_step": 0.1,
        "shared_criterion": "max",
        "enable_blockwise_loss": False,
    },
}
int8_tuned_model = ipex.quantization.autotune(
    model, calib_dataloader, eval_func, smoothquant_args=smoothquant_args,
    sampling_sizes=[100], accuracy_criterion={'relative': 0.01}, tuning_time=0
)
```
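For clarity, `accuracy_criterion={'relative': 0.01}` above means a tuned model is accepted when its accuracy drops by at most 1% relative to the FP32 baseline. A minimal sketch of that check (illustrative, not the library's code):

```python
def meets_relative_criterion(fp32_accuracy, int8_accuracy, tolerance=0.01):
    """Accept the int8 model if its relative accuracy drop stays within `tolerance`."""
    return (fp32_accuracy - int8_accuracy) / fp32_accuracy <= tolerance

# Example: 0.760 fp32 vs 0.754 int8 -> relative drop of ~0.79%, which is accepted.
print(meets_relative_criterion(0.760, 0.754))  # True
```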
Proposal 2 (Not recommended)
This proposal follows the original design, but it is not feasible today. The main blocker is that an IPEX-prepared model cannot be jit traced inside INC's internal SmoothQuant implementation ([JIRA] prepared model cannot do jit trace). INC relies on jit tracing to recover the relationships between operations and detect which operations share the same input (see the sketch after the code below), so this proposal is not recommended.
Impact:
- INC SmoothQuant is designed to accept an eager model, not an IPEX-prepared model.
- Dependency: [JIRA] prepared model cannot do jit trace
- Substantial development work is required to make this proposal work; it cannot target IPEX version 2.2.
```python
import torch
import intel_extension_for_pytorch as ipex

# Calibrate the model
qconfig = ipex.quantization.get_smooth_quant_qconfig_mapping()
calibrated_model = ipex.quantization.prepare(model_to_be_calibrated, qconfig, example_inputs=example_inputs)
for data in calibration_data_set:
    calibrated_model(data)

# Autotune the model
calib_dataloader = torch.utils.data.DataLoader(...)
def eval_func(model):
    # Return accuracy value
    ...
    return accuracy

# Set the tuning space of SmoothQuant
smoothquant_args = {
    "alpha": "auto",
    "auto_alpha_args": {
        "init_alpha": 0.5,
        "alpha_min": 0.0,
        "alpha_max": 1.0,
        "alpha_step": 0.1,
        "shared_criterion": "max",
        "enable_blockwise_loss": False,
    },
}
tuned_model = ipex.quantization.autotune(
    calibrated_model, calib_dataloader, eval_func, smoothquant_args=smoothquant_args,
    sampling_sizes=[100], accuracy_criterion={'relative': 0.01}, tuning_time=0
)

# Convert the model to a jit model
quantized_model = ipex.quantization.convert(tuned_model)
with torch.no_grad():
    traced_model = torch.jit.trace(quantized_model, example_input)
    traced_model = torch.jit.freeze(traced_model)

# Do inference
y = traced_model(x)
```
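For context, the sketch below shows one way shared-input relationships can be recovered from a jit trace. It is illustrative only, not INC's actual implementation; the op kinds that appear in the graph (here assumed to be `aten::linear`) depend on the model and PyTorch version.

```python
import torch
from collections import defaultdict

def group_ops_by_input(model, example_input, op_kinds=("aten::linear",)):
    """Group ops of the given kinds by the graph value they consume as input.
    This is roughly the information SmoothQuant needs in order to decide which
    layers must share one smoothing scale (and hence one alpha)."""
    traced = torch.jit.trace(model, example_input)
    groups = defaultdict(list)
    for node in traced.inlined_graph.nodes():
        if node.kind() in op_kinds:
            activation = list(node.inputs())[0]  # first input is the activation tensor
            groups[activation.debugName()].append(node.kind())
    return dict(groups)
```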
Proposal 3 (Combination of Proposals 1 & 2) (Final choice)
This proposal partially follows the original design but removes the explicit preparation and calibration steps. It eliminates the redundant calibration identified in Proposal 1 and avoids the jit-trace blocker on prepared models that rules out Proposal 2.
Impact:
- Little development effort; can target IPEX version 2.2.
```python
import torch
import intel_extension_for_pytorch as ipex

# Set the tuning space of SmoothQuant
smoothquant_args = {
    "alpha": "auto",
    "auto_alpha_args": {
        "init_alpha": 0.5,
        "alpha_min": 0.0,
        "alpha_max": 1.0,
        "alpha_step": 0.1,
        "shared_criterion": "max",
        "enable_blockwise_loss": False,
    },
}
tuned_model = ipex.quantization.autotune(
    model, calib_dataloader, eval_func, smoothquant_args=smoothquant_args,
    sampling_sizes=[100], accuracy_criterion={'relative': 0.01}, tuning_time=0
)

# Convert the model to a jit model
quantized_model = ipex.quantization.convert(tuned_model)
with torch.no_grad():
    traced_model = torch.jit.trace(quantized_model, example_input)
    traced_model = torch.jit.freeze(traced_model)

# Do inference
y = traced_model(x)
```
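Because autotune hands the tuned model back before conversion, the convert/trace/freeze steps above remain fully under the user's control. For example, the traced model can be saved and reloaded with the standard TorchScript APIs (usage sketch; the file name is arbitrary):

```python
# Persist the traced int8 model and reload it later for deployment.
traced_model.save("smoothquant_int8.pt")
reloaded_model = torch.jit.load("smoothquant_int8.pt")
with torch.no_grad():
    y = reloaded_model(x)
```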
Commonly used smoothquant_args settings
Auto global alpha tuning
```python
import numpy

smoothquant_args = {
    "alpha": numpy.arange(0.0, 1.0, 0.1).tolist(),  # [0.0, 0.1, ..., 0.9]; the stop value 1.0 is excluded
}
```
Auto layer-wise alpha tuning
```python
smoothquant_args = {
    "alpha": "auto",
    "auto_alpha_args": {
        "init_alpha": 0.8,
        "alpha_min": 0.8,
        "alpha_max": 0.99,
        "alpha_step": 0.01,
        "shared_criterion": "mean",
        "enable_blockwise_loss": False,
    },
}
```
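As a rough guide to what the layer-wise setting above searches (an illustrative sketch assuming each layer independently picks an alpha from the grid; INC's exact search strategy may differ):

```python
import numpy as np

# Grid implied by alpha_min=0.8, alpha_max=0.99, alpha_step=0.01:
# 20 candidate values per layer, with the search starting from init_alpha=0.8.
per_layer_grid = np.round(np.arange(0.8, 0.99 + 1e-9, 0.01), 2).tolist()
print(per_layer_grid)  # [0.8, 0.81, ..., 0.99]
```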
After a synchronization meeting, we decided to adopt Proposal 3, which keeps post-processing after automatic tuning fully flexible.