
WASI-NN should not apply input quantization

Open CIPop opened this issue 2 years ago • 8 comments

Currently, the TFLite wasi-nn implementation performs quantization if quantization scale and zero-point exist (https://github.com/bytecodealliance/wasm-micro-runtime/blob/main/core/iwasm/libraries/wasi-nn/src/wasi_nn_tensorflowlite.cpp#L323)
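For reference, this is roughly what that code path does when the input tensor carries quantization parameters (a paraphrase of the linked implementation; the variable names here are illustrative, not the actual code):

// Sketch: the application supplies float data and the backend quantizes it
// element-wise before handing the tensor to TFLite.
for (uint32_t i = 0; i < num_elements; ++i)
    it[i] = (uint8_t)(input_tensor_f[i] / scale + zero_point);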

This produces very poor detection results with ssd_mobilenet_v1_1_metadata_1.tflite.


The SSD mobilenet v1.1 model has the following input details:

import numpy as np
import tensorflow as tf
i = tf.lite.Interpreter(model_path="ssd_mobilenet_v1_1_metadata_1.tflite")
i.allocate_tensors()
input_details = i.get_input_details()[0]
input_details
{'name': 'normalized_input_image_tensor',
 'index': 175,
 'shape': array([  1, 300, 300,   3], dtype=int32),
 'shape_signature': array([  1, 300, 300,   3], dtype=int32),
 'dtype': numpy.uint8,               <--------------------------------------------------------
 'quantization': (0.0078125, 128),
 'quantization_parameters': {'scales': array([0.0078125], dtype=float32),
  'zero_points': array([128], dtype=int32),
  'quantized_dimension': 0},
 'sparsity_parameters': {}}

The model works well without the RGB input (300x300x3 uint8_t) being quantized (see my bug report at https://github.com/joonb14/TFLiteDetection/issues/1 for a full Jupyter notebook example). When I try to apply quantization (either in Python or by running the input through wasi-nn), I get very poor results.

To work around this issue, I had to apply an inverse function when creating the input tensor:

// Taken from the model's input_details (scale = 0.0078125, zero_point = 128):
#define QUANTIZATION_SCALE 0.0078125f
#define QUANTIZATION_ZERO_POINT 128.0f

// in create_input(...)

    for (int i = 0; i < input.elements; ++i)
    {
        // WAMR / wasi-nn bug: the model does not expect quantized data, but
        // wasi-nn quantizes the input internally regardless:
        //     it[i] = (uint8_t)(input_tensor_f[i] / scale + zero_point);
        // Reverse that internal quantization ahead of time (dequantize):
        input.input_tensor[i] = ((float)data[i] - QUANTIZATION_ZERO_POINT) * QUANTIZATION_SCALE;
    }

    return input;
}

With the above workaround, I get exactly the same (good) results both in Python and when running with iwasm (wasi-nn enabled).

I'm confused by https://www.tensorflow.org/lite/performance/post_training_integer_quant#run_the_tensorflow_lite_models, which states that if input_details['dtype'] == np.uint8, quantization should be applied to the input (which is what wasi-nn does)...

CIPop avatar Sep 29 '23 23:09 CIPop

Hi, if I visualize the model with Netron I get the following:

[Netron screenshot of the model's input tensor: issue-pr-netron]

As you can see, the quantization section of the input indicates that the original distribution of your data is [-1, 1) (float/double). Since the model was trained with that range of values, it expects input in [-1, 1) so that the inference engine can quantize it, i.e. convert it from [-1, 1) to [0, 255] (uint8).

-1 * (1/0.0078125) + 128 = 0
0.9921875 * (1/0.0078125) + 128 = 255
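A minimal, self-contained C check of that arithmetic, using the scale and zero point from the input_details above (the quantize helper is mine, not part of any API):

#include <stdint.h>
#include <stdio.h>

static uint8_t quantize(float x, float scale, int32_t zero_point)
{
    return (uint8_t)(x / scale + zero_point);
}

int main(void)
{
    printf("%d\n", quantize(-1.0f, 0.0078125f, 128));      /* prints 0   */
    printf("%d\n", quantize(0.9921875f, 0.0078125f, 128)); /* prints 255 */
    return 0;
}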

In the Python inference path the quantization is not done automatically, so it is up to the user to do it. As you comment in https://github.com/joonb14/TFLiteDetection/issues/1, you are right that the preprocessing there does not quantize the input tensor. However, because the images were loaded as uint8 rather than in the range of values the model was trained with, the quantization is implicit, by coincidence.

In wasi-nn, on the other hand, the quantization is done internally, so it expects the original range of values. In your solution you transform [0, 255], a data distribution that does not correspond to the one the model was trained on, into [-1, 1), which is the valid one. That way the model can perform the transformation to the correct range of values by itself.

Note that wasi-nn expects values in the range with which it has been trained. Any other assumption is (in most cases) wrong, since the distribution of the data when making the inference should be the same as the training (there are always exceptions).

tonibofarull avatar Sep 30 '23 00:09 tonibofarull

Note also that if we wanted the users themselves to quantize the values, we have two options (a hypothetical sketch of the first is shown after the list):

  1. We need a way for wasi-nn to pass the scale and offset information to them, which is not possible at this time.
  2. The user knows the values in advance and writes them in code.
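
As a purely hypothetical sketch of option 1 (none of these names exist in wasi-nn or WAMR's wasi_nn.h; the handle and error types below are placeholders), the runtime could expose something like:

#include <stdint.h>

/* Hypothetical only: not an existing or proposed wasi-nn API. */
typedef struct {
    float scale;        /* e.g. 0.0078125 for this model */
    int32_t zero_point; /* e.g. 128 */
} tensor_quant_params;

/* Illustrative signature: query the quantization metadata of input tensor
 * `index` so the application can decide whether and how to quantize. */
int /* error code */
get_input_quantization(uint32_t ctx /* execution context handle */,
                       uint32_t index,
                       tensor_quant_params *out_params);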

tonibofarull avatar Sep 30 '23 00:09 tonibofarull

@tonibofarull, I just skimmed this issue but perhaps the information that you're looking to pass on could be done if we added a new metadata feature to wasi-nn? I've been looking for examples where this would be useful. Would the metadata need to be attached to the tensor or the graph or the context?

abrown avatar Oct 02 '23 21:10 abrown

The metadata is already in the model, at least in the case of TFLite. The problem reported by @CIPop is that the input range expected by wasi-nn is the one used for training (which from my point of view is correct) rather than the quantized version directly. In this case the quantized type happened to be uint8, the same as the image format, but it could have been uint16 or anything else, in which case the images would have to be rescaled no matter what.

Perhaps what we can do is allow users to decide whether to quantize manually or let the runtime assume the input is in the training range.

tonibofarull avatar Oct 02 '23 21:10 tonibofarull

@tonibofarull I just verified that with an input in the expected [-1..1] range, WASI-NN performs correctly. Thank you for the in-depth explanation!

We could add documentation for this in WAMR's WASI-NN wasi_nn.h.

The pre-processing should be:

  1. Obtain the input - coincidentally uint8_t, 300x300x3, values [0..255].
  2. Transform the input to float 300x300x3, values [-1..1].
  3. Based on the model's input type (uint8_t), apply the quantization parameters. This transforms the input back to uint8_t 300x300x3, values [0..255].

The important part is that, while the tensors in steps 1 and 3 have the same shape and type, the values are clearly different.

Tested in Python:

import numpy as np

# `im` is the test image loaded with PIL elsewhere in the notebook;
# `input_details` comes from i.get_input_details()[0] as shown above.
res_im = im.resize((300, 300))
np_res_im = np.array(res_im)

# Transform the input from RGB [0..255] to [-1..1]
np_res_im = (np_res_im / 255) * 2 - 1

# From https://www.tensorflow.org/lite/performance/post_training_integer_quant#run_the_tensorflow_lite_models
# If the input type is quantized, rescale the input data to uint8
if input_details['dtype'] == np.uint8:
    input_scale, input_zero_point = input_details["quantization"]
    np_res_im = np_res_im / input_scale + input_zero_point

np_res_im = np.expand_dims(np_res_im, axis=0).astype(input_details["dtype"])

# Quantized input, values back in [0..255].
print(np_res_im)

Tested in WASI-NN / C:

    for (int i = 0; i < input.elements; ++i)
    {
        // WASI-NN expects non-quantized RGB data in [-1..1]; it quantizes internally.
        input.input_tensor[i] = ((float)data[i] / 255) * 2 - 1;
    }

Given @tonibofarull's explanation and the official TFLite quantization documentation, I am now convinced this isn't a WASI-NN / TFLite implementation bug.

This explanation is a bit ambiguous:

Lets assume the expected image is 300x300 pixels, with three channels (red, blue, and green) per pixel. This should be fed to the model as a flattened buffer of 270,000 byte values (300x300x3). If the model is quantized, each value should be a single byte representing a value between 0 and 255.

The second sentence is true only if the model is indeed quantized. I would expect that non-quantized models would accept a flattened buffer of 270000 float values.

Feel free to close this unless you'd like to keep it open to add the extra metadata API that would allow external quantization.

CIPop avatar Oct 02 '23 23:10 CIPop

Note that wasi-nn expects values in the range with which it has been trained. Any other assumption is (in most cases) wrong, since the distribution of the data when making the inference should be the same as the training (there are always exceptions).

why? i expect a user to provide whatever the model expects - in this case, quantized input. after all, the same user loaded the model; they should know what it takes.

the implicit quantization in question seems like a bug to me.

yamt avatar Jun 13 '25 02:06 yamt

We need a way for wasi-nn to pass the scale and offset information to them, which is not possible at this time.

do you mean something like tflite's interpreter.get_input_details? i agree it would be convenient.

yamt avatar Jun 13 '25 02:06 yamt

https://github.com/bytecodealliance/wasm-micro-runtime/pull/4517 disabled the quantization handling logic for wasi_ephemeral_nn because it caused incompatibilities with certain applications and with other wasm runtimes.

yamt avatar Aug 28 '25 05:08 yamt