uint8 quantized model runs slower than fp32 model


Hi, I encountered an issue while running inference on Cortex-A55 (aarch64) with CpuAcc as the backend. There are 2 models: one is fp32 and the other is uint8 quantized. My tests showed that the fp32 model actually ran faster than the uint8 quantized one. I am curious why this happens. Please refer to the attachment for the 2 models. In addition, both the C++ parser mode and the delegate mode show the same issue. I would appreciate your suggestions. Thanks. test.zip
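
For context, the C++ parser path mentioned above follows roughly this shape. This is a minimal sketch using the public ArmNN C++ API; the file and tensor names (u8l.tflite, X.1, 2180) are taken from the commands later in this thread, and the buffer handling is simplified:

    #include <armnn/ArmNN.hpp>
    #include <armnnTfLiteParser/ITfLiteParser.hpp>
    #include <vector>

    int main()
    {
        // Create the runtime and parse the TfLite model.
        armnn::IRuntimePtr runtime = armnn::IRuntime::Create(armnn::IRuntime::CreationOptions());
        auto parser = armnnTfLiteParser::ITfLiteParser::Create();
        armnn::INetworkPtr network = parser->CreateNetworkFromBinaryFile("u8l.tflite");

        // Optimize for CpuAcc, with CpuRef as the fallback backend.
        armnn::IOptimizedNetworkPtr optNet = armnn::Optimize(
            *network, {armnn::Compute::CpuAcc, armnn::Compute::CpuRef}, runtime->GetDeviceSpec());

        armnn::NetworkId netId;
        runtime->LoadNetwork(netId, std::move(optNet));

        // Bind input/output tensors by name (subgraph 0) and run once.
        auto inBind  = parser->GetNetworkInputBindingInfo(0, "X.1");
        auto outBind = parser->GetNetworkOutputBindingInfo(0, "2180");
        armnn::TensorInfo inInfo = inBind.second;
        inInfo.SetConstant(true); // input ConstTensors must be marked constant
        std::vector<uint8_t> inData(inInfo.GetNumBytes());
        std::vector<uint8_t> outData(outBind.second.GetNumBytes());

        armnn::InputTensors  inputs  {{inBind.first,  armnn::ConstTensor(inInfo, inData.data())}};
        armnn::OutputTensors outputs {{outBind.first, armnn::Tensor(outBind.second, outData.data())}};
        runtime->EnqueueWorkload(netId, inputs, outputs);
    }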

liamsun2019 avatar Jul 12 '22 07:07 liamsun2019

ReduceFp32ToFp16 is set to True in my tests.
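
For reference, that switch corresponds to the m_ReduceFp32ToFp16 field of armnn::OptimizerOptions, passed to armnn::Optimize. A minimal sketch follows (assuming network and runtime were created as in the parser sketch above); note that the option substitutes fp32 layers with fp16 where supported, so it would not be expected to change layers that are already QAsymmU8:

    armnn::OptimizerOptions optOptions;
    optOptions.m_ReduceFp32ToFp16 = true; // fp32 -> fp16 substitution where supported

    armnn::IOptimizedNetworkPtr optNet = armnn::Optimize(
        *network, {armnn::Compute::CpuAcc, armnn::Compute::CpuRef},
        runtime->GetDeviceSpec(), optOptions);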

liamsun2019 avatar Jul 12 '22 07:07 liamsun2019

Hi @liamsun2019,

I am getting 2 warnings, for GATHER and TRANSPOSE, when running your models with CpuAcc, as seen in your issue #666. I just want to confirm these are still present for you, so that I can comment correctly on the results.

Running the models on CpuAcc with the following commands, I can confirm the same regression (~245ms for fp32 vs ~263ms for uint8):

./ExecuteNetwork -m u8l.tflite -v -f tflite-binary -c CpuAcc,CpuRef -i X.1 -o 2180 --number-of-threads 1 --iterations 10
./ExecuteNetwork -m fp32.tflite -v -f tflite-binary -c CpuAcc,CpuRef -i input.55 -o 1456 --number-of-threads 1 --iterations 10

From a quick look, I cannot see any operator that runs faster in fp32 than in the uint8 model, so no single operator explains the regression. The profiling output is quite extensive, so I will spend some time looking through it and come back if I find something.

Kind Regards, Cathal.

catcor01 avatar Jul 19 '22 08:07 catcor01

f32.txt u8.txt

catcor01 avatar Jul 19 '22 08:07 catcor01

One thing I have noticed: average pooling (used only once in your model) is not supported in CpuAcc for uint8, and therefore the operation falls back to CpuRef.

Time cost: ~4000us (CpuRef for uint8) vs ~117us (fp32 on CpuAcc), a difference of ~3900us = ~3.9ms. There is also a time cost for the fallback to CpuRef, due to a memory copy before and after the operation, but it is negligible in comparison: ~9us and ~5us, at most ~15us.

@morgolock, you might have an idea on whether uint8 support for average pooling 2d can be added to Compute Library (it seems uint8 max pooling 2d support is already there). Perhaps it cannot be added because of some kind of padding constraint? Warning message:

Warning: WARNING: Layer of type Pooling2d is not supported on requested backend CpuAcc for input data type QAsymmU8 and output data type QAsymmU8 (reason: in validate_arguments src/cpu/kernels/CpuPool2dKernel.cpp:185: exclude_padding equal false is not supported for AVG Pooling with padding on quantized types), falling back to the next backend.
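
To illustrate what the exclude_padding flag in that message controls, here is a minimal float sketch of the two averaging modes (illustrative only; the quantized requantization path is what Compute Library rejects for exclude_padding == false):

    // Average over a k x k window anchored at (x0, y0) of a w x h plane.
    // With excludePadding == false the divisor is the full window size,
    // so out-of-bounds (padded) positions dilute the average with zeros;
    // with excludePadding == true only in-bounds elements are counted.
    float AveragePool(const float* in, int w, int h, int x0, int y0, int k, bool excludePadding)
    {
        float sum = 0.0f;
        int valid = 0;
        for (int y = y0; y < y0 + k; ++y)
        {
            for (int x = x0; x < x0 + k; ++x)
            {
                if (x >= 0 && x < w && y >= 0 && y < h)
                {
                    sum += in[y * w + x];
                    ++valid;
                }
            }
        }
        const int divisor = excludePadding ? valid : k * k;
        return divisor > 0 ? sum / static_cast<float>(divisor) : 0.0f;
    }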

catcor01 avatar Jul 19 '22 10:07 catcor01

Along with the above, the following is what I have discovered:

  • The uint8 model performs quantize and de-quantize operations (NeonQuantizeWorkload_Execute_#227 being the biggest time cost) which add up to approximately 3.5-4 ms (see the quantization sketch after this list).
  • CpuAcc pooling 2d is slower for uint8: it can be up to 1 ms slower.
  • CpuAcc concat is slower for uint8: it can be between 1 and 2 ms slower.
  • The CpuRef gather operation can be twice as slow for the uint8 model (1.25ms vs 0.6ms).
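
For a rough picture of what each of those quantize/de-quantize steps costs per element, this is a minimal sketch of standard affine QAsymmU8 quantization (the math such workloads implement, not ArmNN's actual NEON code):

    #include <algorithm>
    #include <cmath>
    #include <cstdint>

    // QAsymmU8 affine quantization: q = clamp(round(x / scale) + zeroPoint, 0, 255)
    uint8_t Quantize(float x, float scale, int32_t zeroPoint)
    {
        const int32_t q = static_cast<int32_t>(std::lround(x / scale)) + zeroPoint;
        return static_cast<uint8_t>(std::clamp(q, 0, 255));
    }

    // Inverse mapping: x = scale * (q - zeroPoint)
    float Dequantize(uint8_t q, float scale, int32_t zeroPoint)
    {
        return scale * static_cast<float>(static_cast<int32_t>(q) - zeroPoint);
    }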

catcor01 avatar Jul 19 '22 13:07 catcor01

Hello @liamsun2019,

Falling back to CpuRef is severely degrading your performance. Unfortunately, because many of the transpose and gather operations are not supported on CpuAcc, fallback is inevitable. We do not guarantee that uint8 performance on CpuRef is better than fp32 (it will more than likely be slower, because of how it is implemented in ArmNN), which is why you are seeing worse uint8 performance. However, by using the delegate you can fall back to the TfLite runtime instead of CpuRef, which should give efficient uint8 performance compared to float32. You can do that by running the following:

./ExecuteNetwork -m u8l.tflite -f tflite-binary --tflite-executor delegate -c CpuAcc -i X.1 -o 2180 --number-of-threads 1 --iterations 10
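
The programmatic equivalent, relevant when building the sample code with the delegate enabled, looks roughly like this (a minimal sketch following the ArmNN delegate quick-start pattern; only CpuAcc is requested, so any layer the delegate does not take stays with the default TfLite kernels):

    #include <armnn_delegate.hpp>
    #include <tensorflow/lite/interpreter.h>
    #include <tensorflow/lite/kernels/register.h>
    #include <tensorflow/lite/model.h>

    // Build a TfLite interpreter for the model.
    auto model = tflite::FlatBufferModel::BuildFromFile("u8l.tflite");
    tflite::ops::builtin::BuiltinOpResolver resolver;
    std::unique_ptr<tflite::Interpreter> interpreter;
    tflite::InterpreterBuilder(*model, resolver)(&interpreter);

    // Create the ArmNN delegate with CpuAcc only; anything it cannot
    // take is left to the default TfLite runtime kernels.
    std::vector<armnn::BackendId> backends = { armnn::Compute::CpuAcc };
    armnnDelegate::DelegateOptions delegateOptions(backends);
    std::unique_ptr<TfLiteDelegate, decltype(&armnnDelegate::TfLiteArmnnDelegateDelete)>
        armnnDelegatePtr(armnnDelegate::TfLiteArmnnDelegateCreate(delegateOptions),
                         armnnDelegate::TfLiteArmnnDelegateDelete);
    interpreter->ModifyGraphWithDelegate(armnnDelegatePtr.get());

    interpreter->AllocateTensors();
    interpreter->Invoke();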

I hope this will improve the performance of your uint8 model.

Kind Regards, Cathal.

catcor01 avatar Jul 19 '22 15:07 catcor01

I tried to run your model with the delegate and it failed with the following error:

Warning: WARNING: Layer of type Pooling2d is not supported on requested backend CpuAcc for input data type QAsymmU8 and output data type QAsymmU8 (reason: in validate_arguments src/cpu/kernels/CpuPool2dKernel.cpp:185: exclude_padding equal false is not supported for AVG Pooling with padding on quantized types), falling back to the next backend.
Warning: ERROR: Layer of type Pooling2d is not supported on any preferred backend [CpuAcc ]
terminate called after throwing an instance of 'armnn::Exception'
  what():  TfLiteArmnnDelegate: Exception (Failed to assign a backend to each layer) caught from optimize.

@SadikARM provided me with the following explanation of what is happening: "I believe the reason it is not falling back to the TfLite runtime is that IsLayerSupported() first returns true for the Pooling2d layer, which means the layer has already been delegated to Arm NN; then somewhere in the flow (seemingly during optimization) CpuPool2dKernel::validate_arguments() is called and throws the error. So at the optimization stage it is too late to fall back to the TfLite runtime, because the graph has already been delegated to Arm NN." I will look into this and make a patch.

catcor01 avatar Jul 19 '22 15:07 catcor01

Hi @catcor01,

Many thanks for your time and such a detailed analysis. I ran these 2 models based on the sample code instead. I made some modifications while building them, e.g. -DUSE_ARMNN_DELEGATE=0/1, to apply the delegate or the parser to the sample code. I also noticed that there are many transpose/gather operations in the model, and I think that contributes some overhead to the inference time. In delegate mode I have not encountered the errors you listed. I will spend some time conducting more tests.

Thanks B.R Liam

liamsun2019 avatar Jul 20 '22 02:07 liamsun2019

Hello @liamsun2019,

A patch has been submitted to master (soon to be changed to main) fixing the above failure for CpuAcc. Your model should now be able to fully run using CpuAcc without the above error being thrown.

Kind Regards, Cathal.

catcor01 avatar Jul 29 '22 12:07 catcor01

Hi @catcor01,

Sorry for the late reply. I have been focusing on some other work recently. I will try this patch ASAP. Thanks for your kind help.

liamsun2019 avatar Aug 04 '22 02:08 liamsun2019

@liamsun2019 could you let us know whether this patch has fixed your issue? Otherwise I will close this ticket. Thank you very much.

keidav01 avatar Sep 20 '22 14:09 keidav01

Hi @keidav01 ,

There's no progress on my side yet, since my attention has been taken up by other things. You can close it, and I will verify the patch ASAP. Thanks for your help.

B.R Liam

liamsun2019 avatar Sep 21 '22 01:09 liamsun2019

Thank you @liamsun2019, closing

keidav01 avatar Sep 21 '22 08:09 keidav01