Benchmark results
Benchmark results of a previous version are available here:
model_name | device_name | soc | abi | runtime | init (ms) | warmup (ms) | run_avg (ms) | tuned
---|---|---|---|---|---|---|---|---
mobilenet_v2 | polaris | sdm845 | armeabi-v7a | GPU | 42.868 | 11.087 | 9.908 | True
mobilenet_v2 | MI MAX | msm8952 | armeabi-v7a | GPU | 122.791 | 43.038 | 39.875 | True
mobilenet_v2 | BKL-AL00 | kirin970 | armeabi-v7a | GPU | 767.932 | 1226.373 | 47.597 | True
mobilenet_v2 | polaris | sdm845 | arm64-v8a | GPU | 42.3 | 10.737 | 10.004 | True
mobilenet_v2 | MI MAX | msm8952 | arm64-v8a | GPU | 129.123 | 42.584 | 39.552 | True
mobilenet_v2 | BKL-AL00 | kirin970 | arm64-v8a | GPU | 753.43 | 1170.291 | 48.016 | True
mobilenet_v2 | polaris | sdm845 | armeabi-v7a | CPU | 16.035 | 69.761 | 41.627 | False
mobilenet_v2 | MI MAX | msm8952 | armeabi-v7a | CPU | 31.319 | 86.206 | 67.586 | False
mobilenet_v2 | BKL-AL00 | kirin970 | armeabi-v7a | CPU | 22.521 | 137.963 | 132.012 | False
mobilenet_v2 | polaris | sdm845 | arm64-v8a | CPU | 10.641 | 80.509 | 31.985 | False
mobilenet_v2 | MI MAX | msm8952 | arm64-v8a | CPU | 32.225 | 86.345 | 54.7 | False
mobilenet_v2 | BKL-AL00 | kirin970 | arm64-v8a | CPU | 20.208 | 97.295 | 93.987 | False
deeplab_v3_plus_mobilenet_v2 | polaris | sdm845 | armeabi-v7a | GPU | 56.512 | 129.422 | 128.976 | True
deeplab_v3_plus_mobilenet_v2 | MI MAX | msm8952 | armeabi-v7a | GPU | 145.582 | 899.824 | 896.452 | True
deeplab_v3_plus_mobilenet_v2 | BKL-AL00 | kirin970 | armeabi-v7a | GPU | 771.122 | 2096.33 | 651.999 | True
deeplab_v3_plus_mobilenet_v2 | polaris | sdm845 | armeabi-v7a | CPU | 34.084 | 951.812 | 932.764 | False
deeplab_v3_plus_mobilenet_v2 | MI MAX | msm8952 | armeabi-v7a | CPU | 91.383 | 1543.423 | 1628.255 | False
deeplab_v3_plus_mobilenet_v2 | BKL-AL00 | kirin970 | armeabi-v7a | CPU | 67.022 | 2885.098 | 2872.558 | False
deeplab_v3_plus_mobilenet_v2 | polaris | sdm845 | arm64-v8a | CPU | 29.376 | 656.16 | 614.679 | False
deeplab_v3_plus_mobilenet_v2 | MI MAX | msm8952 | arm64-v8a | CPU | 99.986 | 1170.636 | 1469.199 | False
deeplab_v3_plus_mobilenet_v2 | BKL-AL00 | kirin970 | arm64-v8a | CPU | 55.476 | 1796.491 | 1793.253 | False
mobilenet_v1 | polaris | sdm845 | armeabi-v7a | GPU | 45.551 | 13.858 | 13.544 | True
mobilenet_v1 | MI MAX | msm8952 | armeabi-v7a | GPU | 114.037 | 65.088 | 61.603 | True
mobilenet_v1 | BKL-AL00 | kirin970 | armeabi-v7a | GPU | 734.51 | 1211.078 | 49.318 | True
mobilenet_v1 | polaris | sdm845 | arm64-v8a | GPU | 45.378 | 13.689 | 12.826 | True
mobilenet_v1 | MI MAX | msm8952 | arm64-v8a | GPU | 110.526 | 64.566 | 61.696 | True
mobilenet_v1 | BKL-AL00 | kirin970 | arm64-v8a | GPU | 730.271 | 1135.675 | 48.124 | True
mobilenet_v1 | polaris | sdm845 | armeabi-v7a | CPU | 6.874 | 79.032 | 49.676 | False
mobilenet_v1 | MI MAX | msm8952 | armeabi-v7a | CPU | 18.332 | 121.923 | 88.207 | False
mobilenet_v1 | BKL-AL00 | kirin970 | armeabi-v7a | CPU | 13.0 | 172.239 | 164.469 | False
mobilenet_v1 | polaris | sdm845 | arm64-v8a | CPU | 11.347 | 90.748 | 32.888 | False
mobilenet_v1 | MI MAX | msm8952 | arm64-v8a | CPU | 18.358 | 113.023 | 71.16 | False
mobilenet_v1 | BKL-AL00 | kirin970 | arm64-v8a | CPU | 11.666 | 111.706 | 107.818 | False
resnet_v2_50 | polaris | sdm845 | armeabi-v7a | GPU | 124.229 | 95.537 | 93.047 | True
resnet_v2_50 | MI MAX | msm8952 | armeabi-v7a | GPU | 280.575 | 637.789 | 636.295 | True
resnet_v2_50 | BKL-AL00 | kirin970 | armeabi-v7a | GPU | 747.875 | 1596.039 | 450.651 | True
resnet_v2_50 | polaris | sdm845 | armeabi-v7a | CPU | 18.57 | 556.961 | 394.792 | False
resnet_v2_50 | MI MAX | msm8952 | armeabi-v7a | CPU | 44.175 | 1240.632 | 734.156 | False
resnet_v2_50 | BKL-AL00 | kirin970 | armeabi-v7a | CPU | 26.034 | 2505.979 | 1284.285 | False
resnet_v2_50 | polaris | sdm845 | arm64-v8a | CPU | 17.241 | 438.925 | 261.949 | False
resnet_v2_50 | MI MAX | msm8952 | arm64-v8a | CPU | 48.691 | 1143.032 | 566.313 | False
resnet_v2_50 | BKL-AL00 | kirin970 | arm64-v8a | CPU | 23.979 | 2169.373 | 499.587 | False
vgg16 | polaris | sdm845 | armeabi-v7a | CPU | 15.537 | 924.855 | 438.6 | False
vgg16 | MI MAX | msm8952 | armeabi-v7a | CPU | 40.055 | 2926.202 | 800.783 | False
vgg16 | BKL-AL00 | kirin970 | armeabi-v7a | CPU | 21.732 | 2514.862 | 1242.532 | False
vgg16 | polaris | sdm845 | arm64-v8a | CPU | 12.837 | 786.419 | 332.642 | False
vgg16 | MI MAX | msm8952 | arm64-v8a | CPU | 40.693 | 2794.225 | 666.285 | False
vgg16 | BKL-AL00 | kirin970 | arm64-v8a | CPU | 20.855 | 2581.558 | 1043.35 | False
vgg16 | polaris | sdm845 | armeabi-v7a | GPU | 679.21 | 128.214 | 125.523 | True
vgg16 | MI MAX | msm8952 | armeabi-v7a | GPU | 1527.823 | 806.779 | 761.073 | True
vgg16 | BKL-AL00 | kirin970 | armeabi-v7a | GPU | 1893.529 | 2551.389 | 1042.256 | True
inception_v3_dsp | polaris | sdm845 | armeabi-v7a | HEXAGON | 585.899 | 77.921 | 38.875 | False
inception_v3 | polaris | sdm845 | armeabi-v7a | CPU | 19.726 | 631.444 | 481.732 | False
inception_v3 | MI MAX | msm8952 | armeabi-v7a | CPU | 47.674 | 958.758 | 839.108 | False
inception_v3 | BKL-AL00 | kirin970 | armeabi-v7a | CPU | 29.131 | 760.945 | 1194.063 | False
inception_v3 | polaris | sdm845 | arm64-v8a | CPU | 22.251 | 578.611 | 425.145 | False
inception_v3 | MI MAX | msm8952 | arm64-v8a | CPU | 50.948 | 888.531 | 761.826 | False
inception_v3 | BKL-AL00 | kirin970 | arm64-v8a | CPU | 27.106 | 668.552 | 789.08 | False
inception_v3 | polaris | sdm845 | armeabi-v7a | GPU | 101.199 | 92.578 | 91.602 | True
inception_v3 | MI MAX | msm8952 | armeabi-v7a | GPU | 257.311 | 588.829 | 586.779 | True
inception_v3 | BKL-AL00 | kirin970 | armeabi-v7a | GPU | 770.779 | 1621.834 | 436.877 | True
squeezenet_v1_1 | polaris | sdm845 | armeabi-v7a | GPU | 33.615 | 10.905 | 10.971 | True
squeezenet_v1_1 | MI MAX | msm8952 | armeabi-v7a | GPU | 83.183 | 47.273 | 44.548 | True
squeezenet_v1_1 | BKL-AL00 | kirin970 | armeabi-v7a | GPU | 268.714 | 437.084 | 39.404 | True
squeezenet_v1_0 | polaris | sdm845 | armeabi-v7a | GPU | 45.145 | 16.719 | 15.0 | True
squeezenet_v1_0 | MI MAX | msm8952 | armeabi-v7a | GPU | 98.571 | 76.282 | 72.081 | True
squeezenet_v1_0 | BKL-AL00 | kirin970 | armeabi-v7a | GPU | 403.515 | 1165.101 | 63.392 | True
squeezenet_v1_0 | polaris | sdm845 | armeabi-v7a | CPU | 7.393 | 94.284 | 60.057 | False
squeezenet_v1_0 | MI MAX | msm8952 | armeabi-v7a | CPU | 27.664 | 171.195 | 110.325 | False
squeezenet_v1_0 | BKL-AL00 | kirin970 | armeabi-v7a | CPU | 14.84 | 169.715 | 93.174 | False
squeezenet_v1_0 | polaris | sdm845 | arm64-v8a | CPU | 11.9 | 117.696 | 49.342 | False
squeezenet_v1_0 | MI MAX | msm8952 | arm64-v8a | CPU | 27.554 | 170.987 | 95.552 | False
squeezenet_v1_0 | BKL-AL00 | kirin970 | arm64-v8a | CPU | 13.76 | 121.544 | 79.353 | False
squeezenet_v1_1 | polaris | sdm845 | arm64-v8a | CPU | 9.583 | 61.783 | 25.376 | False
squeezenet_v1_1 | MI MAX | msm8952 | arm64-v8a | CPU | 21.424 | 98.661 | 53.031 | False
squeezenet_v1_1 | BKL-AL00 | kirin970 | arm64-v8a | CPU | 11.005 | 67.381 | 41.086 | False
More recent results will be available on the GitLab mirror project's CI page soon.
A dedicated mobile-device deep learning framework benchmark project, MobileAIBench, is available here: https://github.com/XiaoMi/mobile-ai-bench
:+1:
The daily benchmark results are available here:
- https://gitlab.com/llhe/mace-models/pipelines
- 2018/06/29 https://gitlab.com/llhe/mace-models/-/jobs/78152526
I really appreciate the results, but I am curious why the DSP result is only available for inception_v3?
@DiamonJoy The benchmark is actually the CI result of the MACE Model Zoo project. Until now, our efforts have mainly focused on the float data type and the CPU/GPU runtimes, and we have not had enough time to add more quantized models to the MACE Model Zoo. Quantization (CPU or DSP) support and adding more models to the MACE Model Zoo are on our roadmap.
@llhe Amazing results! Can you explain a little more about the "tuned" column?
Tuned means the OpenCL kernel is tuned for the specific type of device instead of using the general rule.
Is this tuning process done manually offline, or is it done automatically at run time? If I understand correctly, is it mainly work-group size tuning?
@robertwgh In our original use case, we deploy each model against a specific device (usually a new product), so we want it to be ultimately optimized by brute-force search over a list of workgroup options. However, general application developers usually want to generate a library that applies to all devices.
It's offline now. We may consider improving the general rule or enabling online incremental tuning in the future.
Incorporating more advanced rules, such as ML-based models, is also a potential choice.
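To make the brute-force approach concrete, here is a minimal sketch of workgroup-size tuning. It is not MACE's actual tuner; `RunKernelOnce` is a hypothetical helper that enqueues the OpenCL kernel with the given local work size and blocks until it finishes.

```cpp
// Minimal sketch of brute-force workgroup tuning (not MACE's actual tuner).
// RunKernelOnce() is a hypothetical helper that enqueues the kernel with the
// given 2-D local work size and blocks until completion; it returns false if
// the device rejects that size.
#include <array>
#include <chrono>
#include <cstdio>
#include <limits>
#include <vector>

bool RunKernelOnce(const std::array<size_t, 2>& local_size);  // assumed helper

std::array<size_t, 2> TuneWorkgroupSize(
    const std::vector<std::array<size_t, 2>>& candidates, int repeats = 10) {
  std::array<size_t, 2> best{1, 1};
  double best_ms = std::numeric_limits<double>::max();
  for (const auto& ls : candidates) {
    if (!RunKernelOnce(ls)) continue;  // warm-up run; skip unsupported sizes
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) RunKernelOnce(ls);
    double ms = std::chrono::duration<double, std::milli>(
                    std::chrono::steady_clock::now() - start).count() / repeats;
    if (ms < best_ms) {
      best_ms = ms;
      best = ls;
    }
  }
  std::printf("best local size: %zu x %zu (%.3f ms)\n", best[0], best[1], best_ms);
  return best;  // the winner can be persisted offline and shipped with the model
}
```

The point of doing this offline is that the exhaustive search only has to run once per device/model pair, and the selected sizes can then be bundled with the deployed library.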
Yeah, that will be interesting. It would be extremely challenging given the large variety of Android devices and SoC chipsets. Looking forward to seeing the results. 👍
From the code, I found that the CPU benchmark uses the OpenMP default thread number, which should be 2 threads. Can you confirm the thread count used for the CPU benchmark?
@izp001
The CPU benchmark thread number is equal to the number of big cores of the CPU.
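For illustration only (this is not MACE's actual scheduling code), a common way to count the big cores on Android/Linux is to read each core's maximum frequency from sysfs, treat the highest-frequency cluster as the big cluster, and size the OpenMP thread pool to match:

```cpp
// Illustrative sketch: count big cores by maximum frequency and size the
// OpenMP thread pool accordingly. The sysfs paths are standard on Linux and
// Android; this is not MACE's actual code.
#include <omp.h>

#include <algorithm>
#include <cstdio>
#include <fstream>
#include <vector>

int CountBigCores() {
  std::vector<long> freqs;
  for (int cpu = 0;; ++cpu) {
    char path[128];
    std::snprintf(path, sizeof(path),
                  "/sys/devices/system/cpu/cpu%d/cpufreq/cpuinfo_max_freq", cpu);
    std::ifstream f(path);
    long khz = 0;
    if (!(f >> khz)) break;  // no such core: stop scanning
    freqs.push_back(khz);
  }
  if (freqs.empty()) return 1;  // fall back to a single thread
  long max_khz = *std::max_element(freqs.begin(), freqs.end());
  return static_cast<int>(std::count(freqs.begin(), freqs.end(), max_khz));
}

int main() {
  omp_set_num_threads(CountBigCores());  // run the benchmark on the big cluster
  std::printf("using %d OpenMP threads\n", omp_get_max_threads());
  return 0;
}
```

On a homogeneous CPU every core reports the same maximum frequency, so this degenerates to using all cores.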
It seems like CPU mode is much faster than GPU mode.
What's the reason to use the GPU on Android if it cannot accelerate inference?
@ligonzheng Only on some low-end SoCs is the CPU faster than the GPU. Usually the GPU is faster, or even much faster, than CPU mode. There are also other benefits, including power efficiency and multitasking (when the GPU is used for inference, the CPU remains free for other computations such as image processing algorithms).
Some other questions about using MACE:
1. What is the opencl_binary_file? I cannot find the OpenCL libraries in the builds directory. Can I pass null when using the GPU?
2. What is KVStorageFactory? Does KV mean kernel verbose?
3. Does MACE support reading the proto file from memory? It's not convenient to use the model file by passing a path on Android, and sometimes we also don't want to embed the model inside the code.
Thank you for your reply!
@ligonzheng
- Please read the documentation.
- KVStorage is used to store the built OpenCL binaries in order to speed up initialization and the first run (a generic sketch of this idea follows this list).
- We support converting the model to C++ code.
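To illustrate the idea behind that storage (this is a generic OpenCL technique, not MACE's KVStorage implementation), a compiled program binary can be dumped after the first build and reloaded on later runs with clCreateProgramWithBinary, which skips recompilation:

```cpp
// Generic sketch of OpenCL program-binary caching (not MACE's KVStorage).
// Persisting the compiled binary lets later runs recreate the program with
// clCreateProgramWithBinary instead of recompiling from source, which is what
// shortens initialization and first-run time. Assumes the program was built
// for a single device.
#include <CL/cl.h>

#include <fstream>
#include <iterator>
#include <vector>

bool SaveProgramBinary(cl_program program, const char* path) {
  size_t size = 0;
  if (clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size,
                       nullptr) != CL_SUCCESS || size == 0) {
    return false;
  }
  std::vector<unsigned char> binary(size);
  unsigned char* ptr = binary.data();
  if (clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(ptr), &ptr,
                       nullptr) != CL_SUCCESS) {
    return false;
  }
  std::ofstream out(path, std::ios::binary);
  out.write(reinterpret_cast<const char*>(binary.data()), binary.size());
  return out.good();
}

cl_program LoadProgramBinary(cl_context ctx, cl_device_id dev, const char* path) {
  std::ifstream in(path, std::ios::binary);
  std::vector<unsigned char> binary((std::istreambuf_iterator<char>(in)),
                                    std::istreambuf_iterator<char>());
  if (binary.empty()) return nullptr;
  const unsigned char* ptr = binary.data();
  size_t size = binary.size();
  cl_int status = CL_SUCCESS;
  cl_int err = CL_SUCCESS;
  cl_program program =
      clCreateProgramWithBinary(ctx, 1, &dev, &size, &ptr, &status, &err);
  // The caller still needs clBuildProgram() before creating kernels.
  return (err == CL_SUCCESS && status == CL_SUCCESS) ? program : nullptr;
}
```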
Happy to find out about this project, and thanks for sharing the benchmark results! I am wondering where your results would lie on the ReQuEST scoreboard.
Specifically for MobileNets v1/v2, are you using the baseline models (224-1.0)? It would be cool to add MACE to the ReQuEST MobileNets workflow and visualize the results.
This is awesome! Can you clarify what the "init" column represents in the benchmark?
@psyhtest The ReQuEST scoreboard looks great; we'll take some time to investigate how to do the integration.
@madhavajay The "init" column is the framework (engine) initialization time. On some devices this step can be slow, so we report statistics for it.
Oh right, and is run average in milliseconds per frame?
Yes, it's milliseconds per inference.
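For reference, here is a minimal sketch of how the three timing columns could be measured; `Engine` is a placeholder for whatever inference API is being benchmarked, not MACE's exact interface.

```cpp
// Sketch of measuring the init / warmup / run_avg columns in milliseconds.
// Engine::Init() and Engine::Run() are placeholders for whichever inference
// API is benchmarked, not MACE's exact interface.
#include <chrono>
#include <cstdio>

struct Engine {  // placeholder engine
  void Init() {}
  void Run() {}
};

static double MillisSince(std::chrono::steady_clock::time_point t0) {
  return std::chrono::duration<double, std::milli>(
             std::chrono::steady_clock::now() - t0).count();
}

int main() {
  Engine engine;

  auto t0 = std::chrono::steady_clock::now();
  engine.Init();  // "init": framework/engine initialization time
  double init_ms = MillisSince(t0);

  t0 = std::chrono::steady_clock::now();
  engine.Run();   // "warmup": first inference (allocations, kernel compilation)
  double warmup_ms = MillisSince(t0);

  const int rounds = 100;
  t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < rounds; ++i) engine.Run();
  double run_avg_ms = MillisSince(t0) / rounds;  // "run_avg": steady-state average

  std::printf("init %.3f ms, warmup %.3f ms, run_avg %.3f ms\n",
              init_ms, warmup_ms, run_avg_ms);
  return 0;
}
```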
amazing thanks!
@llhe does this work on any Raspberry Pi or TinkerBoard chips? I would love to see these performance numbers on those devices as well.
Hi, I have a question about the init and warmup times. When I test deeplab_v3_plus_mobilenet_v2 on a Mi Note 3, I get longer init and warmup times than yours. Do you know the reason? The MACE version is v0.8.1-140-gda931bf-20180717.
The benchmark I got is as below:

abi | runtime | init (ms) | warmup (ms)
---|---|---|---
armeabi-v7a | CPU | 48.913 | 1176.336
armeabi-v7a | GPU | 1331.048 | 2036.622
arm64-v8a | CPU | 88.716 | 983.319
arm64-v8a | GPU | 1430.722 | 1923.675

The official benchmark is as below:

abi | runtime | init (ms) | warmup (ms)
---|---|---|---
armeabi-v7a | CPU | 42.923 | 1184.366
armeabi-v7a | GPU | 76.65 | 517.194
arm64-v8a | CPU | 38.455 | 961.978
arm64-v8a | GPU | 76.591 | 516.302
@madhavajay Currently, cross compiling is only supported for Android (NDK). There is an unofficial fork which supports cross compiling to Linux.
@raninbowlalala Is it the first run of an OpenCL job after a reboot? A known issue is that on some Adreno devices, the first run is quite slow. We don't have any special settings for our test devices.
@llhe We may be able to help with CK integration for a couple of programs just to get started. Would you be interested in that?
We have already added a CK package for MACE.
@llhe I benchmarked the models on the OnePlus 3T platform. The performance of the quantized models is worse than the float models. Is this normal?
model_name | device_name | soc | abi | runtime | MACE | SNPE | NCNN | TFLITE |
---|---|---|---|---|---|---|---|---|
InceptionV3 | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 884.654 | 488.97 | 1616.671 | 730.468 |
InceptionV3 | ONEPLUS A3010 | msm8996 | arm64-v8a | DSP | 5.682 | | | |
InceptionV3 | ONEPLUS A3010 | msm8996 | arm64-v8a | GPU | 153.473 | 144.353 | | |
InceptionV3Quant | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 1014.662 | | | |
MobileNetV1 | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 52.004 | 702.713 | 43.301 | 101.273 |
MobileNetV1 | ONEPLUS A3010 | msm8996 | arm64-v8a | GPU | 23.833 | 24.228 | | |
MobileNetV1Quant | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 36.565 | 143.806 | | |
MobileNetV2 | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 40.742 | 415.117 | 29.985 | 56.101 |
MobileNetV2 | ONEPLUS A3010 | msm8996 | arm64-v8a | GPU | 16.403 | 14.566 | | |
MobileNetV2Quant | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 28.688 | 294.525 | | |
SqueezeNetV11 | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 37.404 | 61.414 | 22.325 | |
SqueezeNetV11 | ONEPLUS A3010 | msm8996 | arm64-v8a | GPU | 20.021 | 14.528 | | |
VGG16 | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 455.553 | 1416.414 | 477.352 | |
VGG16 | ONEPLUS A3010 | msm8996 | arm64-v8a | DSP | 137.22 | | | |
VGG16 | ONEPLUS A3010 | msm8996 | arm64-v8a | GPU | 208.335 | | | |