Benchmark results
Benchmark results of a previous version are available here:
model_name | device_name | soc | abi | runtime | init (ms) | warmup (ms) | run_avg (ms) | tuned
---|---|---|---|---|---|---|---|---
mobilenet_v2 | polaris | sdm845 | armeabi-v7a | GPU | 42.868 | 11.087 | 9.908 | True
mobilenet_v2 | MI MAX | msm8952 | armeabi-v7a | GPU | 122.791 | 43.038 | 39.875 | True
mobilenet_v2 | BKL-AL00 | kirin970 | armeabi-v7a | GPU | 767.932 | 1226.373 | 47.597 | True
mobilenet_v2 | polaris | sdm845 | arm64-v8a | GPU | 42.3 | 10.737 | 10.004 | True
mobilenet_v2 | MI MAX | msm8952 | arm64-v8a | GPU | 129.123 | 42.584 | 39.552 | True
mobilenet_v2 | BKL-AL00 | kirin970 | arm64-v8a | GPU | 753.43 | 1170.291 | 48.016 | True
mobilenet_v2 | polaris | sdm845 | armeabi-v7a | CPU | 16.035 | 69.761 | 41.627 | False
mobilenet_v2 | MI MAX | msm8952 | armeabi-v7a | CPU | 31.319 | 86.206 | 67.586 | False
mobilenet_v2 | BKL-AL00 | kirin970 | armeabi-v7a | CPU | 22.521 | 137.963 | 132.012 | False
mobilenet_v2 | polaris | sdm845 | arm64-v8a | CPU | 10.641 | 80.509 | 31.985 | False
mobilenet_v2 | MI MAX | msm8952 | arm64-v8a | CPU | 32.225 | 86.345 | 54.7 | False
mobilenet_v2 | BKL-AL00 | kirin970 | arm64-v8a | CPU | 20.208 | 97.295 | 93.987 | False
deeplab_v3_plus_mobilenet_v2 | polaris | sdm845 | armeabi-v7a | GPU | 56.512 | 129.422 | 128.976 | True
deeplab_v3_plus_mobilenet_v2 | MI MAX | msm8952 | armeabi-v7a | GPU | 145.582 | 899.824 | 896.452 | True
deeplab_v3_plus_mobilenet_v2 | BKL-AL00 | kirin970 | armeabi-v7a | GPU | 771.122 | 2096.33 | 651.999 | True
deeplab_v3_plus_mobilenet_v2 | polaris | sdm845 | armeabi-v7a | CPU | 34.084 | 951.812 | 932.764 | False
deeplab_v3_plus_mobilenet_v2 | MI MAX | msm8952 | armeabi-v7a | CPU | 91.383 | 1543.423 | 1628.255 | False
deeplab_v3_plus_mobilenet_v2 | BKL-AL00 | kirin970 | armeabi-v7a | CPU | 67.022 | 2885.098 | 2872.558 | False
deeplab_v3_plus_mobilenet_v2 | polaris | sdm845 | arm64-v8a | CPU | 29.376 | 656.16 | 614.679 | False
deeplab_v3_plus_mobilenet_v2 | MI MAX | msm8952 | arm64-v8a | CPU | 99.986 | 1170.636 | 1469.199 | False
deeplab_v3_plus_mobilenet_v2 | BKL-AL00 | kirin970 | arm64-v8a | CPU | 55.476 | 1796.491 | 1793.253 | False
mobilenet_v1 | polaris | sdm845 | armeabi-v7a | GPU | 45.551 | 13.858 | 13.544 | True
mobilenet_v1 | MI MAX | msm8952 | armeabi-v7a | GPU | 114.037 | 65.088 | 61.603 | True
mobilenet_v1 | BKL-AL00 | kirin970 | armeabi-v7a | GPU | 734.51 | 1211.078 | 49.318 | True
mobilenet_v1 | polaris | sdm845 | arm64-v8a | GPU | 45.378 | 13.689 | 12.826 | True
mobilenet_v1 | MI MAX | msm8952 | arm64-v8a | GPU | 110.526 | 64.566 | 61.696 | True
mobilenet_v1 | BKL-AL00 | kirin970 | arm64-v8a | GPU | 730.271 | 1135.675 | 48.124 | True
mobilenet_v1 | polaris | sdm845 | armeabi-v7a | CPU | 6.874 | 79.032 | 49.676 | False
mobilenet_v1 | MI MAX | msm8952 | armeabi-v7a | CPU | 18.332 | 121.923 | 88.207 | False
mobilenet_v1 | BKL-AL00 | kirin970 | armeabi-v7a | CPU | 13.0 | 172.239 | 164.469 | False
mobilenet_v1 | polaris | sdm845 | arm64-v8a | CPU | 11.347 | 90.748 | 32.888 | False
mobilenet_v1 | MI MAX | msm8952 | arm64-v8a | CPU | 18.358 | 113.023 | 71.16 | False
mobilenet_v1 | BKL-AL00 | kirin970 | arm64-v8a | CPU | 11.666 | 111.706 | 107.818 | False
resnet_v2_50 | polaris | sdm845 | armeabi-v7a | GPU | 124.229 | 95.537 | 93.047 | True
resnet_v2_50 | MI MAX | msm8952 | armeabi-v7a | GPU | 280.575 | 637.789 | 636.295 | True
resnet_v2_50 | BKL-AL00 | kirin970 | armeabi-v7a | GPU | 747.875 | 1596.039 | 450.651 | True
resnet_v2_50 | polaris | sdm845 | armeabi-v7a | CPU | 18.57 | 556.961 | 394.792 | False
resnet_v2_50 | MI MAX | msm8952 | armeabi-v7a | CPU | 44.175 | 1240.632 | 734.156 | False
resnet_v2_50 | BKL-AL00 | kirin970 | armeabi-v7a | CPU | 26.034 | 2505.979 | 1284.285 | False
resnet_v2_50 | polaris | sdm845 | arm64-v8a | CPU | 17.241 | 438.925 | 261.949 | False
resnet_v2_50 | MI MAX | msm8952 | arm64-v8a | CPU | 48.691 | 1143.032 | 566.313 | False
resnet_v2_50 | BKL-AL00 | kirin970 | arm64-v8a | CPU | 23.979 | 2169.373 | 499.587 | False
vgg16 | polaris | sdm845 | armeabi-v7a | CPU | 15.537 | 924.855 | 438.6 | False
vgg16 | MI MAX | msm8952 | armeabi-v7a | CPU | 40.055 | 2926.202 | 800.783 | False
vgg16 | BKL-AL00 | kirin970 | armeabi-v7a | CPU | 21.732 | 2514.862 | 1242.532 | False
vgg16 | polaris | sdm845 | arm64-v8a | CPU | 12.837 | 786.419 | 332.642 | False
vgg16 | MI MAX | msm8952 | arm64-v8a | CPU | 40.693 | 2794.225 | 666.285 | False
vgg16 | BKL-AL00 | kirin970 | arm64-v8a | CPU | 20.855 | 2581.558 | 1043.35 | False
vgg16 | polaris | sdm845 | armeabi-v7a | GPU | 679.21 | 128.214 | 125.523 | True
vgg16 | MI MAX | msm8952 | armeabi-v7a | GPU | 1527.823 | 806.779 | 761.073 | True
vgg16 | BKL-AL00 | kirin970 | armeabi-v7a | GPU | 1893.529 | 2551.389 | 1042.256 | True
inception_v3_dsp | polaris | sdm845 | armeabi-v7a | HEXAGON | 585.899 | 77.921 | 38.875 | False
inception_v3 | polaris | sdm845 | armeabi-v7a | CPU | 19.726 | 631.444 | 481.732 | False
inception_v3 | MI MAX | msm8952 | armeabi-v7a | CPU | 47.674 | 958.758 | 839.108 | False
inception_v3 | BKL-AL00 | kirin970 | armeabi-v7a | CPU | 29.131 | 760.945 | 1194.063 | False
inception_v3 | polaris | sdm845 | arm64-v8a | CPU | 22.251 | 578.611 | 425.145 | False
inception_v3 | MI MAX | msm8952 | arm64-v8a | CPU | 50.948 | 888.531 | 761.826 | False
inception_v3 | BKL-AL00 | kirin970 | arm64-v8a | CPU | 27.106 | 668.552 | 789.08 | False
inception_v3 | polaris | sdm845 | armeabi-v7a | GPU | 101.199 | 92.578 | 91.602 | True
inception_v3 | MI MAX | msm8952 | armeabi-v7a | GPU | 257.311 | 588.829 | 586.779 | True
inception_v3 | BKL-AL00 | kirin970 | armeabi-v7a | GPU | 770.779 | 1621.834 | 436.877 | True
squeezenet_v1_1 | polaris | sdm845 | armeabi-v7a | GPU | 33.615 | 10.905 | 10.971 | True
squeezenet_v1_1 | MI MAX | msm8952 | armeabi-v7a | GPU | 83.183 | 47.273 | 44.548 | True
squeezenet_v1_1 | BKL-AL00 | kirin970 | armeabi-v7a | GPU | 268.714 | 437.084 | 39.404 | True
squeezenet_v1_0 | polaris | sdm845 | armeabi-v7a | GPU | 45.145 | 16.719 | 15.0 | True
squeezenet_v1_0 | MI MAX | msm8952 | armeabi-v7a | GPU | 98.571 | 76.282 | 72.081 | True
squeezenet_v1_0 | BKL-AL00 | kirin970 | armeabi-v7a | GPU | 403.515 | 1165.101 | 63.392 | True
squeezenet_v1_0 | polaris | sdm845 | armeabi-v7a | CPU | 7.393 | 94.284 | 60.057 | False
squeezenet_v1_0 | MI MAX | msm8952 | armeabi-v7a | CPU | 27.664 | 171.195 | 110.325 | False
squeezenet_v1_0 | BKL-AL00 | kirin970 | armeabi-v7a | CPU | 14.84 | 169.715 | 93.174 | False
squeezenet_v1_0 | polaris | sdm845 | arm64-v8a | CPU | 11.9 | 117.696 | 49.342 | False
squeezenet_v1_0 | MI MAX | msm8952 | arm64-v8a | CPU | 27.554 | 170.987 | 95.552 | False
squeezenet_v1_0 | BKL-AL00 | kirin970 | arm64-v8a | CPU | 13.76 | 121.544 | 79.353 | False
squeezenet_v1_1 | polaris | sdm845 | arm64-v8a | CPU | 9.583 | 61.783 | 25.376 | False
squeezenet_v1_1 | MI MAX | msm8952 | arm64-v8a | CPU | 21.424 | 98.661 | 53.031 | False
squeezenet_v1_1 | BKL-AL00 | kirin970 | arm64-v8a | CPU | 11.005 | 67.381 | 41.086 | False
More recent results will be available on the GitLab mirror project's CI page soon.
A dedicated mobile-device deep learning framework benchmark project, MobileAIBench, is available here: https://github.com/XiaoMi/mobile-ai-bench
:+1:
The daily benchmark results are available here:
- https://gitlab.com/llhe/mace-models/pipelines
- 2018/06/29 https://gitlab.com/llhe/mace-models/-/jobs/78152526
I really appreciate the results, but I am curious why the DSP result is only available for inception_v3?
@DiamonJoy The benchmark is actually the CI result of the MACE Model Zoo project. Until now, our efforts have mainly focused on the float data type and the CPU/GPU runtimes, and we have not had enough time to add more quantized models to the MACE Model Zoo. Quantization (CPU or DSP) support and adding more models to the MACE Model Zoo are on our roadmap.
@llhe Amazing results! Can you explain a little more about the "tuned" column?
Tuned means the OpenCL kernel is tuned for the specific type of device instead of using the general rule.
Is this tuning process done manually offline, or is it done automatically at run time? If I understand correctly, is it mainly work-group size tuning?
@robertwgh In our original use case, we deploy each model against a specific device (usually a new product), so we want it to be ultimately optimized by brute-force search over a list of workgroup options. However, general application developers usually want to generate a library that applies to all devices.
It's offline now. We may consider improving the general rule or enabling online incremental tuning in the future.
Incorporating more advanced rules, such as ML-based models, is also a potential choice.
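To make the brute-force approach concrete, here is a minimal sketch of workgroup-size tuning. It is not MACE's actual tuner; `RunKernelOnce` is a hypothetical helper that enqueues the OpenCL kernel with the given local work size and blocks until it finishes.

```cpp
// Minimal sketch of brute-force workgroup tuning (not MACE's actual tuner).
// RunKernelOnce() is a hypothetical helper that enqueues the kernel with the
// given 2-D local work size and blocks until completion; it returns false if
// the device rejects that size.
#include <array>
#include <chrono>
#include <cstdio>
#include <limits>
#include <vector>

bool RunKernelOnce(const std::array<size_t, 2>& local_size);  // assumed helper

std::array<size_t, 2> TuneWorkgroupSize(
    const std::vector<std::array<size_t, 2>>& candidates, int repeats = 10) {
  std::array<size_t, 2> best{1, 1};
  double best_ms = std::numeric_limits<double>::max();
  for (const auto& ls : candidates) {
    if (!RunKernelOnce(ls)) continue;  // warm-up run; skip unsupported sizes
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) RunKernelOnce(ls);
    double ms = std::chrono::duration<double, std::milli>(
                    std::chrono::steady_clock::now() - start).count() / repeats;
    if (ms < best_ms) {
      best_ms = ms;
      best = ls;
    }
  }
  std::printf("best local size: %zu x %zu (%.3f ms)\n", best[0], best[1], best_ms);
  return best;  // the winner can be persisted offline and shipped with the model
}
```

The point of doing this offline is that the exhaustive search only has to run once per device/model pair, and the selected sizes can then be bundled with the deployed library.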
Yeah, that will be interesting. It would be extremely challenging given the large variety of Android devices and SoC chipsets. Looking forward to seeing the results. 👍
From the code, I found that the CPU benchmark uses the OpenMP default thread number, which should be 2 threads. Can you confirm the thread count used for the CPU benchmark?
@izp001
The CPU benchmark thread number is equal to the number of big cores of the CPU.
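For illustration only (this is not MACE's actual scheduling code), a common way to count the big cores on Android/Linux is to read each core's maximum frequency from sysfs, treat the highest-frequency cluster as the big cluster, and size the OpenMP thread pool to match:

```cpp
// Illustrative sketch: count big cores by maximum frequency and size the
// OpenMP thread pool accordingly. The sysfs paths are standard on Linux and
// Android; this is not MACE's actual code.
#include <omp.h>

#include <algorithm>
#include <cstdio>
#include <fstream>
#include <vector>

int CountBigCores() {
  std::vector<long> freqs;
  for (int cpu = 0;; ++cpu) {
    char path[128];
    std::snprintf(path, sizeof(path),
                  "/sys/devices/system/cpu/cpu%d/cpufreq/cpuinfo_max_freq", cpu);
    std::ifstream f(path);
    long khz = 0;
    if (!(f >> khz)) break;  // no such core: stop scanning
    freqs.push_back(khz);
  }
  if (freqs.empty()) return 1;  // fall back to a single thread
  long max_khz = *std::max_element(freqs.begin(), freqs.end());
  return static_cast<int>(std::count(freqs.begin(), freqs.end(), max_khz));
}

int main() {
  omp_set_num_threads(CountBigCores());  // run the benchmark on the big cluster
  std::printf("using %d OpenMP threads\n", omp_get_max_threads());
  return 0;
}
```

On a homogeneous CPU every core reports the same maximum frequency, so this degenerates to using all cores.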
It seems like CPU mode is much faster than GPU mode.
What's the reason to use the GPU on Android if it cannot accelerate inference?
@ligonzheng Only on some low-end SoCs is the CPU faster than the GPU. Usually the GPU is faster, or even much faster, than CPU mode. There are also other benefits, including power efficiency and multitasking (when the GPU is used for inference, the CPU remains free for other computations such as image processing algorithms).
Some other questions about using MACE:
1. What is the opencl_binary_file? I cannot find the OpenCL libraries in the builds directory. Can I pass null when using the GPU?
2. What is KVStorageFactory? Does KV mean kernel verbose?
3. Does MACE support reading the proto file from memory? It's not convenient to use the model file by passing a path on Android, and sometimes we also don't want to embed the model inside the code.
Thank you for your reply!
@ligonzheng
- Please read the documentation.
- KVStorage is used to store the built OpenCL binaries in order to speed up initialization and the first run (a generic sketch of this idea follows this list).
- We support converting the model to C++ code.
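To illustrate the idea behind that storage (this is a generic OpenCL technique, not MACE's KVStorage implementation), a compiled program binary can be dumped after the first build and reloaded on later runs with clCreateProgramWithBinary, which skips recompilation:

```cpp
// Generic sketch of OpenCL program-binary caching (not MACE's KVStorage).
// Persisting the compiled binary lets later runs recreate the program with
// clCreateProgramWithBinary instead of recompiling from source, which is what
// shortens initialization and first-run time. Assumes the program was built
// for a single device.
#include <CL/cl.h>

#include <fstream>
#include <iterator>
#include <vector>

bool SaveProgramBinary(cl_program program, const char* path) {
  size_t size = 0;
  if (clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size), &size,
                       nullptr) != CL_SUCCESS || size == 0) {
    return false;
  }
  std::vector<unsigned char> binary(size);
  unsigned char* ptr = binary.data();
  if (clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(ptr), &ptr,
                       nullptr) != CL_SUCCESS) {
    return false;
  }
  std::ofstream out(path, std::ios::binary);
  out.write(reinterpret_cast<const char*>(binary.data()), binary.size());
  return out.good();
}

cl_program LoadProgramBinary(cl_context ctx, cl_device_id dev, const char* path) {
  std::ifstream in(path, std::ios::binary);
  std::vector<unsigned char> binary((std::istreambuf_iterator<char>(in)),
                                    std::istreambuf_iterator<char>());
  if (binary.empty()) return nullptr;
  const unsigned char* ptr = binary.data();
  size_t size = binary.size();
  cl_int status = CL_SUCCESS;
  cl_int err = CL_SUCCESS;
  cl_program program =
      clCreateProgramWithBinary(ctx, 1, &dev, &size, &ptr, &status, &err);
  // The caller still needs clBuildProgram() before creating kernels.
  return (err == CL_SUCCESS && status == CL_SUCCESS) ? program : nullptr;
}
```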
Happy to find out about this project, and thanks for sharing the benchmark results! I am wondering where your results would lie on the ReQuEST scoreboard.
Specifically for MobileNets v1/v2, are you using the baseline models (224-1.0)? It would be cool to add MACE to the ReQuEST MobileNets workflow and visualize the results.
This is awesome! Can you clarify what the "init" column represents in the benchmark?
@psyhtest The ReQuEST scoreboard looks great; we'll take some time to investigate how to do the integration.
@madhavajay The "init" column is the framework (engine) initialization time. On some devices this step can be slow, so we report statistics for it.
Oh right, and is run average in milliseconds per frame?
Yes, it's milliseconds per inference.
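For reference, here is a minimal sketch of how the three timing columns could be measured; `Engine` is a placeholder for whatever inference API is being benchmarked, not MACE's exact interface.

```cpp
// Sketch of measuring the init / warmup / run_avg columns in milliseconds.
// Engine::Init() and Engine::Run() are placeholders for whichever inference
// API is benchmarked, not MACE's exact interface.
#include <chrono>
#include <cstdio>

struct Engine {  // placeholder engine
  void Init() {}
  void Run() {}
};

static double MillisSince(std::chrono::steady_clock::time_point t0) {
  return std::chrono::duration<double, std::milli>(
             std::chrono::steady_clock::now() - t0).count();
}

int main() {
  Engine engine;

  auto t0 = std::chrono::steady_clock::now();
  engine.Init();  // "init": framework/engine initialization time
  double init_ms = MillisSince(t0);

  t0 = std::chrono::steady_clock::now();
  engine.Run();   // "warmup": first inference (allocations, kernel compilation)
  double warmup_ms = MillisSince(t0);

  const int rounds = 100;
  t0 = std::chrono::steady_clock::now();
  for (int i = 0; i < rounds; ++i) engine.Run();
  double run_avg_ms = MillisSince(t0) / rounds;  // "run_avg": steady-state average

  std::printf("init %.3f ms, warmup %.3f ms, run_avg %.3f ms\n",
              init_ms, warmup_ms, run_avg_ms);
  return 0;
}
```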
amazing thanks!
@llhe does this work on any Raspberry Pi or TinkerBoard chips? I would love to see these performance numbers on those devices as well.
Hi, I have a question about the init and warmup times. When I test deeplab_v3_plus_mobilenet_v2 on a Mi Note 3, I get longer init and warmup times than yours. Do you know the reason? The MACE version is v0.8.1-140-gda931bf-20180717.
The benchmark I got is as below:

abi | runtime | init (ms) | warmup (ms)
---|---|---|---
armeabi-v7a | CPU | 48.913 | 1176.336
armeabi-v7a | GPU | 1331.048 | 2036.622
arm64-v8a | CPU | 88.716 | 983.319
arm64-v8a | GPU | 1430.722 | 1923.675

The official benchmark is as below:

abi | runtime | init (ms) | warmup (ms)
---|---|---|---
armeabi-v7a | CPU | 42.923 | 1184.366
armeabi-v7a | GPU | 76.65 | 517.194
arm64-v8a | CPU | 38.455 | 961.978
arm64-v8a | GPU | 76.591 | 516.302
@madhavajay Currently, cross compiling is only supported for Android (NDK). There is an unofficial fork which supports cross compiling to Linux.
@raninbowlalala Is it the first run of an OpenCL job after a reboot? A known issue is that on some Adreno devices, the first run is quite slow. We don't have any special settings for our test devices.
@llhe We may be able to help with CK integration for a couple of programs just to get started. Would you be interested in that?
We have already added a CK package for MACE.
@llhe I benchmarked the models on the OnePlus 3T platform. The performance of the quantized models is worse than the float models. Is this normal?
model_name | device_name | soc | abi | runtime | MACE | SNPE | NCNN | TFLITE |
---|---|---|---|---|---|---|---|---|
InceptionV3 | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 884.654 | 488.97 | 1616.671 | 730.468 |
InceptionV3 | ONEPLUS A3010 | msm8996 | arm64-v8a | DSP | 5.682 | | | |
InceptionV3 | ONEPLUS A3010 | msm8996 | arm64-v8a | GPU | 153.473 | 144.353 | | |
InceptionV3Quant | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 1014.662 | | | |
MobileNetV1 | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 52.004 | 702.713 | 43.301 | 101.273 |
MobileNetV1 | ONEPLUS A3010 | msm8996 | arm64-v8a | GPU | 23.833 | 24.228 | | |
MobileNetV1Quant | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 36.565 | 143.806 | | |
MobileNetV2 | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 40.742 | 415.117 | 29.985 | 56.101 |
MobileNetV2 | ONEPLUS A3010 | msm8996 | arm64-v8a | GPU | 16.403 | 14.566 | | |
MobileNetV2Quant | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 28.688 | 294.525 | | |
SqueezeNetV11 | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 37.404 | 61.414 | 22.325 | |
SqueezeNetV11 | ONEPLUS A3010 | msm8996 | arm64-v8a | GPU | 20.021 | 14.528 | | |
VGG16 | ONEPLUS A3010 | msm8996 | arm64-v8a | CPU | 455.553 | 1416.414 | 477.352 | |
VGG16 | ONEPLUS A3010 | msm8996 | arm64-v8a | DSP | 137.22 | | | |
VGG16 | ONEPLUS A3010 | msm8996 | arm64-v8a | GPU | 208.335 | | | |