
[mobile performance master issue]

Open Yangqing opened this issue 7 years ago • 9 comments

Why?

Building anything on mobile is challenging because of the many tricks needed to produce optimized builds. This issue aims to be a master tracking point for anyone who encounters mobile build and performance problems.

As you might have expected, at Facebook we use buck to build our code. We know that such a build system is not ideal for the open source world, so we worked hard to standardize on CMake as the blessed open source build tool for all platforms (thanks @slayton58 for the suffering!). We tried really hard to make sure that all platforms build correctly, but we are asking you to help us verify correctness. Here are a few build and execution checkpoints.

Always check whether you have built with NEON

You should always have -mfpu=neon -mfloat-abi=[softfp|hard] in your build flags. If things are slow, double-check this first.
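
One quick way to check is to scan the generated build configuration for the NEON flag. The sketch below is a minimal helper, assuming the flags end up in a text file such as the CMake cache (the path in the usage comment is an example, not a guaranteed location):

```python
# Sketch: confirm the NEON flags actually reached your build by scanning the
# CMake cache. The cache path in the usage comment is an assumption; point it
# at your own build directory.
import re

def neon_flags_present(flags_text):
    """Return True if the compiler flag string requests NEON code generation."""
    return bool(re.search(r"-mfpu=neon\b", flags_text))

# Usage (path is an example):
#   neon_flags_present(open("build/CMakeCache.txt").read())
```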

Check if you have built and used NNPACK

Check that USE_NNPACK is ON in the CMake summary printout. If it is not, you may need to fix the NNPACK build; its CMake script is located here: https://github.com/caffe2/caffe2/blob/master/cmake/External/nnpack.cmake .

For operators such as convolution, you should then set engine to "NNPACK" in your protobuf - we use this on mobile to achieve good performance on CNNs.
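
To make the "set engine in your protobuf" step concrete, here is a framework-free sketch: in Caffe2, each operator in the NetDef protobuf carries an `engine` field, and convolutions dispatch to NNPACK when it is set to "NNPACK". Plain dicts stand in for OperatorDef messages below so the idea runs without Caffe2 installed; with the real library, the same loop would iterate over `net.op` of a parsed NetDef and assign `op.engine = "NNPACK"`.

```python
# Sketch only: dicts stand in for Caffe2 OperatorDef protobuf messages.
def use_nnpack(ops, conv_types=("Conv", "ConvTranspose")):
    """Return a copy of ops with engine set to NNPACK on convolution ops."""
    patched = []
    for op in ops:
        op = dict(op)  # copy, so the caller's net is not mutated
        if op.get("type") in conv_types:
            op["engine"] = "NNPACK"
        patched.append(op)
    return patched
```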

For binary size controls, use a whitelist of source files

Due to the modular nature of the library, if you are only running specific models (such as CNNs), you don't need the full set of source files, especially those under operators. At Facebook we explicitly whitelist the files that we build for mobile. On the OSS side, we added a CMake option, CAFFE2_WHITELIST, to mirror this mechanism. We haven't tested it extensively in OSS, so if you'd like to give it a try and write down your steps for others, that would be really great!
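
The idea behind the whitelist can be sketched as a glob filter over the source list; note the patterns and file names below are made-up illustrations, not the actual CAFFE2_WHITELIST syntax:

```python
# Illustrative sketch of the whitelist idea: keep only the operator sources
# your model actually needs. Patterns and file names are hypothetical.
from fnmatch import fnmatch

def whitelist_sources(all_sources, patterns):
    """Keep sources matching at least one glob pattern."""
    return [s for s in all_sources if any(fnmatch(s, p) for p in patterns)]

# Example: a CNN-only build might keep conv/relu sources and drop the rest.
```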

android-cmake and ios-cmake

We use these two CMake plugins to build the Android and iOS libraries. For Android, the binaries can be run directly via adb shell. When building your own libraries, make sure to include all the .a files, and also use whole static library linking so that the statically registered operators are not stripped by the linker.
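
To illustrate what "whole static library linking" means here: Caffe2 operators are registered via static initializers, and the linker drops unreferenced objects from .a archives unless each archive is wrapped in whole-archive flags. The helper below builds GNU-ld style arguments as a sketch; the flag spellings are for GNU ld (Apple's linker uses -force_load instead), and the library name in the test is an example:

```python
# Sketch: wrap each static library in --whole-archive/--no-whole-archive so
# the linker keeps objects that are only referenced via static initializers.
# Flag spellings are GNU ld; Apple's ld uses -force_load instead.
def whole_archive_args(static_libs):
    """Build GNU-ld link arguments that force-include each .a archive."""
    args = []
    for lib in static_libs:
        args += ["-Wl,--whole-archive", lib, "-Wl,--no-whole-archive"]
    return args
```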

An example android app build

Check out https://github.com/bwasti/AICamera

Yangqing avatar Apr 19 '17 08:04 Yangqing

I have built and run the Android demo on an LG G5c (a relatively low-end phone in 2015). It takes 5 seconds to process one frame, while the official TensorFlow demo takes 0.8 seconds. To be fair, TensorFlow is running Inception while the AICamera demo is running SqueezeNet, so I may be comparing apples to oranges.

How can I verify that the above optimization tips are correctly applied in the Android/iOS build? For example, how do I verify that NNPACK is used? (Gradle hides the CMake stderr.)

Thanks.

BrianOn99 avatar Apr 19 '17 10:04 BrianOn99

Does or will Caffe2 support OpenCL on Android? I was told that the original Caffe can be compiled with OpenCL and run on ARMv8 devices (at least it is shown in this online comparison of Caffe vs TensorFlow classification performance for Android devices at http://cknowledge.org/repo , though I did not dig further into whether they use platform-specific optimizations). By the way, I asked those guys if they plan to add Caffe2 at https://github.com/dividiti/ck-caffe/issues/102 .

mlosab3 avatar Apr 19 '17 13:04 mlosab3

/cc @bwasti for @BrianOn99's question - I think the example app should be able to build nnpack off the shelf, but nnpack was added after the demo app.

I don't think the model currently uses NNPACK yet, @bwasti could you confirm?

Yangqing avatar Apr 19 '17 19:04 Yangqing

@mlosab3 - Many thanks for bringing this topic to our attention! You are right - thanks to the community effort led by @naibaf7, Caffe1 does support OpenCL.

And we do support building Caffe1 for Android with OpenCL - as well as many other variants - using our disruptive open-source Collective Knowledge technology (CK). The cknowledge.org/repo page currently shows results from our Android app for crowdsourcing DNN optimization across mobile devices with different engines (deep learning frameworks, compute libraries and network models). Our goal is to optimize DNN across the whole SW/HW stack: from network models (e.g. for object classification and detection) all the way to hardware (CPUs, DSPs, GPUs, custom accelerators).

@Yangqing: if this is of any interest, we would be happy to collaborate on CK-Caffe2, bringing many of the same benefits of CK-Caffe to the Caffe2 community.

As a bit of background, we created CK to manage the complexity of benchmarking and optimizing ever-evolving computing systems (SW+HW) with an ever-growing number of optimization choices. Our customers and partners, including General Motors and ARM, use CK to evaluate and optimize emerging workloads such as deep learning across diverse data inputs and HW/SW platforms. The key ingredients of our Collective Knowledge approach are composability, automation, crowd-source-ability, and reproducibility:

  1. Composability allows us to assemble optimized solutions from reusable components akin to playing with LEGO bricks. For example, we can build deep learning frameworks (e.g. Caffe1, TensorFlow), with different compute libraries (e.g. OpenBLAS, cuBLAS, CLBlast), for different operating systems (Linux, Android, Windows).

  2. Automation allows us - with minimal effort - to perform thousands of experiments, aggregate experimental results in private or public repositories, and use predictive analytics to continuously refine the best SW/HW solution (in terms of speed, quality, power consumption, resource usage, cost, etc.) and detect unexpected behaviour (e.g. performance issues).

  3. Crowd-source-ability allows us to distribute experimentation across machines, for example, supercomputer nodes or mobile phones; data inputs (e.g. network models); etc. - helping to quickly uncover non-obvious optimization opportunities.

  4. Reproducibility ensures that we collect experimental data that can be trusted when obtaining valuable technical insights or making critical business decisions.

Again, if this is of interest, we are always happy to discuss.

psyhtest avatar Apr 20 '17 14:04 psyhtest

Will Caffe2 provide a compiler to optimize platform-specific performance automatically?

futurely avatar Apr 25 '17 05:04 futurely

@BrianOn99 the demo app is currently in a poor state, as its build system is quite dated compared to the current version of Caffe2. I believe the -O3 optimization flag isn't even being invoked, and I don't think NNPACK is being used.

bwasti avatar Apr 26 '17 16:04 bwasti

@bwasti is there a plan to update the demo, or could you give some hints to fix the above problems?

BrianOn99 avatar Apr 27 '17 02:04 BrianOn99

Is there any example of using the whitelist option? Thanks @bwasti

lzx1413 avatar May 02 '17 05:05 lzx1413

@bwasti Any plans to support the ARM Compute Library? For example, there's such a project for Caffe1. And apparently for TensorFlow.

psyhtest avatar Jul 11 '17 08:07 psyhtest