Investigate GPUs on ARM boards
We have ARM and ARM64 (#1305) arriving on RPi3B+ and LePotato boards. We should try and see if we can get GPU acceleration there:
- OpenCL can work not too badly on Intel GPUs (cf. the `ccpp` branch) using ComputeCpp
- LePotato has a Mali-450 GPU; no idea yet how much we can expect of OpenCL there
- RPi3B (and the `+` model) is now OpenCL 1.2 Embedded Profile compliant (https://github.com/doe300/VC4CL#opencl-support), and ComputeCpp has ARMv7 binaries targeting Ubuntu 14.04, so we might be able to get things working there (quick sanity check below).
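Before anything DeepSpeech-specific, it may be worth checking what OpenCL stack a given board exposes at all. A minimal sketch, assuming Debian/Raspbian with the stock `clinfo` package and the board's OpenCL ICD (e.g. VC4CL on RPi) already installed:

```
# Sketch: verify that an OpenCL platform/device is visible at all before
# touching TensorFlow. Assumes Debian/Raspbian with the clinfo package and
# the board's OpenCL ICD (e.g. VC4CL) already installed.
sudo apt-get install clinfo
sudo clinfo | grep -E 'Platform Name|Device Name|Device Version|Device Profile'
```

If no device shows up here, ComputeCpp won't see one either, since it goes through the same ICD loader.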
The Rock64Pro-AI has a Mali-T860; I would consider this option.
@edubois You're welcome to explore that, but I'll stick to what I have right now :)
Will do.
You state ARM boards but then mention Intel GPUs. Therefore, I would offer to do some testing on the UP board, which is powered by a 4-core Atom x5-Z8350, has 4 GB of RAM, and incorporates Cherry Trail HD graphics with 12 execution units. We are using that board since it is compatible with the Matrix Voice, our hardware for sound capturing.
We are also looking at the Intel Movidius Neural Compute Stick, which is compatible with TensorFlow and the Raspberry Pi 3, since the combination would be more cost-effective.
The Rock64Pro-AI looks interesting as well and would be even cheaper. I'm curious about @edubois's results.
@renepeinl I have actually been experimenting for quite some time with Intel GPUs on my laptop, debugging and checking performance (thanks to CodePlay people and Intel people), so I know we can get it working with the "Neo" driver, which sadly is far from being released yet. The Compute Stick is useless in our case because of RNNs. The previous driver, Beignet, was a dead end: not working with ComputeCpp (the layer TensorFlow uses for OpenCL), and no longer actively developed by Intel.
People who want to experiment should use the `ccpp` branch of our TensorFlow and DeepSpeech repos, but be aware it's a hack in progress :)
Thanks for this information. Could you provide some hints on compiler flags for building the software as well? I'm not sure how much influence they have, and we are mainly Java developers with no deeper knowledge of C++.
@renepeinl It should be pretty simple if you follow the docs we have in place and TensorFlow's building doc. For the ComputeCpp branch, you'll need to download the matching ComputeCpp version. Since it's a hack in progress I have not documented that, but you can look at the `tc-*.sh` shell scripts in our TensorFlow repo; they should contain everything. Basically: Bazel v0.10.0, ComputeCpp 0.5.1 (I think?), and the proper `./configure` flags (check `tc-vars.sh` mostly). On the DeepSpeech build side, nothing should change whether it's OpenCL or not.
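For the curious, a rough, non-interactive sketch of what that configure step amounted to; the variable names match TensorFlow 1.x's `./configure`, but the paths and versions here are assumptions, and `tc-vars.sh` remains the authoritative source:

```
# Sketch only: non-interactive answers for TensorFlow 1.x's ./configure
# on the ccpp branch. Paths are placeholders; see tc-vars.sh for the
# flags actually used on our CI.
export TF_NEED_OPENCL_SYCL=1                                # ComputeCpp/SYCL path
export COMPUTECPP_TOOLKIT_PATH="$HOME/ComputeCpp-CE-0.5.1"  # assumed install dir
export TF_NEED_CUDA=0
./configure
bazel build --config=sycl //tensorflow:libtensorflow.so
```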
@renepeinl If you run into issues, you can join us on IRC (#machinelearning on irc.mozilla.org) or on Discourse: https://discourse.mozilla.org/c/deep-speech
@renepeinl, the chip is not yet available; I will start when I get one, probably in September.
Good first milestone on RPi3:
- I have llvm-spirv cross-built
- I have been able to cross-compile vc4c, vc4clstdlib and vc4cl bits
- Debian packages install properly on Raspbian
- ComputeCpp 0.7.0 for Ubuntu 14.04 / ARM32 shows VC4 GPU
- Currently running the VC4C testsuite; there are failures, but there are successes as well, meaning the basics of the infrastructure are there and working (rough build outline below)
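For anyone wanting to reproduce this, a hypothetical outline of the build; the three upstream repos are real, but the toolchain file and options below are placeholders for whatever cross setup you use (check each project's README for the actual flags):

```
# Hypothetical outline of the cross-build above. The repos are upstream
# (github.com/doe300), but the toolchain file path below is a placeholder.
for repo in VC4CLStdLib VC4C VC4CL; do
  git clone https://github.com/doe300/$repo
done
# Build order matters: stdlib first, then the compiler, then the runtime.
cd VC4C && mkdir build && cd build
cmake .. -DCMAKE_TOOLCHAIN_FILE=../cmake/rpi-toolchain.cmake  # placeholder
make -j4 && make package   # yields the Debian packages installed on Raspbian
```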
```
$ sudo ComputeCpp-CE-0.7.0-Ubuntu-14.04-ARM_32/bin/computecpp_info --verbose --use-spirv
********************************************************************************
ComputeCpp Info (CE 0.7.0)
********************************************************************************
Toolchain information:
GLIBC version: 2.24
GLIBCXX: 20150426
This version of libstdc++ is supported.
********************************************************************************
Device Info:
Discovered 1 devices matching:
platform : <any>
device type : <any>
--------------------------------------------------------------------------------
Device 0:
Device is supported : NO - Device does not support SPIR
CL_DEVICE_NAME : VideoCore IV GPU
CL_DEVICE_VENDOR : Broadcom
CL_DRIVER_VERSION : 0.4
CL_DEVICE_TYPE : CL_DEVICE_TYPE_GPU
CL_DEVICE_VERSION : OpenCL 1.2 VC4CL 0.4
CL_DEVICE_PROFILE : EMBEDDED_PROFILE
CL_DEVICE_MAX_COMPUTE_UNITS : 1
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS : 3
CL_DEVICE_MAX_WORK_ITEM_SIZES : 12 / 12 / 12
CL_DEVICE_MAX_WORK_GROUP_SIZE : 12
CL_DEVICE_MAX_CLOCK_FREQUENCY : 300 MHz
CL_DEVICE_ADDRESS_BITS : 32
CL_DEVICE_HOST_UNIFIED_MEMORY : YES
CL_DEVICE_MAX_MEM_ALLOC_SIZE : 76 MByte
CL_DEVICE_GLOBAL_MEM_SIZE : 76 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT : NO
CL_DEVICE_LOCAL_MEM_TYPE : global
CL_DEVICE_LOCAL_MEM_SIZE : 77824 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE : 77824 KByte
CL_DEVICE_QUEUE_PROPERTIES : CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT : NO
CL_DEVICE_MAX_READ_IMAGE_ARGS : 64
CL_DEVICE_MAX_WRITE_IMAGE_ARGS : 64
CL_DEVICE_IMAGE2D_MAX_WIDTH : 2048
CL_DEVICE_IMAGE2D_MAX_HEIGHT : 2048
CL_DEVICE_IMAGE3D_MAX_WIDTH : 2048
CL_DEVICE_IMAGE3D_MAX_HEIGHT : 2048
CL_DEVICE_IMAGE3D_MAX_DEPTH : 2048
CL_DEVICE_PREFERRED_VECTOR_WIDTH : CHAR 16 SHORT 16 INT 16 LONG 0 FLOAT 16 DOUBLE 0
CL_DEVICE_EXTENSIONS : cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_nv_pragma_unroll cl_arm_core_id cl_ext_atomic_counters_32 cl_khr_initialize_memory
If you encounter problems when using any of these OpenCL devices, please consult
this website for known issues:
https://computecpp.codeplay.com/releases/v0.7.0/platform-support-notes
********************************************************************************
```
Some samples from the `TestVC4C` testsuite:
```
$ sudo LD_LIBRARY_PATH=$(pwd):$LD_LIBRARY_PATH ./TestVC4C
./example/fft2_2.cl
./example/fibonacci.cl
./example/fibonacci.spt
[W] Fri May 18 15:26:19 2018: Failed to remove empty basic block: label: %13
[W] Fri May 18 15:26:19 2018: Block has explicit predecessor: br %13
./example/hello_world.cl
./example/hello_world_vector.cl
./example/test.cl
[W] Fri May 18 15:26:27 2018: Warnings in precompilation:
[W] Fri May 18 15:26:27 2018: ./example/test.cl:27:6: warning: incompatible pointer to integer conversion initializing 'int' with an expression of type 'int *'; remove &
int n = &i;
^ ~~
1 warning generated.
./example/test_instructions.cl
./example/test_prime.cl
```
After fighting with TensorFlow's ComputeCpp branch to cross-compile for RPi3, I got something built. It's running as of now; no idea what we can expect so far, both in terms of output and in terms of speed:
```
pi@rpi3-opencl-20180518:~/deepspeech $ sudo ./deepspeech ~/tmp/deepspeech/models/tf14.frozen.494_e120.LSTM.ldc93s1.pb ~/tmp/deepspeech/models/alphabet.txt ~/tmp/deepspeech/audio/ -t
TensorFlow: v1.8.0-rc1-1904-g9989353054
DeepSpeech: v0.2.0-alpha.5-0-g7cc8382
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-05-23 13:54:53.988212: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-05-23 13:54:53.988803: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: VideoCore IV GPU, vendor: Broadcom, profile: EMBEDDED_PROFILE
Running on directory /home/pi/tmp/deepspeech/audio/
> /home/pi/tmp/deepspeech/audio//2830-3980-0043.wav
2018-05-23 13:54:54.205643: W ./tensorflow/core/framework/allocator.cc:108] Allocation of 11713728 exceeds 10% of system memory.
2018-05-23 13:54:54.227152: W ./tensorflow/core/framework/allocator.cc:108] Allocation of 11713728 exceeds 10% of system memory.
2018-05-23 13:54:54.279121: W ./tensorflow/core/framework/allocator.cc:108] Allocation of 11713728 exceeds 10% of system memory.
2018-05-23 13:54:54.300364: W ./tensorflow/core/framework/allocator.cc:108] Allocation of 11713728 exceeds 10% of system memory.
2018-05-23 13:54:54.322776: W ./tensorflow/core/framework/allocator.cc:108] Allocation of 11713728 exceeds 10% of system memory.
```
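Side note on the "reading entire model file into memory" warning above: TensorFlow 1.x ships a contrib tool to convert a frozen graph to the memory-mapped format. A sketch, reusing the model file name from the run above (the output name is an assumption):

```
# Sketch: convert the frozen graph to TensorFlow's memory-mapped format to
# silence the warning above and cut heap usage. The tool lives in TF 1.x
# contrib; the output file name is an assumption.
bazel build -c opt tensorflow/contrib/util:convert_graphdef_memmapped_format
bazel-bin/tensorflow/contrib/util/convert_graphdef_memmapped_format \
  --in_graph=tf14.frozen.494_e120.LSTM.ldc93s1.pb \
  --out_graph=tf14.frozen.494_e120.LSTM.ldc93s1.pbmm
```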
Current status: we are starting to seriously compile TensorFlow's SYCL kernels, and we are hitting some issues in the VC4CL driver :-)
A lot of errors were fixed in VC4C and VC4CL; we're now hitting an issue with LLVM mangling of SPIR-V, and with a workaround in place there's a register allocation error afterwards.
No further progress on that: lack of time.
Apologies @lissyx for jumping in on this thread. I am trying to get the Raspberry Pi's GPU visible when running the `computecpp_info` tool, but I'm having trouble. Which OS are you running on the Pi3? (Raspbian Stretch is still only 32-bit, so I'm guessing it won't work properly with ComputeCpp.)
@McHughes288 I was running Stretch; there was an ARM32 ComputeCpp available. FYI it's still stalled: all the basics are covered, and it's now only blocked by the VC4CL driver not being able to digest our kernels. I'm off until September 5th; by then I'll be able to play again with the new, simpler model. Hopefully we'll see some breakthrough.
@edubois @renepeinl So, I've ordered a ROCKPro64 :-)
Cool @lissyx, I'm still waiting for the AI version (RockPro-64-ai).
The one I ordered is supposed to have an NPU with NNAPI support; is yours different?
I'm not sure how different they are, but there are two variants of the Rock64Pro: the Rock64Pro with a Rockchip RK3399, and the AI version with a Rockchip RK3399Pro. I might be wrong; tell me if you think I am.
Woops, I might have been misled by the naming ROCKPro64 :/
TF-coriander may be of interest as an OpenCL version of TensorFlow.
Thanks, but we want to avoid forks, and it seems the most active codebase is the ComputeCpp one, yet it's quite outdated (TensorFlow 1.8 last time; Coriander is at 0.18, so it's even older).
Speaking of outdated versions: the `ccpp` branch on GitHub looks quite outdated. Is it still the best starting point for getting GPU support working on Intel GPUs?
@renepeinl Sorry, but this was published with no more than best effort for those brave enough to play. Honestly, OpenCL support in TensorFlow does not look like a huge priority upstream, so we are focusing our efforts elsewhere.
See also #2270 for some TFLite related experiments / progresses.
FTR, with the RPi4 and the switch to the TFLite runtime, we exceed real time with only one core at 100%. So the incentive to leverage the GPU on those boards is getting lower.
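If anyone wants to sanity-check the single-core claim on their own RPi4, pinning the process is enough; a sketch, with the binary name and flags assumed to match the TFLite-era client:

```
# Sketch: pin inference to one core and time it. The --model/--audio flags
# and file names are assumptions based on the TFLite-era client.
taskset -c 0 ./deepspeech --model output_graph.tflite --audio test.wav -t
```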
Not everyone can afford to upgrade to an RPi 4 :confused:
Also, what if the GPU is being used for something else? And even if the CPU isn't maxed out, would using the GPU anyway yield faster results at all? Worth investigating, perhaps?