Investigate GPUs on ARM boards
We have ARM and ARM64 (#1305) arriving on RPi3B+ and LePotato boards. We should try and see if we can get GPU acceleration there:
- OpenCL can work not too badly on Intel GPUs (cf. the `ccpp` branch) using ComputeCpp
- LePotato has a Mali-450 GPU; no idea yet how much we can expect of OpenCL there
- RPi3B (and the `+` model) is now OpenCL 1.2 Embedded Profile compliant (https://github.com/doe300/VC4CL#opencl-support), and ComputeCpp has ARMv7 binaries targeting Ubuntu 14.04, so we might be able to get things working there (quick sanity check below).
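Before anything DeepSpeech-specific, it may be worth checking what OpenCL stack a given board exposes at all. A minimal sketch, assuming Debian/Raspbian with the stock `clinfo` package and the board's OpenCL ICD (e.g. VC4CL on RPi) already installed:

```
# Sketch: verify that an OpenCL platform/device is visible at all before
# touching TensorFlow. Assumes Debian/Raspbian with the clinfo package and
# the board's OpenCL ICD (e.g. VC4CL) already installed.
sudo apt-get install clinfo
sudo clinfo | grep -E 'Platform Name|Device Name|Device Version|Device Profile'
```

If no device shows up here, ComputeCpp won't see one either, since it goes through the same ICD loader.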
The Rock64Pro-AI has a Mali-T860; I would consider this option.
@edubois You're welcome to explore that, but I'll stick to what I have right now :)
Will do.
You state ARM boards but then mention Intel GPUs. Therefore, I would offer to do some testing on the UP board, which is powered by a 4-core Atom x5-Z8350, has 4 GB of RAM, and incorporates Cherry Trail HD graphics with 12 execution units. We are using that board since it is compatible with the Matrix Voice, our hardware for sound capturing.
We are also looking at the Intel Movidius Neural Compute Stick, which is compatible with TensorFlow and the Raspberry Pi 3, since the combination would be more cost-effective.
The Rock64Pro-AI looks interesting as well and would be even cheaper. I'm curious about @edubois's results.
@renepeinl I have actually been experimenting for quite some time with Intel GPUs on my laptop, debugging and checking performance (thanks to CodePlay people and Intel people), so I know we can get it working with the "Neo" driver, which sadly is far from being released yet. The Compute Stick is useless in our case because of RNNs. The previous driver, Beignet, was a dead end: not working with ComputeCpp (the layer TensorFlow uses for OpenCL), and no longer actively developed by Intel.
People who want to experiment should use the `ccpp` branch of our TensorFlow and DeepSpeech repos, but be aware it's a hack in progress :)
Thanks for this information. Could you provide some hints on compiler flags for building the software as well? I'm not sure how much influence they have, and we are mainly Java developers with no deeper knowledge of C++.
@renepeinl It should be pretty simple if you follow the docs we have in place and TensorFlow's building doc. For the ComputeCpp branch, you'll need to download the matching ComputeCpp version. Since it's a hack in progress I have not documented that, but you can look at the `tc-*.sh` shell scripts in our TensorFlow repo; they should contain everything. Basically: Bazel v0.10.0, ComputeCpp 0.5.1 (I think?), and the proper `./configure` flags (check `tc-vars.sh` mostly). On the DeepSpeech build side, nothing should change whether it's OpenCL or not.
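For the curious, a rough, non-interactive sketch of what that configure step amounted to; the variable names match TensorFlow 1.x's `./configure`, but the paths and versions here are assumptions, and `tc-vars.sh` remains the authoritative source:

```
# Sketch only: non-interactive answers for TensorFlow 1.x's ./configure
# on the ccpp branch. Paths are placeholders; see tc-vars.sh for the
# flags actually used on our CI.
export TF_NEED_OPENCL_SYCL=1                                # ComputeCpp/SYCL path
export COMPUTECPP_TOOLKIT_PATH="$HOME/ComputeCpp-CE-0.5.1"  # assumed install dir
export TF_NEED_CUDA=0
./configure
bazel build --config=sycl //tensorflow:libtensorflow.so
```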
@renepeinl If you run into issues, you can join us on IRC (#machinelearning on irc.mozilla.org) or on Discourse: https://discourse.mozilla.org/c/deep-speech
@renepeinl, the chip is not yet available; I will start when I get one, probably in September.
Good first milestone on RPi3:
- I have llvm-spirv cross-built
- I have been able to cross-compile vc4c, vc4clstdlib and vc4cl bits
- Debian packages install properly on Raspbian
- ComputeCpp 0.7.0 for Ubuntu 14.04 / ARM32 shows VC4 GPU
- Currently running the VC4C testsuite; there are failures, but there are successes as well, meaning the basics of the infrastructure are there and working (rough build outline below)
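For anyone wanting to reproduce this, a hypothetical outline of the build; the three upstream repos are real, but the toolchain file and options below are placeholders for whatever cross setup you use (check each project's README for the actual flags):

```
# Hypothetical outline of the cross-build above. The repos are upstream
# (github.com/doe300), but the toolchain file path below is a placeholder.
for repo in VC4CLStdLib VC4C VC4CL; do
  git clone https://github.com/doe300/$repo
done
# Build order matters: stdlib first, then the compiler, then the runtime.
cd VC4C && mkdir build && cd build
cmake .. -DCMAKE_TOOLCHAIN_FILE=../cmake/rpi-toolchain.cmake  # placeholder
make -j4 && make package   # yields the Debian packages installed on Raspbian
```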
```
$ sudo ComputeCpp-CE-0.7.0-Ubuntu-14.04-ARM_32/bin/computecpp_info --verbose --use-spirv
********************************************************************************
ComputeCpp Info (CE 0.7.0)
********************************************************************************
Toolchain information:
GLIBC version: 2.24
GLIBCXX: 20150426
This version of libstdc++ is supported.
********************************************************************************
Device Info:
Discovered 1 devices matching:
platform : <any>
device type : <any>
--------------------------------------------------------------------------------
Device 0:
Device is supported : NO - Device does not support SPIR
CL_DEVICE_NAME : VideoCore IV GPU
CL_DEVICE_VENDOR : Broadcom
CL_DRIVER_VERSION : 0.4
CL_DEVICE_TYPE : CL_DEVICE_TYPE_GPU
CL_DEVICE_VERSION : OpenCL 1.2 VC4CL 0.4
CL_DEVICE_PROFILE : EMBEDDED_PROFILE
CL_DEVICE_MAX_COMPUTE_UNITS : 1
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS : 3
CL_DEVICE_MAX_WORK_ITEM_SIZES : 12 / 12 / 12
CL_DEVICE_MAX_WORK_GROUP_SIZE : 12
CL_DEVICE_MAX_CLOCK_FREQUENCY : 300 MHz
CL_DEVICE_ADDRESS_BITS : 32
CL_DEVICE_HOST_UNIFIED_MEMORY : YES
CL_DEVICE_MAX_MEM_ALLOC_SIZE : 76 MByte
CL_DEVICE_GLOBAL_MEM_SIZE : 76 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT : NO
CL_DEVICE_LOCAL_MEM_TYPE : global
CL_DEVICE_LOCAL_MEM_SIZE : 77824 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE : 77824 KByte
CL_DEVICE_QUEUE_PROPERTIES : CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT : NO
CL_DEVICE_MAX_READ_IMAGE_ARGS : 64
CL_DEVICE_MAX_WRITE_IMAGE_ARGS : 64
CL_DEVICE_IMAGE2D_MAX_WIDTH : 2048
CL_DEVICE_IMAGE2D_MAX_HEIGHT : 2048
CL_DEVICE_IMAGE3D_MAX_WIDTH : 2048
CL_DEVICE_IMAGE3D_MAX_HEIGHT : 2048
CL_DEVICE_IMAGE3D_MAX_DEPTH : 2048
CL_DEVICE_PREFERRED_VECTOR_WIDTH : CHAR 16 SHORT 16 INT 16 LONG 0 FLOAT 16 DOUBLE 0
CL_DEVICE_EXTENSIONS : cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_nv_pragma_unroll cl_arm_core_id cl_ext_atomic_counters_32 cl_khr_initialize_memory
If you encounter problems when using any of these OpenCL devices, please consult
this website for known issues:
https://computecpp.codeplay.com/releases/v0.7.0/platform-support-notes
********************************************************************************
```
Some samples from the `TestVC4C` testsuite:
```
$ sudo LD_LIBRARY_PATH=$(pwd):$LD_LIBRARY_PATH ./TestVC4C
./example/fft2_2.cl
./example/fibonacci.cl
./example/fibonacci.spt
[W] Fri May 18 15:26:19 2018: Failed to remove empty basic block: label: %13
[W] Fri May 18 15:26:19 2018: Block has explicit predecessor: br %13
./example/hello_world.cl
./example/hello_world_vector.cl
./example/test.cl
[W] Fri May 18 15:26:27 2018: Warnings in precompilation:
[W] Fri May 18 15:26:27 2018: ./example/test.cl:27:6: warning: incompatible pointer to integer conversion initializing 'int' with an expression of type 'int *'; remove &
int n = &i;
^ ~~
1 warning generated.
./example/test_instructions.cl
./example/test_prime.cl
```
After fighting with TensorFlow's ComputeCpp branch to cross-compile for RPi3, I got something built. It's running as of now; no idea what we can expect so far, both in terms of output and in terms of speed:
```
pi@rpi3-opencl-20180518:~/deepspeech $ sudo ./deepspeech ~/tmp/deepspeech/models/tf14.frozen.494_e120.LSTM.ldc93s1.pb ~/tmp/deepspeech/models/alphabet.txt ~/tmp/deepspeech/audio/ -t
TensorFlow: v1.8.0-rc1-1904-g9989353054
DeepSpeech: v0.2.0-alpha.5-0-g7cc8382
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2018-05-23 13:54:53.988212: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:70] Found following OpenCL devices:
2018-05-23 13:54:53.988803: I ./tensorflow/core/common_runtime/sycl/sycl_device.h:72] id: 0, type: GPU, name: VideoCore IV GPU, vendor: Broadcom, profile: EMBEDDED_PROFILE
Running on directory /home/pi/tmp/deepspeech/audio/
> /home/pi/tmp/deepspeech/audio//2830-3980-0043.wav
2018-05-23 13:54:54.205643: W ./tensorflow/core/framework/allocator.cc:108] Allocation of 11713728 exceeds 10% of system memory.
2018-05-23 13:54:54.227152: W ./tensorflow/core/framework/allocator.cc:108] Allocation of 11713728 exceeds 10% of system memory.
2018-05-23 13:54:54.279121: W ./tensorflow/core/framework/allocator.cc:108] Allocation of 11713728 exceeds 10% of system memory.
2018-05-23 13:54:54.300364: W ./tensorflow/core/framework/allocator.cc:108] Allocation of 11713728 exceeds 10% of system memory.
2018-05-23 13:54:54.322776: W ./tensorflow/core/framework/allocator.cc:108] Allocation of 11713728 exceeds 10% of system memory.
```
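Side note on the "reading entire model file into memory" warning above: TensorFlow 1.x ships a contrib tool to convert a frozen graph to the memory-mapped format. A sketch, reusing the model file name from the run above (the output name is an assumption):

```
# Sketch: convert the frozen graph to TensorFlow's memory-mapped format to
# silence the warning above and cut heap usage. The tool lives in TF 1.x
# contrib; the output file name is an assumption.
bazel build -c opt tensorflow/contrib/util:convert_graphdef_memmapped_format
bazel-bin/tensorflow/contrib/util/convert_graphdef_memmapped_format \
  --in_graph=tf14.frozen.494_e120.LSTM.ldc93s1.pb \
  --out_graph=tf14.frozen.494_e120.LSTM.ldc93s1.pbmm
```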
Current status: we are starting to seriously compile TensorFlow's SYCL kernels, and we are hitting some issues in the VC4CL driver :-)
A lot of errors were fixed in VC4C and VC4CL; we're now hitting an issue with LLVM mangling of SPIR-V, and with a workaround in place there's a register allocation error afterwards.
No further progress on that: lack of time.
Apologies @lissyx for jumping in on this thread. I am trying to get the Raspberry Pi's GPU visible when running the `computecpp_info` tool, but I'm having trouble. Which OS are you running on the Pi3? (Raspbian Stretch is still only 32-bit, so I'm guessing it won't work properly with ComputeCpp.)
@McHughes288 I was running Stretch; there was an ARM32 ComputeCpp available. FYI it's still stalled: all the basics are covered, and it's now only blocked by the VC4CL driver not being able to digest our kernels. I'm off until September 5th; by then I'll be able to play again with the new, simpler model. Hopefully we'll see some breakthrough.
@edubois @renepeinl So, I've ordered a ROCKPro64 :-)
Cool @lissyx, I'm still waiting for the AI version (RockPro-64-ai).
The one I ordered is supposed to have an NPU with NNAPI support; is yours different?
I'm not sure how different they are, but there are two variants of the Rock64Pro: the Rock64Pro with a Rockchip RK3399, and the AI version with a Rockchip RK3399Pro. I might be wrong; tell me if you think I am.
Woops, I might have been misled by the naming ROCKPro64 :/
TF-coriander may be of interest as an OpenCL version of TensorFlow.
Thanks, but we want to avoid forks, and it seems the most active codebase is the ComputeCpp one, yet it's quite outdated (TensorFlow 1.8 last time; Coriander is at 0.18, so it's even older).
Speaking of outdated versions: the `ccpp` branch on GitHub looks quite outdated. Is it still the best starting point for getting GPU support working on Intel GPUs?
@renepeinl Sorry, but this was published with no more than best effort for those brave enough to play. Honestly, OpenCL support in TensorFlow does not look like a huge priority upstream, so we are focusing our efforts elsewhere.
See also #2270 for some TFLite related experiments / progresses.
FTR, with the RPi4 and the switch to the TFLite runtime, we exceed real time with only one core at 100%. So the incentive to leverage the GPU on those boards is getting lower.
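If anyone wants to sanity-check the single-core claim on their own RPi4, pinning the process is enough; a sketch, with the binary name and flags assumed to match the TFLite-era client:

```
# Sketch: pin inference to one core and time it. The --model/--audio flags
# and file names are assumptions based on the TFLite-era client.
taskset -c 0 ./deepspeech --model output_graph.tflite --audio test.wav -t
```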
Not everyone can afford to upgrade to an RPi 4 :confused:
Also, what if the GPU is being used for something else? And even if the CPU isn't maxed out, would using the GPU anyway yield faster results at all? Worth investigating, perhaps?