NumPy performance improvement
- Package name: numpy
- Link to PyPI page: https://pypi.org/project/numpy/
- Link to piwheels page: https://www.piwheels.org/project/numpy/
- Version: 1.16.4
- Python version: 3.7.3
Summary
The piwheels distribution of numpy doesn't use multithreading and is roughly twice as slow as the conda one. Would it be possible to change the build procedure to take advantage of multithreading?
Details
I benchmarked different distributed versions of numpy on a Raspberry Pi 3 with the following code:

```python
import time
import numpy as np

a = np.random.rand(1000, 1000)
b = np.random.rand(1000, 1000)
t0 = time.perf_counter()
c = a @ b
print(f"{time.perf_counter() - t0:.3f} s")
```
The piwheels and apt `python3-numpy` versions both took ~0.6 sec and used only one core. The conda one (https://github.com/jjhelmus/berryconda) took ~0.3 sec and used all of the cores.
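For context, the matrix product above performs about 2·1000³ = 2×10⁹ floating-point operations, so the measured times translate into throughput like this (the numbers below just restate the benchmark results):

```python
# A 1000x1000 matmul does ~2 * n^3 floating-point operations.
flops = 2 * 1000**3

for name, secs in [("piwheels/apt", 0.6), ("conda", 0.3)]:
    print(f"{name}: {flops / secs / 1e9:.1f} GFLOP/s")
# -> piwheels/apt: 3.3 GFLOP/s
# -> conda: 6.7 GFLOP/s
```

Both figures are well below what four Cortex-A53 cores can do in theory, but the 2x gap matches the single-core vs. all-cores observation.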
Running `np.__config__.show()` revealed that the conda version uses OpenBLAS, while the piwheels version uses ATLAS(?) with no threading.
Conda Python 3.6 with numpy 1.15.1:

```
>>> np.__config__.show()
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/home/pi/miniconda3/envs/py36/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/home/pi/miniconda3/envs/py36/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/home/pi/miniconda3/envs/py36/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/home/pi/miniconda3/envs/py36/lib']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
```
piwheels Python 3.7.3 with numpy 1.16.4:

```
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
atlas_3_10_blas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
atlas_blas_info:
    language = c
    define_macros = [('HAVE_CBLAS', None), ('NO_ATLAS_INFO', -1)]
    libraries = ['f77blas', 'cblas', 'atlas', 'f77blas', 'cblas']
    library_dirs = ['/usr/lib/arm-linux-gnueabihf']
accelerate_info:
  NOT AVAILABLE
blas_opt_info:
    language = c
    define_macros = [('HAVE_CBLAS', None), ('NO_ATLAS_INFO', -1)]
    libraries = ['f77blas', 'cblas', 'atlas', 'f77blas', 'cblas']
    library_dirs = ['/usr/lib/arm-linux-gnueabihf']
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
  NOT AVAILABLE
openblas_clapack_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
atlas_info:
    language = f77
    libraries = ['lapack', 'f77blas', 'cblas', 'atlas', 'f77blas', 'cblas']
    library_dirs = ['/usr/lib/arm-linux-gnueabihf']
    define_macros = [('NO_ATLAS_INFO', -1)]
lapack_opt_info:
    language = f77
    libraries = ['lapack', 'f77blas', 'cblas', 'atlas', 'f77blas', 'cblas']
    library_dirs = ['/usr/lib/arm-linux-gnueabihf']
    define_macros = [('NO_ATLAS_INFO', -1)]
```
Thanks for the detailed info, it's really appreciated. Usually when people ask for something like this they don't have anything to back it up.
Do you know how to build an optimised wheel?
I reproduced the conda build recipe, achieving ~0.3 sec:

- I built the latest (develop / r0.3.6) version of OpenBLAS, targeting ARMv7, with this script: https://github.com/jjhelmus/berryconda/blob/master/recipes/openblas/build.sh
- I built numpy against it with a proper `site.cfg` file (https://github.com/jjhelmus/berryconda/blob/master/recipes/numpy/build.sh). This requires the Fortran compiler: `apt install gfortran`.
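For reference, the `site.cfg` that points numpy's build at a locally built OpenBLAS looks roughly like this (the `/opt/OpenBLAS` prefix is only an example install location, not what the berryconda recipe uses):

```ini
[openblas]
libraries = openblas
library_dirs = /opt/OpenBLAS/lib
include_dirs = /opt/OpenBLAS/include
runtime_library_dirs = /opt/OpenBLAS/lib
```

The file goes next to numpy's `setup.py` before building the wheel.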
However, this has the problem that users would have to install libopenblas somehow and tell Python where to find the `.so` file (e.g. via `LD_LIBRARY_PATH`). Conda has a clear advantage here.
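Concretely, every user would need something like the following before starting Python (a sketch, assuming the hypothetical install prefix `/opt/OpenBLAS` from a manual build):

```shell
# Tell the dynamic linker where the locally built libopenblas.so lives,
# prepending it to any existing search path (prefix is illustrative).
export LD_LIBRARY_PATH=/opt/OpenBLAS/lib:${LD_LIBRARY_PATH:-}
echo "$LD_LIBRARY_PATH"
```

Alternatively, a file under `/etc/ld.so.conf.d/` plus a `ldconfig` run would make the library visible system-wide without the environment variable.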
Apt only supplies an armv6p-r0.3.5 version, which achieved ~0.55 sec. Is this performance difference due to the targeted architecture?
TL;DR
- armhf only guarantees that VFPv3-D16 is available, therefore the Debian OpenBLAS is compiled for ARMv6; hence the 2x speedup of the ARMv7 conda build.
- ~~Raspberry Pi 4 will be fine because ARM64 -> ARMv8.~~ Raspbian doesn't have a 64-bit version.
- ~~Experimenting with NEON right now.~~ Deprecated.
Details
Digging a bit deeper, I found that the armhf port of Debian has a minimum requirement of ARMv7 AND VFPv3-D16 (https://wiki.debian.org/ArmHardFloatPort#Hardware).
OpenBLAS's ARMv7 target doesn't support VFPv3-D16, only VFPv3-D32 (https://github.com/xianyi/OpenBLAS/issues/388), therefore the apt package targets ARMv6:
> * Add armhf support.
>   - Use ARMv6 target. We cannot currently use the ARMv7 target, because it requires VFPv3-D32 (and armhf only guarantees VFPv3-D16).
This is probably fixed in a newer version, but the result should be the same: using ARMv7 with only VFPv3-D16 should yield the same performance.
> * Fix crash with illegal instruction on armhf with static libraries.
>   + d/p/arm-gcc-flags.patch: enforce -march=armv7-a and -mfpu=vfpv3-d16 flags.

(https://metadata.ftp-master.debian.org/changelogs//main/o/openblas/openblas_0.2.19-3_changelog)
This is probably the exact reason for the 2x speedup: the apt build can't use all 32 double-precision FP registers.
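To see what the hardware itself supports, as opposed to what the armhf baseline guarantees, the kernel's CPU feature flags can be inspected (my assumption: on a Pi 3 the list includes flags such as `vfpv3` and `neon`, i.e. the CPU has more than the D16 baseline the distribution must assume):

```shell
# On ARM the line is named "Features"; on x86 the equivalent line is
# named "flags", so match either to keep the command portable.
grep -m1 -E '^(Features|flags)' /proc/cpuinfo
```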
This might not be an issue for the Raspberry Pi 4, because the arm64 package uses the ARMv8 build target. It might even work "out of the box". @TegzesTamas, please benchmark this.
> * Enable build on arm64 architecture.
>   + d/control: add arm64 to Architecture fields.
>   + d/rules: use TARGET=ARMV8 for arm64 arch.
>   + d/p/arm64.patch: new patch from upstream, to fix a build failure.

(https://tracker.debian.org/media/packages/o/openblas/changelog-0.3.6ds-1)
NEON

This makes me believe that OpenBLAS can be compiled with NEON: https://salsa.debian.org/science-team/openblas/blob/master/Makefile.arm#L3
Looks like VFPv3 in ARMv7 is faster and NEON is deprecated: https://github.com/xianyi/OpenBLAS/issues/562#issuecomment-99801945