numexpr
NE3: Newer cpuinfo.py
The current CPU interrogation utility is fairly old. We should investigate whether a better one is available under a BSD license that could be included in NE3 to provide support for a wider range of compiler flags.
Alternatively, we will likely need to resort to test-compiling very short C programs to assess what is possible with the compilers/architecture on the target system.
We would also like to be able to detect the L1 and L2 cache sizes, so that the BLOCK_SIZE could be adjusted at compile time.
We would like to be able to detect the number of physical (as opposed to virtual) cores, as hyperthreading usually slows down NumExpr.
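A minimal sketch of the test-compile approach could look like the following, assuming the default distutils compiler is usable at install time; the helper name has_flag is made up here, and MSVC would spell the flag differently (/arch:AVX2 instead of -mavx2):

```python
# Sketch only: probe whether the host C compiler accepts a given flag by
# test-compiling a trivial source file. Helper name and flag are illustrative.
import os
import tempfile
from distutils.ccompiler import new_compiler
from distutils.errors import CompileError
from distutils.sysconfig import customize_compiler

def has_flag(flag):
    """Return True if the default C compiler accepts `flag`."""
    compiler = new_compiler()
    customize_compiler(compiler)
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "flag_check.c")
        with open(src, "w") as f:
            f.write("int main(void) { return 0; }\n")
        try:
            compiler.compile([src], output_dir=tmp, extra_postargs=[flag])
        except CompileError:
            return False
    return True

if __name__ == "__main__":
    print("-mavx2 accepted:", has_flag("-mavx2"))  # MSVC would need /arch:AVX2
```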
A new cpuinfo.py has now been pushed in https://github.com/pydata/numexpr/commit/d35a846b1454f90d5d12be88100893b3cdd0a268, but it's quite slow, so it will need some work.
Just in case it helps: in Theano, we make this call to gcc: gcc -march=native -E -v -
This outputs something like:
Using built-in specs.
COLLECT_GCC=gcc
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.9.2-10' --with-bugurl=file:///usr/share/doc/gcc-4.9/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.9 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.9 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.9-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --with-arch-32=i586 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.9.2 (Debian 4.9.2-10)
COLLECT_GCC_OPTIONS='-march=native' '-E' '-v'
/usr/lib/gcc/x86_64-linux-gnu/4.9/cc1 -E -quiet -v -imultiarch x86_64-linux-gnu - -march=haswell -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mbmi2 -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mno-rtm -mno-hle -mrdrnd -mf16c -mfsgsbase -mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=15360 -mtune=generic
ignoring nonexistent directory "/usr/local/include/x86_64-linux-gnu"
ignoring nonexistent directory "/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../../x86_64-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:
/u/bastienf/.local/include
/Tmp/lisa/os_v5/cudnn_v7/cuda9.0
/Tmp/lisa/os_v5/include
.
/usr/lib/gcc/x86_64-linux-gnu/4.9/include
/usr/local/include
/usr/lib/gcc/x86_64-linux-gnu/4.9/include-fixed
/usr/include/x86_64-linux-gnu
/usr/include
End of search list.
There is this line:
/usr/lib/gcc/x86_64-linux-gnu/4.9/cc1 -E -quiet -v -imultiarch x86_64-linux-gnu - -march=haswell -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mbmi2 -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mno-rtm -mno-hle -mrdrnd -mf16c -mfsgsbase -mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=15360 -mtune=generic
At the end, you can see what gcc uses for the caches: --param l1-cache-size=32, --param l1-cache-line-size=64 and --param l2-cache-size=15360.
In case this helps you. Note that this isn't 100% reliable: some gcc versions had bad detection for certain combinations of gcc version and CPU.
This would be just one external call.
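For reference, a rough sketch of what that one external call could look like from Python; the function name is illustrative and it assumes gcc is on the PATH (the verbose cc1 line with the --param values is written to stderr):

```python
# Sketch: run `gcc -march=native -E -v -` and scrape the cache --param values
# from the verbose output, as described above. Assumes gcc is on PATH.
import re
import subprocess

def gcc_native_cache_params():
    """Return e.g. {'l1-cache-size': 32, 'l1-cache-line-size': 64, ...}."""
    proc = subprocess.run(
        ["gcc", "-march=native", "-E", "-v", "-"],
        input=b"",                      # gcc reads the program from stdin ('-')
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
    )
    text = proc.stderr.decode("utf-8", errors="replace")
    return {name: int(val)
            for name, val in re.findall(r"--param (l[12]-cache[\w-]*)=(\d+)", text)}

if __name__ == "__main__":
    print(gcc_native_cache_params())
```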
Hi Frédéric, thanks for the suggestion. One problem I can see is that a lot of people who have NumExpr installed don't have compilers; they're using pre-built wheels. I should mention that, aside from CPU flags, which are important for compiling, the feature I'd like to add is the ability to detect both virtual and physical core counts. Hyperthreading has never been beneficial for NumExpr in my experience, but most simple core checks report virtual cores. Some of the newer CPUs can report the difference, but older ones (like v2 Xeons) typically don't.
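A rough sketch of how such a physical-vs-virtual check might be done in Python; psutil is an optional dependency here, and the /proc/cpuinfo fallback is Linux-only and illustrative:

```python
# Sketch: best-effort physical vs. logical core counts.
# psutil is optional; the /proc/cpuinfo fallback is Linux-only and illustrative.
import os

def core_counts():
    """Return (physical, logical); falls back to (logical, logical) if unknown."""
    logical = os.cpu_count() or 1
    try:
        import psutil
        physical = psutil.cpu_count(logical=False)
        if physical:
            return physical, logical
    except ImportError:
        pass
    try:
        # Count unique (physical id, core id) pairs per processor record.
        cores, phys, core = set(), None, None
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("physical id"):
                    phys = line.split(":", 1)[1].strip()
                elif line.startswith("core id"):
                    core = line.split(":", 1)[1].strip()
                elif not line.strip() and core is not None:
                    cores.add((phys, core))
                    phys = core = None
        if cores:
            return len(cores), logical
    except OSError:
        pass
    return logical, logical

if __name__ == "__main__":
    print("physical/logical cores:", core_counts())
```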
Beware, my experience shows that hyperthreading can be beneficial for CPU-bound computations (around 25%, which matches well with Intel's predictions). See, for example, the 'Timings-Escher.pdf' in my materials for our 2011 ASPP course:
https://python.g-node.org/python-summerschool-2011/_media/materials/starving_cpus/starvingcpus-solutions.tar.gz
Escher was a machine with 8 physical cores and hyperthreading.
Francesc
Hi Francesc,
Thanks. I notice, though, that your OpenMPI benchmarks stop scaling at 8 cores.
I started putting in per-thread benchmarks. I'm not sure they are working exactly as intended, since threads can apparently sometimes miss timing information, but at first glance there are sometimes load-balancing issues. This could be due to web browsers and other apps stealing some clock cycles, or there could be some issue with how the NpyIter partitions the array.
Another thing to test at some point would be thread pinning, which might help particularly with multi-CPU nodes where there are physically separate L2 caches:
https://en.wikipedia.org/wiki/Processor_affinity
I don't know if there's also a way to allocate memory to take advantage of thread affinity. I've seen it on co-processors though.
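On Linux, a minimal sketch of per-thread pinning might look like this; os.sched_setaffinity is Linux-only, the worker function is illustrative, and whether pinning actually helps NumExpr would still have to be measured:

```python
# Sketch: pin each worker thread to its own core (Linux only).
# With pid 0, sched_setaffinity applies to the calling thread.
import os
import threading

def pinned_worker(core_id, work):
    os.sched_setaffinity(0, {core_id})   # restrict this thread to one core
    work()

if __name__ == "__main__":
    n = os.cpu_count() or 1
    threads = [threading.Thread(target=pinned_worker,
                                args=(i, lambda: sum(x * x for x in range(10**6))))
               for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```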
Yeah, thread pinning could help, especially on setups where there are separate CPUs. Perhaps that would be a case of over-optimization, but at any rate, it would be a good test.
Here's an example of some of the thread barrier benchmarking results from the submodule I'm working on:
In this case, the last thread never even starts. Usually all the threads start, but even then the barrier takes about 15-20% of the run time for 64k arrays. It could be a Windows issue; I'll have to set up a dual boot with a Linux distro on my machine and check, as I don't have a lot of confidence in my VirtualBox results.
That's weird indeed, but I must say that I don't usually do any benchmarking on Windows. FWIW, NumExpr on Windows uses an emulation of the pthreads API that was copied from the git project (see sources here). But doing a test on Linux will certainly help to pinpoint the problem.
This turned out to be a thread-safety issue with calling the Windows function QueryPerformanceCounter in the benchmarking macros. I fixed it, and now threads only fail to start if I'm in the hyperthreading regime.
Performance is looking nice on my i7-7820X but I still have to work on enabling AVX512:
I tried to make cpuinfo.py detect AVX2, but it has so many branches that I think testing it on different platforms would be painful. My current thinking is to make a Python wrapper for:
https://github.com/Mysticial/FeatureDetector
It relies on Intel's cpuid instruction, which is documented in Vol. 2A, pp. 3-190 to 3-205:
https://software.intel.com/en-us/articles/intel-sdm
We would need to figure out how to deal with ARM separately from x64's cpuid.
We would also need to figure out how to make setuptools and pip compile and run it before building NumExpr itself.
FeatureDetector looks nice, although lately I am quite reluctant to include C++ code in my projects (it has bitten me on a few occasions already). But again, it looks nice enough.
I made an effort to port it to C99 and to add cache-size and core-count detection:
https://github.com/robbmcleod/cpufeature
If you have a chance, check it out and see if it compiles and runs for you. I haven't been able to test it on any AMD machines, as I don't own any, nor on OSX.
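Usage is roughly along these lines; this assumes the interface shown in the cpufeature README (a CPUFeature dict plus a print_features() helper), so check the repository in case the names have changed:

```python
# Rough usage sketch for the cpufeature package; key names follow its README
# and may differ between versions.
import cpufeature

cpufeature.print_features()                      # dump everything it detected

info = cpufeature.CPUFeature                     # plain dict of detected features
print("AVX2:          ", info.get("AVX2"))
print("physical cores:", info.get("num_physical_cores"))
print("virtual cores: ", info.get("num_virtual_cores"))
```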