
NE3: Newer cpuinfo.py

Open robbmcleod opened this issue 7 years ago • 12 comments

The current CPU interrogation utility is fairly old; we should investigate whether there is a better one available under a BSD license that could be included in NE3 to provide support for a wider range of compiler flags.

Alternatively, we will likely need to resort to test-compiling very short C programs to assess what is possible with the compiler/architecture on the target system.

We would also like to be able to detect the L1 and L2 cache sizes, so that the BLOCK_SIZE could be adjusted at compile time.

We would like to be able to detect the number of physical (as opposed to virtual) cores, as hyper-threading usually slows down NumExpr.
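The test-compilation fallback mentioned above could be sketched as follows. This is only an illustration, not NE3 code: the function name and the choice of probing with a trivial `main` are assumptions, and the probe returns None when the requested compiler is not on PATH (e.g. for users running pre-built wheels).

```python
import os
import shutil
import subprocess
import tempfile

def compiler_supports_flag(cc, flag):
    """Try to compile a trivial C program with `flag`.

    Returns True if the compile succeeds, False if it fails,
    or None if the compiler is not found on PATH.
    """
    if shutil.which(cc) is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "probe.c")
        with open(src, "w") as f:
            f.write("int main(void) { return 0; }\n")
        result = subprocess.run(
            [cc, flag, "-o", os.path.join(tmp, "probe"), src],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        return result.returncode == 0
```

For example, `compiler_supports_flag("gcc", "-mavx2")` would tell us whether AVX2 codegen can even be requested on the build machine.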

robbmcleod avatar Mar 14 '17 03:03 robbmcleod

A new cpuinfo.py has been pushed now in https://github.com/pydata/numexpr/commit/d35a846b1454f90d5d12be88100893b3cdd0a268 but it's quite slow so it will need some work.

robbmcleod avatar Sep 14 '17 04:09 robbmcleod

Just in case it helps: in Theano, we make this call to gcc: gcc -march=native -E -v -. It outputs something like:

Using built-in specs.
COLLECT_GCC=gcc
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 4.9.2-10' --with-bugurl=file:///usr/share/doc/gcc-4.9/README.Bugs --enable-languages=c,c++,java,go,d,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.9 --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.9 --libdir=/usr/lib --enable-nls --with-sysroot=/ --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --with-system-zlib --disable-browser-plugin --enable-java-awt=gtk --enable-gtk-cairo --with-java-home=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64/jre --enable-java-home --with-jvm-root-dir=/usr/lib/jvm/java-1.5.0-gcj-4.9-amd64 --with-jvm-jar-dir=/usr/lib/jvm-exports/java-1.5.0-gcj-4.9-amd64 --with-arch-directory=amd64 --with-ecj-jar=/usr/share/java/eclipse-ecj.jar --enable-objc-gc --enable-multiarch --with-arch-32=i586 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
Thread model: posix
gcc version 4.9.2 (Debian 4.9.2-10) 
COLLECT_GCC_OPTIONS='-march=native' '-E' '-v'
 /usr/lib/gcc/x86_64-linux-gnu/4.9/cc1 -E -quiet -v -imultiarch x86_64-linux-gnu - -march=haswell -mmmx -mno-3dnow -msse -msse2 -msse3 -mssse3 -mno-sse4a -mcx16 -msahf -mmovbe -maes -mno-sha -mpclmul -mpopcnt -mabm -mno-lwp -mfma -mno-fma4 -mno-xop -mbmi -mbmi2 -mno-tbm -mavx -mavx2 -msse4.2 -msse4.1 -mlzcnt -mno-rtm -mno-hle -mrdrnd -mf16c -mfsgsbase -mno-rdseed -mno-prfchw -mno-adx -mfxsr -mxsave -mxsaveopt -mno-avx512f -mno-avx512er -mno-avx512cd -mno-avx512pf -mno-prefetchwt1 --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=15360 -mtune=generic
ignoring nonexistent directory "/usr/local/include/x86_64-linux-gnu"
ignoring nonexistent directory "/usr/lib/gcc/x86_64-linux-gnu/4.9/../../../../x86_64-linux-gnu/include"
#include "..." search starts here:
#include <...> search starts here:
 /u/bastienf/.local/include
 /Tmp/lisa/os_v5/cudnn_v7/cuda9.0
 /Tmp/lisa/os_v5/include
 .
 /usr/lib/gcc/x86_64-linux-gnu/4.9/include
 /usr/local/include
 /usr/lib/gcc/x86_64-linux-gnu/4.9/include-fixed
 /usr/include/x86_64-linux-gnu
 /usr/include
End of search list.

The interesting part is the cc1 invocation. At the end of it, you can see what gcc uses for the caches: --param l1-cache-size=32 --param l1-cache-line-size=64 --param l2-cache-size=15360.

In case this helps you. Note that this isn't 100% reliable: some combinations of gcc version and CPU had bad detection implemented.

This would be just one external call.
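Pulling those cache parameters out of the gcc output is a one-liner per value. A minimal sketch (the function name is made up, and it assumes the --param spelling shown in the output above; note that gcc reports the L1/L2 cache sizes in KiB and the line size in bytes):

```python
import re

def parse_cache_params(gcc_output):
    """Extract cache --param values from `gcc -march=native -E -v -` output.

    Returns a dict mapping param name to its integer value; params that
    do not appear in the output are simply omitted.
    """
    params = {}
    for name in ("l1-cache-size", "l1-cache-line-size", "l2-cache-size"):
        m = re.search(r"--param %s=(\d+)" % name, gcc_output)
        if m:
            # l1/l2-cache-size are in KiB, l1-cache-line-size in bytes
            params[name] = int(m.group(1))
    return params
```

Running this over the captured stderr of the gcc call would give `{"l1-cache-size": 32, "l1-cache-line-size": 64, "l2-cache-size": 15360}` for the Haswell output quoted above.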

nouiz avatar Sep 14 '17 14:09 nouiz

Hi Frédéric, thanks for the suggestion. One problem I can see is that a lot of people who have NumExpr installed don't have compilers; they're using pre-built wheels. I should mention that, aside from CPU flags (which are important for compiling), the feature I'd like to add is the ability to detect both virtual and physical core counts. Hyper-threading has never been beneficial for NumExpr in my experience, but most simple core checks report virtual cores. Some of the newer CPUs can report the difference, but older ones (like v2 Xeons) typically don't.
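On Linux, one compiler-free way to separate physical from virtual cores is to parse /proc/cpuinfo: logical cores are the "processor" entries, and physical cores are the unique (physical id, core id) pairs. A sketch under that assumption (x86 Linux only; the function takes the file contents as a string so it is easy to test):

```python
def count_cores(cpuinfo_text):
    """Count (logical, physical) cores from /proc/cpuinfo contents.

    Assumes the x86 Linux format where each logical processor block
    lists 'physical id' and 'core id' fields. Falls back to the
    logical count if those fields are absent.
    """
    logical = 0
    physical = set()
    phys_id = None
    for line in cpuinfo_text.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "processor":
            logical += 1
        elif key == "physical id":
            phys_id = value.strip()
        elif key == "core id":
            # 'core id' follows 'physical id' within each block
            physical.add((phys_id, value.strip()))
    return logical, len(physical) or logical
```

Calling it with `open("/proc/cpuinfo").read()` on a hyper-threaded machine would return, e.g., (8, 4).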

robbmcleod avatar Sep 14 '17 15:09 robbmcleod


Beware: my experience shows that hyper-threading can be beneficial for CPU-bound computations (around 25%, which matches Intel's predictions well). See for example the 'Timings-Escher.pdf' in my materials for our 2011 ASPP course:

https://python.g-node.org/python-summerschool-2011/_media/materials/starving_cpus/starvingcpus-solutions.tar.gz ​

Escher was a machine with 8 physical cores and hyper-threading.

Francesc


FrancescAlted avatar Sep 14 '17 16:09 FrancescAlted

Hi Francesc,

Thanks. I notice though that your openMPI benchmarks stop scaling at 8 cores.

I started putting in per-thread benchmarks. I'm not sure they are working exactly as intended (apparently threads can sometimes miss timing information), but at first glance there are sometimes load-balancing issues. This could be due to web browsers and other apps stealing clock cycles, or there could be an issue with how the NpyIter partitions the array.

Another thing to test at some point would be thread pinning, which might help particularly on multi-CPU nodes where there are physically separate L2 caches:

https://en.wikipedia.org/wiki/Processor_affinity

I don't know if there's also a way to allocate memory to take advantage of thread affinity; I've seen it on co-processors, though.
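For experimenting with pinning from Python on Linux, os.sched_setaffinity can restrict a thread to one CPU (with pid 0, the call applies to the calling thread). A minimal sketch; the helper name is made up, and on platforms without the call (Windows, macOS) it simply no-ops:

```python
import os

def pin_to_cpu(cpu_index):
    """Pin the calling thread to a single CPU (Linux only).

    Returns the resulting affinity set, or None where
    sched_setaffinity is unavailable.
    """
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {cpu_index})
        return os.sched_getaffinity(0)
    return None
```

Worker threads in a pool would each call this with a distinct index at startup; whether that actually beats the OS scheduler would need benchmarking, especially across sockets.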

robbmcleod avatar Sep 14 '17 17:09 robbmcleod

Yeah, thread pinning could help, especially on setups where there are separate CPUs. Perhaps that would be a case of over-optimization, but at any rate, it would be a good test.

FrancescAlted avatar Sep 14 '17 18:09 FrancescAlted

Here's an example of some of the thread barrier benchmarking results from the submodule I'm working on:

[figure: numexpr3_barrier_benchmark]

In this case, the last thread never even starts. Usually all the threads start, but even then the barrier takes about 15-20% of the run time for 64k arrays. It could be a Windows issue; I'll have to set up a dual-boot with a Linux distro on my machine to check, as I don't have a lot of confidence in my VirtualBox results.

robbmcleod avatar Sep 17 '17 19:09 robbmcleod

That's weird indeed, but I must say that I don't usually do any benchmarking on Windows. FWIW, NumExpr on Windows uses an emulation of the pthreads API that was copied from the Git project (see sources here). But doing a test on Linux will certainly help to pinpoint the problem.

FrancescAlted avatar Sep 18 '17 12:09 FrancescAlted

This turned out to be a thread-safety issue with calling the Windows function QueryPerformanceCounter in the benchmarking macros. I fixed it, and now threads only fail to start if I'm in the hyper-threading regime.

Performance is looking nice on my i7-7820X but I still have to work on enabling AVX512:

[figure: ne3_thread_scaling]

robbmcleod avatar Dec 11 '17 00:12 robbmcleod

I tried to make cpuinfo.py detect AVX2, but it has so many branches that I think testing it on different platforms would be painful. My current thinking is to make a Python wrapper for:

https://github.com/Mysticial/FeatureDetector

It relies on Intel's cpuid instruction, which is documented in Vol. 2A, pp. 3-190 to 3-205:

https://software.intel.com/en-us/articles/intel-sdm

We would need to figure out how to deal with ARM separately from x64's cpuid.
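Dispatching between the x86 cpuid path and an ARM path could start from the machine string Python already exposes. A sketch only; the function name and the backend labels are made up, and real ARM feature detection would still need a platform-specific source such as the 'Features' line in /proc/cpuinfo on Linux:

```python
import platform

def feature_detect_backend():
    """Pick a CPU feature-detection strategy by architecture.

    'cpuid' means the x86 cpuid instruction is available;
    'arm' means an OS-specific query would be needed instead.
    """
    machine = platform.machine().lower()
    if machine in ("x86_64", "amd64", "i386", "i686", "x86"):
        return "cpuid"
    if machine.startswith(("arm", "aarch64")):
        return "arm"
    return "unknown"
```

The wrapper could then load the cpuid-based extension only on the 'cpuid' branch and fall back to conservative defaults elsewhere.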

We would also need to figure out how to make setuptools and pip compile and run it first, before building NumExpr itself.

robbmcleod avatar Dec 30 '17 17:12 robbmcleod

FeatureDetector looks nice, although lately I am quite reluctant to include C++ code in my projects (it has bitten me on a few occasions already). But again, it looks nice enough.

FrancescAlted avatar Dec 30 '17 18:12 FrancescAlted

I made an effort to port it to C99, adding cache-size detection and core counting:

https://github.com/robbmcleod/cpufeature

If you have a chance, check it out and see if it compiles and runs for you. I haven't been able to test it on any AMD machines (as I don't own any) or on OSX.

robbmcleod avatar Jan 02 '18 01:01 robbmcleod
