Performance on AMD Ryzen and Threadripper
This report comes straight from #1425, where the discussion drifted from thread safety in openblas v. 0.2.20 to performance on AMD Ryzen and Threadripper processors (in this particular case a TR 1950X). It seems worthwhile to discuss this in a separate thread. The discussion so far:
@tkswe88: After plenty of initial tests with the AMD TR 1950X processor, it seems that openblas (tested 0.2.19, 0.2.20 and the development version on Ubuntu 17.10 with kernel 4.13, gcc and gfortran v. 7.2) runs roughly 20% slower on the TR1950X than on an i7-4770 (4 cores) when using 1, 2 and 4 threads. This is somewhat surprising given that both CPUs support at most AVX2 and thus should be comparable in terms of vectorisation potential. I have already adjusted the OpenMP thread affinity to rule out the (hidden) NUMA architecture of the TR1950X as the cause of the lower performance. Other measures I took were 1) a BIOS upgrade, 2) a Linux kernel upgrade to 4.15, and 3) increasing the DIMM frequency from 2133 to 2666 MHz. Except for the latter, which gave a speedup of roughly 3%, these measures had no effect on execution speed. Do you have any idea where the degraded performance on the TR1950X comes from? Is this due to a current lack of optimization in openblas, or do we just have to wait for the next major release of gcc/gfortran to fix the problem? Of course, I would be willing to run tests if this would help in developing openblas.
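As a quick sanity check that such affinity settings actually take effect, something like the following sketch (illustrative, not from the original exchange) lets every OpenMP thread report the core it runs on:

```c
/* affinity sanity check: print the logical CPU each OpenMP thread occupies.
   Build with: gcc -fopenmp -O2 affinity_check.c */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        /* with pinning in effect (e.g. OMP_PROC_BIND=true) the reported
           mapping should be stable across runs */
        printf("OpenMP thread %d runs on cpu %d\n",
               omp_get_thread_num(), sched_getcpu());
    }
    return 0;
}
```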
@brada4: AMD has slightly slower AVX and AVX2 units per CPU; it is by no means slow in general, and it still has a heap of cores to spare. Sometimes optimal AVX2 saturation means dropping the whole CPU package to its base, i.e. non-turbo, frequency.
@martin-frbg: Could also be that getarch is mis-detecting the cache sizes on TR, or that the various hardcoded block sizes from param.h for loop unrolling are "more wrong" on TR than they were on the smaller Ryzen. Overall, support for the Ryzen architecture is currently limited to treating it like Haswell, see #1133, #1147. There may be other assembly instructions besides AVX2 that are slower on Ryzen (#1147 mentions movntp*).
@tkswe88: Are there any tests I could do to find out about cache size detection errors or more appropriate settings for loop unrolling?
@brada4: What you asked martin - the copied parameters may need to be doubled or halved, at least here: https://github.com/xianyi/OpenBLAS/pull/1133/files#diff-7a3ef0fabb9c6c40aac5ae459c3565f0
@martin-frbg: You could simply check the autodetected information in config.h against the specification. (As far as I can determine, L3 size seems to be ignored as a parameter). As far as appropriate settings go, the benchmark directory contains a few tests that can be used for timing individual functions. Adjusting the values in param.h (see https://github.com/xianyi/OpenBLAS/pull/1157/files) is a bit of a black art though.
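For timing a single function outside the build tree, a minimal standalone harness in the spirit of the benchmark/ files could look like the sketch below (the matrix size and the 2n^3 flop count are illustrative choices, not anything prescribed by OpenBLAS):

```c
/* time one cblas_dgemm call and report GFLOPS.
   Build with: gcc -O2 time_dgemm.c -lopenblas */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>

int main(void) {
    const int n = 2000;
    double *a = malloc(sizeof(double) * n * n);
    double *b = malloc(sizeof(double) * n * n);
    double *c = malloc(sizeof(double) * n * n);
    for (int i = 0; i < n * n; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, n, n, n,
                1.0, a, n, b, n, 0.0, c, n);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    /* a square matrix multiply costs 2*n^3 floating point operations */
    printf("dgemm %dx%d: %.3f s (%.1f GFLOPS)\n",
           n, n, secs, 2.0 * n * n * n / secs * 1e-9);
    free(a); free(b); free(c);
    return 0;
}
```

Running it with OPENBLAS_NUM_THREADS set to 1, 2, 4, ... isolates the scaling of the one routine from the rest of an application.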
Here, I report on a first test with modified cache sizes.
The cache sizes reported by lstopo are, per core, L1d: 32 KB, L1i: 64 KB and L2: 512 KB, and, per CCX (4 cores per CCX), L3: 8 MB (i.e. 32 MB in total).
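For cross-checking lstopo against the kernel's view of the cache hierarchy, the sizes can also be read from sysfs; a minimal, Linux-specific sketch (mine, for illustration only):

```c
/* print the cache sizes the kernel reports for cpu0 via sysfs */
#include <stdio.h>

int main(void) {
    char path[128], buf[64];
    for (int i = 0; i < 8; i++) {   /* index0..index3 on typical x86 CPUs */
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu0/cache/index%d/size", i);
        FILE *f = fopen(path, "r");
        if (!f) break;              /* no more cache levels */
        if (fgets(buf, sizeof(buf), f))
            printf("cpu0 cache index%d: %s", i, buf);
        fclose(f);
    }
    return 0;
}
```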
Hence, I performed a test on the current development version.
Assuming L1 and L2 have to be reported per core and L3 in total, I modified the relevant lines of getarch.c to:
```c
#if defined (FORCE_ZEN)
#define FORCE
#define FORCE_INTEL
#define ARCHITECTURE    "X86"
#define SUBARCHITECTURE "ZEN"
#define ARCHCONFIG   "-DZEN " \
       "-DL1_CODE_SIZE=65536 -DL1_CODE_LINESIZE=64 -DL1_CODE_ASSOCIATIVE=8 " \
       "-DL1_DATA_SIZE=32768 -DL1_DATA_LINESIZE=64 -DL2_CODE_ASSOCIATIVE=8 " \
       "-DL2_SIZE=524288 -DL2_LINESIZE=64 -DL2_ASSOCIATIVE=8 " \
       "-DL3_SIZE=33554432 -DL3_LINESIZE=64 -DL3_ASSOCIATIVE=8 " \
       "-DITB_DEFAULT_ENTRIES=64 -DITB_SIZE=4096 " \
       "-DDTB_DEFAULT_ENTRIES=64 -DDTB_SIZE=4096 " \
       "-DHAVE_MMX -DHAVE_SSE -DHAVE_SSE2 -DHAVE_SSE3 -DHAVE_SSE4_1 -DHAVE_SSE4_2 " \
       "-DHAVE_SSE4A -DHAVE_MISALIGNSSE -DHAVE_128BITFPU -DHAVE_FASTMOVU -DHAVE_CFLUSH " \
       "-DHAVE_AVX -DHAVE_FMA3 -DFMA3"
#define LIBNAME   "zen"
#define CORENAME  "ZEN"
#endif
```
I hope these settings are correctly translated from lstopo. Next, I ran:

```
sudo make clean
sudo make TARGET=ZEN USE_OPENMP=1 BINARY=64 FC=gfortran
```
After this, config.h reads:

```c
#define OS_LINUX 1
#define ARCH_X86_64 1
#define C_GCC 1
#define __64BIT__ 1
#define PTHREAD_CREATE_FUNC pthread_create
#define BUNDERSCORE _
#define NEEDBUNDERSCORE 1
#define ZEN
#define L1_CODE_SIZE 65536
#define L1_CODE_LINESIZE 64
#define L1_CODE_ASSOCIATIVE 8
#define L1_DATA_SIZE 32768
#define L1_DATA_LINESIZE 64
#define L2_CODE_ASSOCIATIVE 8
#define L2_SIZE 524288
#define L2_LINESIZE 64
#define L2_ASSOCIATIVE 8
#define L3_SIZE 33554432
#define L3_LINESIZE 64
#define L3_ASSOCIATIVE 8
#define ITB_DEFAULT_ENTRIES 64
#define ITB_SIZE 4096
#define DTB_DEFAULT_ENTRIES 64
#define DTB_SIZE 4096
#define HAVE_MMX
#define HAVE_SSE
#define HAVE_SSE2
#define HAVE_SSE3
#define HAVE_SSE4_1
#define HAVE_SSE4_2
#define HAVE_SSE4A
#define HAVE_MISALIGNSSE
#define HAVE_128BITFPU
#define HAVE_FASTMOVU
#define HAVE_CFLUSH
#define HAVE_AVX
#define HAVE_FMA3
#define FMA3
#define CORE_ZEN
#define CHAR_CORENAME "ZEN"
#define SLOCAL_BUFFER_SIZE 24576
#define DLOCAL_BUFFER_SIZE 32768
#define CLOCAL_BUFFER_SIZE 12288
#define ZLOCAL_BUFFER_SIZE 8192
#define GEMM_MULTITHREAD_THRESHOLD 4
```
Finally, I installed using:

```
sudo make PREFIX=/usr/local install
```
After this, config.h has changed and now reads:

```c
#define OS_LINUX 1
#define ARCH_X86_64 1
#define C_GCC 1
#define __64BIT__ 1
#define PTHREAD_CREATE_FUNC pthread_create
#define BUNDERSCORE _
#define NEEDBUNDERSCORE 1
#define ZEN
#define L1_CODE_SIZE 32768
#define L1_CODE_ASSOCIATIVE 8
#define L1_CODE_LINESIZE 64
#define L1_DATA_SIZE 32768
#define L1_DATA_ASSOCIATIVE 8
#define L1_DATA_LINESIZE 64
#define L2_SIZE 524288
#define L2_ASSOCIATIVE 8
#define L2_LINESIZE 64
#define L3_SIZE 33554432
#define L3_ASSOCIATIVE 10
#define L3_LINESIZE 64
#define ITB_SIZE 4096
#define ITB_ASSOCIATIVE 0
#define ITB_ENTRIES 64
#define DTB_SIZE 4096
#define DTB_ASSOCIATIVE 0
#define DTB_DEFAULT_ENTRIES 64
#define HAVE_CMOV
#define HAVE_MMX
#define HAVE_SSE
#define HAVE_SSE2
#define HAVE_SSE3
#define HAVE_SSSE3
#define HAVE_SSE4_1
#define HAVE_SSE4_2
#define HAVE_SSE4A
#define HAVE_AVX
#define HAVE_FMA3
#define HAVE_CFLUSH
#define HAVE_MISALIGNSSE
#define HAVE_128BITFPU
#define HAVE_FASTMOVU
#define NUM_SHAREDCACHE 1
#define NUM_CORES 1
#define CORE_ZEN
#define CHAR_CORENAME "ZEN"
#define SLOCAL_BUFFER_SIZE 24576
#define DLOCAL_BUFFER_SIZE 32768
#define CLOCAL_BUFFER_SIZE 12288
#define ZLOCAL_BUFFER_SIZE 8192
#define GEMM_MULTITHREAD_THRESHOLD 4
```
Note in particular that L1_CODE_SIZE was reset to 32 KB, and that some of the L?_ASSOCIATIVE values have changed as well.
/usr/local/include/openblas_config.h, which was generated during the installation, has copied the entries of the config.h generated during compilation (i.e. it has the intended L1_CODE_SIZE of 64 KB).
I have not observed any performance improvement from the modified cache parameters. But I wonder whether the changes to config.h (L1_CODE_SIZE) during installation may have adverse effects on performance.
I will do more tests using the HASWELL target instead of ZEN in the development version and try to find out about loop unrolling.
I suspect that because you ran make install without repeating the earlier TARGET=ZEN argument, it re-ran getarch, which seems to misdetect at least the L1 code size. As no part of the library got rebuilt, and the correct version of openblas_config.h got installed, this should not cause any problems.
Edited to add: there does not appear to be any use of L1_CODE_SIZE in the code anyway; even L1_DATA_LINESIZE appears to be used by a few old targets only. It would seem that the only hardware parameters to get right are DTB_ENTRIES (used in level2 BLAS loop unrolling and POTRF) and L2_DATA_SIZE (used for buffer allocation in driver/others/memory.c). Both seem correct in what you wrote above.
@martin-frbg: You guessed correctly. Including the TARGET=ZEN argument in the install step led to the right entries in config.h after installation, but this did not improve performance.
BTW this was also the case in the original LibGoto - L2 size was used only to derive the xGEMM_P parameters for Core2, Opteron and earlier, while L1 and L3 size were apparently unused. It seems most of the hardware parameters are detected and reported "just in case" now, but perhaps CPU development has stabilized in the sense that a given type id will no longer have variants that vary in L1 or L2 properties. (There is one oddity in l2param.h where it uses L1_DATA_LINESIZE to determine an offset, but again your value appears to be correct already.)
Somewhere deep in Wikipedia it is said that Zen (Epyc/Threadripper) changed the cache from write-through to write-back. That may mean the effective cache size easily halves.
Somehow I doubt that, or at least its relevance for the detected cache sizes. On the other hand I feel there is a need to find out which of the many parameters in param.h and elsewhere that are faithfully copied for new cpu models are actually used in the current code, and how critical their influence is. Starting from the fragment of the param.h change I linked to above, it looks to me that SNUMOPT is completely unused, and DNUMOPT has a single appearance in a debug statement where it seems to be part of a theoretical efficiency factor for the syrk function.
I have been looking a bit more at the suspected performance shortfall on the AMD Threadripper 1950X, its causes, and the consequences of thread oversubscription on the TR1950X.
1. Regarding the previously reported 20% lower performance of the TR1950X compared to the i7-4770 using 1, 2 and 4 threads, I need to correct that this was for the total runtime of my code for one particular example. For dposv, dsyrk and zgbtrf using openblas 0.2.19, 0.2.20 and the development version (0.3.0), the i7-4770 needs about 30-35% less time than the TR1950X in my example. I stumbled across a document (seemingly from AMD) at http://32ipi028l5q82yhj72224m8j.wpengine.netdna-cdn.com/wp-content/uploads/2017/03/GDC2017-Optimizing-For-AMD-Ryzen.pdf that strongly recommends avoiding software prefetch on Ryzen platforms because it would prevent loop unrolling. The test example written in C and presented in the pdf reports a 15% speedup from not prefetching. However, that example was compiled with Microsoft Visual Studio, and it is in this setup that software prefetching prevents loop unrolling. Do you think this might be a problem in openblas when compiling with gcc, or is there hardcoded loop unrolling in openblas that would not be susceptible to this? (A sketch of the prefetch pattern in question follows this list.) @brada4: On the TR1950X, I monitored the clock speeds of all active cores via /proc/cpuinfo; under load they all seem to run at 3.75 GHz with little variation (the base frequency of the TR1950X is 3.4 GHz). Under load, the frequencies of the i7-4770 were 3.5 to 3.9 GHz (according to i7z_GUI). So this is not a big difference, it seems.
2. Using 8-16 threads on the TR1950X, I observed a version-related speedup of 2-5% in parallelised dsyrk and dposv for each of these thread counts when switching from v. 0.2.19 to v. 0.2.20 or the development version. For 1 thread, there was no difference; for 4 threads, the improvement was at best 1%. For ZGBTRF and ZGBTRS, I cannot make an educated statement about possible improvements between the versions because of my self-inflicted thread oversubscription (cf. #1425 and below). However, selecting a recent openblas version has advantages unless one oversubscribes the threads (next point).
3. Regarding the thread oversubscription (calling parallelised ZGBTRF and ZGBTRS from within an OpenMP-parallelised loop in my code), I ran my code linked against openblas 0.2.19, 0.2.20 and the development version. The total run-time of this loop shows an interesting behaviour with respect to the number of threads and the openblas version:
| no. threads | runtime 0.2.19 [s] | runtime 0.2.20 [s] | runtime devel [s] |
|-------------|--------------------|--------------------|-------------------|
| 1           | 35                 | 32.5               | 32.5              |
| 4           | 11.7               | 11.9               | 11.8              |
| 8           | 8.3                | 14.5               | 25.3              |
| 12          | 6.1                | 20.3               | 32.3              |
| 16          | 6.1                | 24.3               | 40.1              |
These numbers are for compilation of openblas and my code with gcc/gfortran v. 7.2 at optimisation level -O2 (-O3 reduces the run times by about 1 s). Up to 4 threads, the performance is comparable across versions, and the thread oversubscription does not really seem to play a big role. At higher thread counts, v. 0.2.19 still sees a decent speedup, but versions 0.2.20 and devel show a clear deterioration in performance as the number of threads grows. Note that these examples were run after correcting lapack-netlib/SRC/Makefile of versions 0.2.20 and devel and recompiling (cf. #1425), so I get correct results in all of these tests.
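To illustrate point 1, this is the kind of software-prefetch pattern the AMD document warns about. The sketch below is mine (it is not OpenBLAS code), and the prefetch distance of 64 elements is an arbitrary value; on Zen one would want to time the loop both with and without the hints:

```c
/* dot product with explicit software prefetch hints; __builtin_prefetch is a
   GCC builtin, and prefetching past the end of the arrays is harmless
   (the hint never faults) */
#include <stddef.h>

double dot_prefetch(const double *x, const double *y, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        __builtin_prefetch(&x[i + 64], 0, 0);  /* read-only, low temporal locality */
        __builtin_prefetch(&y[i + 64], 0, 0);
        s += x[i] * y[i];
    }
    return s;
}
```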
Since there are general speed improvements in recent openblas versions, I would like to get around the thread-oversubscription problem. However, I would need sequential versions of ZGBTRF and ZGBTRS inside the aforementioned loop, and parallelised versions of DSYRK, DPOSV, etc. in other parts of the code. This seems to require compiling sequential and parallel versions of the openblas library, linking against both, and then somehow picking the right (sequential or parallel) version of each required BLAS and LAPACK routine. If openblas came as a Fortran module, this would be a no-brainer: one could just use the "use ..., only :: ... => ..." mechanism of Fortran to restrict the import of symbols and to rename them. I have been searching the net for linker-related solutions that provide a similar mechanism, but to no avail. Do you have a document or webpage with a solution to this at hand?
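For what it is worth, one conceivable approach (a sketch of mine, not an established solution from this thread): build a second, single-threaded copy of the library (e.g. with USE_THREAD=0 into a separate prefix) and pull the sequential routines in at run time via dlopen/dlsym, while the executable stays linked against the parallel build. The install path and the abbreviated zgbtrf_ prototype below are assumptions:

```c
/* mix a sequential and a threaded OpenBLAS in one program.
   Build with: gcc -O2 main.c -lopenblas -ldl */
#include <dlfcn.h>
#include <stdio.h>

/* LAPACK-style Fortran symbol; the full argument list is abbreviated here */
typedef void (*zgbtrf_t)(const int *m, const int *n, const int *kl,
                         const int *ku, void *ab, const int *ldab,
                         int *ipiv, int *info);

int main(void) {
    /* RTLD_LOCAL keeps the sequential build's symbols from shadowing the
       threaded library the executable is linked against */
    void *seq = dlopen("/opt/openblas-serial/lib/libopenblas.so",
                       RTLD_NOW | RTLD_LOCAL);
    if (!seq) { fprintf(stderr, "%s\n", dlerror()); return 1; }
    zgbtrf_t zgbtrf_seq = (zgbtrf_t)dlsym(seq, "zgbtrf_");
    if (!zgbtrf_seq) { fprintf(stderr, "%s\n", dlerror()); return 1; }
    /* ... call zgbtrf_seq(...) inside the OpenMP loop, and the normally
       linked (threaded) dsyrk_/dposv_ elsewhere ... */
    dlclose(seq);
    return 0;
}
```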
Some thoughts -
- prefetch should be addressed automatically by gcc for C and Fortran. The Haswell assembly files have a bunch of prefetcht0 instructions that may need looking at.
- Need to look again at what these two functions (ZGBTRF and ZGBTRS) call internally.
- There have been both LAPACK version updates and pthread safety fixes between each pair of releases, and develop has new GEMM thread scheduling from #1320. In addition, develop has a less efficient AXPY at the moment due to #1332 (which would be easy to revert). But offhand there is nothing I would immediately blame for the observed, rather serious loss of performance. (This may warrant another git bisect...)
Fine, I will run git bisect on Monday.
Thanks. BTW there is a utility function openblas_set_num_threads() but according to #803 not all parts of the code honor it.
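For reference, usage would look roughly like the sketch below (my illustration; the thread counts are arbitrary, and per #803 not every code path honours the setting):

```c
/* serialize BLAS inside the oversubscribed OpenMP loop, then restore a
   threaded BLAS for the dsyrk/dposv phase elsewhere in the program */
extern void openblas_set_num_threads(int num_threads);

void band_factor_phase(void) {
    openblas_set_num_threads(1);   /* BLAS calls below run single-threaded */
    #pragma omp parallel for
    for (int i = 0; i < 33; i++) {
        /* ... zgbtrf_/zgbtrs_ calls for subproblem i ... */
    }
    openblas_set_num_threads(16);  /* back to a parallel BLAS afterwards */
}
```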
ZGBTRF did not change between LAPACK 3.6.1 and 3.8.0, and is mostly IZAMAX + ZSCAL + ZGERU + ZTRSM + ZGEMM. Nothing I'd immediately identify as having undergone any drastic changes since 0.2.19. (Except the post-0.2.20 GEMM thread scheduling mentioned above).
Perhaps a good starting point would be 9e4b697 - a few months after 0.2.19, and shortly before both a big LAPACK update (with risk of collateral damage) and a set of thread safety fixes. If all is well up to that point, it should still display the good performance of 0.2.19. Next would be something like 99880f7 , with my thread fixes in place and the LAPACK update stabilized, immediately before the first attempt at adding Zen support. (You will need the lapack Makefile fix from here on). Then fa6a920 - a few weeks before 0.2.20, nothing significant should have changed in between, and 00c42dc, well past 0.2.20 and shortly before the rewrite of the GEMM thread scheduler.
I have finished bisecting between 0.2.19 and 0.2.20 to find the cause of the performance degradation reported in point 3 (see above) when ZGBTRF and ZGBTRS are called from within an OpenMP parallelised loop. The resulting output is:
```
87c7d10b349b5be5ba2936bfedb498fe4f991e25 is the first bad commit
commit 87c7d10b349b5be5ba2936bfedb498fe4f991e25
Author: Martin Kroeker <[email protected]>
Date:   Sun Jan 8 23:33:51 2017 +0100

    Fix thread data races detected by helgrind 3.12

    Ref. #995, may possibly help solve issues seen in 660,883

:040000 040000 9f41e2cd82dc83e84b65d32000d6341cc7e417a8 bcd37e483226009ef28ac179f7268fe419e0b73d M	driver
```
As already reported in point 3 (see above), there was an additional performance degradation between 0.2.20 and the development version. Would you like to have the bisection results on that, too, or shall we see whether fixing the problem between 0.2.19 and 0.2.20 removes the additional degradation?
This is bad, as it means we will probably need someone more experienced with thread programming than me to improve this. :-(
(There must be millions of such people, but seeing that my PR went in unchallenged probably none of them on this project) In the worst case, we may be stuck with a decision between fast or safe code.
At least it should be only one of the two files affected by that PR that plays a role here - in an OpenMP build, blas_server_omp.c replaces blas_server.c so we should have only my changes in memory.c to worry about. As they were driven by helgrind reports I still believe they were substantially correct, but perhaps they can simply be made conditional on ifndef OPENMP
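The proposed pattern would look roughly like this (an illustrative fragment with simplified names mimicking driver/others/memory.c, not the actual patch):

```c
/* keep the extra pthread locking for pure pthreads builds, but compile it
   out when OpenMP already serializes the memory management */
#include <pthread.h>

static pthread_mutex_t alloc_lock = PTHREAD_MUTEX_INITIALIZER;

static void update_buffer_table(void) {
#ifndef USE_OPENMP
    pthread_mutex_lock(&alloc_lock);    /* needed for raw pthreads callers */
#endif
    /* ... search/update the shared memory allocation table ... */
#ifndef USE_OPENMP
    pthread_mutex_unlock(&alloc_lock);
#endif
}
```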
Given your find, it is entirely possible that the later degradation is from #1299 where I touched the same files again, nine months older but probably no wiser. (A brute force test could be to drop the 0.2.19 memory.c into current develop and see if this restores the original performance - I think it would still compile despite some intervening changes)
I have copied driver/others/memory.c from 0.2.19 to the development version and recompiled successfully. This has restored the good performance observed in 0.2.19.
I have now merged a (supposed) fix that uses all the additional locks only in multithreaded builds that do not employ OpenMP. This should restore pre-0.2.20 OpenMP performance without completely reverting the thread safety fixes, hopefully something can be done about their (probable) performance impact on pure pthreads builds in the future. (Though possibly the impact was worst with OpenMP, if the change was adding another layer of locks on top of what OpenMP already imposed)
The new version does not deliver correct results when compiled with
```
make TARGET=ZEN USE_OPENMP=1 BINARY=64 COMMON_OPT='-O2 -march=znver1 -mtune=znver1' FC=gfortran
make TARGET=ZEN USE_OPENMP=1 BINARY=64 COMMON_OPT='-O2 -march=znver1 -mtune=znver1' FC=gfortran PREFIX=/usr/local install
```
and running more than 1 thread.
Sorry. Seems I added ifdefs around a few locks that were there unconditionally in 0.2.19. Somehow none of the standard tests was able to flag this on my hardware.
Sorry, I still get wrong results.
Hmm, thanks. I had missed one spot near line 1120 that had a blas_unlock from the earlier version still commented out, hopefully this was causing it. Apart from that, there are only cases where I had to move a lock outside an if() block that uses a thread variable in the conditional - I do not see a functional difference but can duplicate the offending code block if necessary.
I have recompiled, but still get wrong results except when only 1 thread is used:
- Using 2 and 4 threads, I just get wrong results.
- Using 8 and 12 threads, the computations get so far off that IEEE_DIVIDE_BY_ZERO floating-point exceptions are signalled from my own code.
- Using 16 threads, I even got an output from somewhere in ZGBTRF or ZGBTRS: `BLAS : Bad memory unallocation! : 64 0x7f775d046000`
Hopefully, the latter message can help to locate the problem.
I have reverted the previous commit for now. While that "bad unallocation" message does originate from memory.c, I do not understand how the revised version of my changes could be causing it.
Unfortunately I cannot reproduce your problem with any of the tests, nor with the software I normally use. BLAS-Tester suggests there is currently a problem with TRMV, but this is unrelated to the version of memory.c in use (and may be fallout from attempts to fix #1332).
Do you think there are any further tests I could do, or would you recommend just copying driver/others/memory.c from 0.2.19 to the development version to get higher performance (despite the thread safety issues in that older version)?
I am about to upload another PR where absolutely all locking-related changes will be encapsulated in if(n)def USE_OPENMP . If that still does not work for you, some simplified testcase will be needed.
With the updated memory.c, the results are fine, but the run-time performance is as degraded as in v. 0.2.20.
Weird. I believe a side-by-side comparison of the old and new memory.c will show that they are now functionally equivalent (with respect to locking) for USE_OPENMP=1. Did you do a full rebuild (starting from make clean)?
Running `cpp -I../.. -DUSE_OPENMP -DSMP` on both "old" and "new" memory.c definitely leads to functionally equivalent code with just a few shuffled lines. The only other major difference between them is my addition of cgroup support in get_num_procs (for #1155, see PR #1239 for the actual code change in memory.c); perhaps you could try commenting that one out as well.
Sorry for the late response! Running the suggested cpp command, the changes I seem to see are:
- changes in pthread_mutex_(un)lock,
- new definitions of pthread_mutex_t,
- some "if (!memory_initialized) {" statements from 0.2.19 being reshuffled,
- some blas_(un)lock being replaced by pthread_mutex_(un)lock,
- blas_goto_num and blas_omp_num removed.
In both cpp-ed versions of memory.c, the #include preprocessor directives are replaced by numeric codes after #.
Are you comparing the memory.c of 0.2.19 to the latest one from the as-yet unmerged PR? There the cpp-processed files should show no replacements of blas_(un)lock by pthread_mutex_(un)lock and no new definitions of pthread_mutex_t (only the definition of the alloc_lock moves up some 200 lines, but all new uses should be filtered out by the "ifndef USE_OPENMP").
The deletion of blas_goto_num and blas_omp_num is from an unrelated recent code cleanup patch.
All other lines should be the same, except for the two conditionals in blas_memory_free() swapping places - "position < NUM_BUFFERS" now being checked before position is used as an address into the memory array (from #1179 - unfortunately git blame shows the wrong info here, attributing it to the reversion of an unrelated change).
You were right. It seems I looked at the wrong version of memory.c by just taking the latest development version; sorry for that! I have now downloaded memory.c from #1468, i.e. https://github.com/martin-frbg/OpenBLAS/blob/7646974227a51a6c9adc9511593f5630f8fb59ee/driver/others/memory.c (please confirm that this is the right version to look at). Using this version, everything seems fine: I get the right results, and the run-times show a slight improvement over those for openblas 0.2.19 and gcc 7.2.
Referring to point 3 in the list above, the run-times for the OpenMP-parallelised loop calling parallelised ZGBTRF and ZGBTRS in my code are now as follows:
| no. threads | runtime 0.2.19 + gcc 7.2 [s] | runtime devel + gcc 8.0 [s] |
|-------------|------------------------------|-----------------------------|
| 1           | 35                           | 31.7                        |
| 4           | 11.7                         | 11.7                        |
| 8           | 8.3                          | 8.1                         |
| 12          | 6.1                          | 6.1                         |
| 16          | 6.1                          | 5.9                         |
In this specific example, the loop does 33 iterations with equal work load, so both 12 and 16 threads need ceil(33/12) = ceil(33/16) = 3 passes over the thread pool; hence, no improvement can be expected when going from 12 to 16 threads.
Anyhow, even using gcc 8.0 and the latest development version with the PR version of memory.c does not bring the performance close to that of an i7-4770 for 1 to 4 threads (point 1 in the list above). Do you think there would be any value in trying ATLAS to automatically identify potentially optimal settings for Threadripper processors and then importing these findings into openblas?