OpenBLAS icon indicating copy to clipboard operation
OpenBLAS copied to clipboard

Issue with LAPACKE_sgesvd() on custom compiled v0.3.7 for Win64 (new x86_64-w64-mingw32 might be the cause)

Open Arech opened this issue 6 years ago • 66 comments

Hi there.

I have a wrapper over LAPACKE_sgesvd() that works well with supplied binary v0.2.19/20, custom compiled v.0.2.20 and supplied binary v0.3.7. However, the code doesn't work well with v0.3.7 compiled (same options as I did for v0.2.20) on a fresh Debian10 with fresh compilers CC=x86_64-w64-mingw32-gcc FC=x86_64-w64-mingw32-gfortran (I'll describe later details of the compilation process)

Testing environment: Windows 7 x64 with latest updates on AMD Phenom II X6 1090.

The code in question is the same as the following python script:

    def sample(self, shape):
        if len(shape) < 2:
            raise RuntimeError("Only shapes of length 2 or more are "
                               "supported.")

        flat_shape = (shape[0], np.prod(shape[1:]))
        a = get_rng().normal(0.0, 1.0, flat_shape)
        u, _, v = np.linalg.svd(a, full_matrices=False)
        # pick the one with the correct shape
        q = u if u.shape == flat_shape else v
        q = q.reshape(shape)
        return floatX(self.gain * q)

It takes random N(0,1) matrix and performs SVD on it. Here's first few floats of sample input (colmajor matrix 64*785):

0x000000000D4C00C0      -0.909242570     -0.741646349     -0.169360474     -0.177789196  
0x000000000D4C00D0      -0.341049701     -0.345561802     -0.421100467     -0.359291792  
0x000000000D4C00E0     -0.0570527203       1.12855971      -1.45928419      0.384212315  
...

All tested versions of LAPACKE_sgesvd() works great, except custom compiled v0.3.7, which despite returning success (0), outputs the following junk:

0x000000000D4C00C0   -3.40282347e+38  -3.40282347e+38      0.000000000      0.000000000  
0x000000000D4C00D0    3.40282347e+38   3.40282347e+38  -3.40282347e+38   3.40282347e+38  
0x000000000D4C00E0   -3.40282347e+38      0.000000000      0.000000000  -3.40282347e+38  
...

Or in DWORDS

0x000000000D4C00C0  ff7fffff ff7fffff 00000000 00000000  
0x000000000D4C00D0  7f7fffff 7f7fffff ff7fffff 7f7fffff  
0x000000000D4C00E0  ff7fffff 00000000 00000000 ff7fffff  
...

Note, that I had to do custom compilation, because the supplied binary still doesn't use the CONSISTENT_FPCSR=1 switch and I eventually get a lots of NaNs that seriously slows computation down.


The compilation process is basically the same as described in the issue linked above ( #1237 ). I installed fresh Debian10 on a virtual machine, did all the boilerplates

apt-get update
apt-get upgrade
apt-get install make cmake gcc mingw-w64 gfortran-mingw-w64

and then ran

make clean
make DYNAMIC_ARCH=0 CONSISTENT_FPCSR=1 CC=x86_64-w64-mingw32-gcc FC=x86_64-w64-mingw32-gfortran HOSTCC=gcc NUM_THREADS=6 TARGET=BARCELONA PREFIX=/opt/OpenBLAS
make install

to obtain kind of distro in /opt/OpenBLAS. Then I copy the contents of /opt/OpenBLAS to the windows system, compile my project over it, copy libopenblas.dll to my exe's folder and get the issue with LAPACKE_sgesvd() when I run my code.

Note, that I haven't found any issues with CBLAS routines I use (mainly gemm, syrk and symm). Moreover, I'm glad to see some performance improvement over the older v0.2.20.

I've noticed one suspicious difference between the supplied binary libopenblas.dll and my compiled version. The supplied binary depends on libgfortran-3.dll and works great with very old version of this lib dated 21.10.2014 (AFAIR I got it from some .zip from sourceforge's project page long ago). However, the custom compiled version depends on libgfortran-5.dll file, which I had to take (with all other necessary .dll dependencies) from debian's installation folder /lib/gcc/x86_64-w64-mingw32/8.3-win32.

Any ideas how to fix the issue?

Probably it worth trying to change the compiler to some older version, however, I'm not aware how to do it (I'm a foreigner in the Linux world). Could someone please explain it a little if the idea is worth trying?

Arech avatar Oct 30 '19 07:10 Arech

You are correct - one has to copy gfortran redist from cross-build system, that -3 and -5 is internal ABI version, they are not interchangeable.

_FPSCR flag could be planted into single-arch builds where it improves things a lot - thats essentially trading strict IEEE float conformance to significant performance on old to very old gear. ??? @martin-frbg ???

Does anything change if you use TARGET=OPTERON or OPTERON_SSE3 ? Other side of _FPCSR build flag? Setiing OMP_NUM_THREADS=1 and OPENBLAS_NUM_THREADS=1 before run?

Between mentioned releases of OpenBLAS ang GCC (You got 8.2.0?) - default gfortran ABI changed slightly, and fresh GCC got more picky about assembly registers.

brada4 avatar Oct 30 '19 08:10 brada4

You got 8.2.0?

No, it's 8.3

root@debian:~/my_dev/OpenBLAS-0.3.7# x86_64-w64-mingw32-gcc --version
x86_64-w64-mingw32-gcc (GCC) 8.3-win32 20190406
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

root@debian:~/my_dev/OpenBLAS-0.3.7# x86_64-w64-mingw32-gfortran --version
GNU Fortran (GCC) 8.3-win32 20190406
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

root@debian:~/my_dev/OpenBLAS-0.3.7# gcc --version
gcc (Debian 8.3.0-6) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

I'll try other things you've mentioned and post updates later.

Arech avatar Oct 30 '19 08:10 Arech

Setiing OMP_NUM_THREADS=1 and OPENBLAS_NUM_THREADS=1 before run?

No change

TARGET=OPTERON

No change

/* BTW, TARGET=OPTERON build do have

#define OPENBLAS_HAVE_3DNOW 
#define OPENBLAS_HAVE_3DNOWEX

defined in \include\openblas_config.h, while TARGET=BARCELONA doesn't, though it seems that these proccessors do have support for 3Now & 3DNowExt instructions sets. Shouldn't TARGET=BARCELONA have them defined too? */

TARGET=OPTERON_SSE3

No change

TARGET=BARCELONA CONSISTENT_FPCSR=0

Kind of no change. For the same input (rng is seeded with the same constant) outputs

0x000000000D5700C0         -nan(ind)        -nan(ind)        -nan(ind)        -nan(ind)  
0x000000000D5700D0         -nan(ind)        -nan(ind)        -nan(ind)        -nan(ind)  
0x000000000D5700E0         -nan(ind)        -nan(ind)        -nan(ind)        -nan(ind)  
...

or in bytes

0x000000000D5700C0  00 00 c0 ff 00 00 c0 ff 00 00 c0 ff 00 00 c0 ff  
0x000000000D5700D0  00 00 c0 ff 00 00 c0 ff 00 00 c0 ff 00 00 c0 ff  
0x000000000D5700E0  00 00 c0 ff 00 00 c0 ff 00 00 c0 ff 00 00 c0 ff  
...

Arech avatar Oct 30 '19 09:10 Arech

If you compile 0.2.20 with Debian 10 - does it work? Do you build numpy with same CC/FC options as OpenBLAS?

brada4 avatar Oct 30 '19 09:10 brada4

Do you build numpy with same CC/FC options as OpenBLAS?

I don't use neither numpy, nor python at all, it's purely C++ project. I mentioned the python script just to give an idea what kind of work gesvd() is doing in the project (python is easier to comprehend than C++ code I gave link to)

If you compile 0.2.20 with Debian 10 - does it work?

Have just downloaded the code, tried to compile: compilation breaks in the middle (and I don't know how to fix it - is it possible to make error messages more verbose? there's even no line number here...)

x86_64-w64-mingw32-gcc -O2 -DMS_ABI -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_G77 -DSMP_SERVER -DNO_WARMUP -DCONSISTENT_FPCSR -DMAX_CPU_NUMBER=6 -DASMNAME=cgemm_tr -DASMFNAME=cgemm_tr_ -DNAME=cgemm_tr_ -DCNAME=cgemm_tr -DCHAR_NAME=\"cgemm_tr_\" -DCHAR_CNAME=\"cgemm_tr\" -DNO_AFFINITY -I../.. -UDOUBLE  -DCOMPLEX  -c -UDOUBLE -DCOMPLEX -DTR gemm.c -o cgemm_tr.obj
x86_64-w64-mingw32-gcc -O2 -DMS_ABI -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_G77 -DSMP_SERVER -DNO_WARMUP -DCONSISTENT_FPCSR -DMAX_CPU_NUMBER=6 -DASMNAME=cgemm_cr -DASMFNAME=cgemm_cr_ -DNAME=cgemm_cr_ -DCNAME=cgemm_cr -DCHAR_NAME=\"cgemm_cr_\" -DCHAR_CNAME=\"cgemm_cr\" -DNO_AFFINITY -I../.. -UDOUBLE  -DCOMPLEX  -c -UDOUBLE -DCOMPLEX -DCR gemm.c -o cgemm_cr.obj
<command-line>: error: expected identifier or ‘(’ before numeric constant
x86_64-w64-mingw32-gcc -O2 -DMS_ABI -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_G77 -DSMP_SERVER -DNO_WARMUP -DCONSISTENT_FPCSR -DMAX_CPU_NUMBER=6 -DASMNAME=cgemm_rn -DASMFNAME=cgemm_rn_ -DNAME=cgemm_rn_ -DCNAME=cgemm_rn -DCHAR_NAME=\"cgemm_rn_\" -DCHAR_CNAME=\"cgemm_rn\" -DNO_AFFINITY -I../.. -UDOUBLE  -DCOMPLEX  -c -UDOUBLE -DCOMPLEX -DRN gemm.c -o cgemm_rn.obj
x86_64-w64-mingw32-gcc -O2 -DMS_ABI -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_G77 -DSMP_SERVER -DNO_WARMUP -DCONSISTENT_FPCSR -DMAX_CPU_NUMBER=6 -DASMNAME=cgemm_rt -DASMFNAME=cgemm_rt_ -DNAME=cgemm_rt_ -DCNAME=cgemm_rt -DCHAR_NAME=\"cgemm_rt_\" -DCHAR_CNAME=\"cgemm_rt\" -DNO_AFFINITY -I../.. -UDOUBLE  -DCOMPLEX  -c -UDOUBLE -DCOMPLEX -DRT gemm.c -o cgemm_rt.obj
x86_64-w64-mingw32-gcc -O2 -DMS_ABI -DMAX_STACK_ALLOC=2048 -Wall -m64 -DF_INTERFACE_G77 -DSMP_SERVER -DNO_WARMUP -DCONSISTENT_FPCSR -DMAX_CPU_NUMBER=6 -DASMNAME=cgemm_rc -DASMFNAME=cgemm_rc_ -DNAME=cgemm_rc_ -DCNAME=cgemm_rc -DCHAR_NAME=\"cgemm_rc_\" -DCHAR_CNAME=\"cgemm_rc\" -DNO_AFFINITY -I../.. -UDOUBLE  -DCOMPLEX  -c -UDOUBLE -DCOMPLEX -DRC gemm.c -o cgemm_rc.obj
make[1]: *** [Makefile:365: cgemm_cr.obj] Error 1
make[1]: *** Waiting for unfinished jobs....
make[1]: Leaving directory '/root/my_dev/OpenBLAS-0.2.20/driver/level3'
make: *** [Makefile:139: libs] Error 1
root@debian:~/my_dev/OpenBLAS-0.2.20#

Arech avatar Oct 30 '19 10:10 Arech

IIRC, the cgemm_cr problem with 0.2.20 was a "funny" quirk of mingw where it misinterpretes the CR define from the command line as a literal "carriage return" or something like that. Later versions have the define (and corresponding ifdef in the code) changed to "XCR" to work around this. No idea about the missing HAVE_3DNOW on Barcelona, could have been a simple oversight years ago but as far as I can tell it does not play a role in any of the BLAS kernels actually used by the BARCELONA target (except a prefetch instruction in the zgemm copy-out helper, but I doubt it would have much impact on performance). Most likely this reflects the transition from 3dnow to SSE instructions. Newer gcc/gfortran has an ABI issue where it expects the C code to pass length arguments even to single-character string arguments in calls to FORTRAN routines, but this should be taken care of in 0.3.7 by adding a compiler option (-fno-optimize-sibling-calls) automatically. The recent windows binaries were built with a version of the MXE cross-compiler environment that appears to be based on (mingw) gcc-5.5.0

martin-frbg avatar Oct 30 '19 10:10 martin-frbg

but this should be taken care of in 0.3.7 by adding a compiler option (-fno-optimize-sibling-calls) automatically.

Seems so. Tried to compile with СFLAGS="-fno-optimize-sibling-calls" CXXFLAGS="-fno-optimize-sibling-calls". It produced a .dll with the same size but some 6 bytes are different (2 WORDs in PE header and 1 WORD somewhere in the middle). No change...

MXE env looks intimidating :grin:

Arech avatar Oct 30 '19 11:10 Arech

0.3.7 integrates that fflag already, there will be no change. For what is important numpy build has to use that flag too.

brada4 avatar Oct 30 '19 12:10 brada4

IIRC, the cgemm_cr problem with 0.2.20 was a "funny" quirk of mingw where it misinterpretes the CR define from the command line as a literal "carriage return" or something like that. Later versions have the define (and corresponding ifdef in the code) changed to "XCR" to work around this.

v0.3.0 also fails at the same point. v0.3.1 compiled successfully, but no change for the issue when I run my code.

@brada4 Did you read my previous comment about python? Numpy is completely irrelevant to the issue.

And btw, just tried to compute on doubles instead of floats using my base v0.3.7. LAPACKE_dgesvd() works excellent, no problems here. I'd say, that the fact that LAPACKE_dgesvd() works as expected may hint, that the root cause of the issue may lie not only in compiler quirks, but in OpenBLAS's code too.

Arech avatar Oct 30 '19 12:10 Arech

oh... If I'd only knew ahead that it'd take almost two days to switch to a newer version... But finally I made it work. I've just managed to install mingw-w64 from previous debian release. Compiler from there have version 6.3.0 and links to the same -3 fortran ABI .dll... Now both functions, - single precision LAPACKE_sgesvd() and double precision LAPACKE_dgesvd(),- works as expected.

Arech avatar Oct 30 '19 14:10 Arech

It is unlikely fortran issue. It is the new gcc using unmarked registers through assembly code at least in v9 Before it tried to scan assembly section and detect what registers are used inside. EDIT Numpy SVD uses [sd]gesdd - does that fail just as blatantly as _gesvd ? It is just one function losing one register somewhere inside, so far kernels differing between BARCELONA and OPTERON are "cleared" , still like 80 left, too much to look through manually.

brada4 avatar Oct 30 '19 19:10 brada4

Have now learned that it is possible to rebuild MXE with a more recent gcc, will provide an updated windows package (or possibly two, with and without the FPCSR thingy) when I find the time. The fortran ABI issue is indeed mostly conjecture - there have been posts in other projects reporting mysterious hard to reproduce crashes that went away when a or the workaround was employed. Not sure I understand the comment about "unmarked registers", hopefully all the earlier cases of wrong constraints and writes to input-only registers have been addressed in recent releases.

martin-frbg avatar Oct 30 '19 21:10 martin-frbg

@martin-frbg

Have now learned that it is possible to rebuild MXE with a more recent gcc, will provide an updated windows package (or possibly two, with and without the FPCSR thingy) when I find the time.

That would be a good thing, if it's not very burdensomely for you.

Regarding the issue... I personally think that it's very important, that the issue is easily reproducible, because it makes much easier to find the root cause of the issue. If it's actually an issue in OpenBLAS - it could easily be fixed. If the compiler is to blame, - it would very beneficial for the whole society if a reproducible compiler bug report will be created.

Should I make a short isolated code to reproduce the issue exactly as I saw it, so some of you (who are sufficiently familiar with the OpenBLASs code) could debug the library and find the real cause?

Arech avatar Oct 31 '19 06:10 Arech

If it is not too much trouble for you, an isolated test code would be great - that way it could hopefully be established if this is a bug in OpenBLAS or in recent mingw ports of gcc.

martin-frbg avatar Oct 31 '19 07:10 martin-frbg

No problem, I'll post it soon

Arech avatar Oct 31 '19 07:10 Arech

@martin-frbg please take a look into rep https://github.com/Arech/sgesvd_tester

Note that actually in a conventional FP mode sgesvd() from a buggy binary is able to catch and return an error (however, it'll still produce NaNs in output). It will silently return success with a junk in output when FP rounding mode was set to "round towards zero". Proper binary works great in both modes.

Feel free to ping me if you need some more info/help.

Arech avatar Oct 31 '19 10:10 Arech

@Arech and how would reference BLAS and MKL react to your FPU compliance diversions?

brada4 avatar Oct 31 '19 14:10 brada4

@brada4 Andrew, I'd like you to answer me the following two questions first before I answer yours:

  1. What source link you have to support your claim about "compliance diversions"?
  2. Did you notice that the issue appears in absolutely standard FPU state using compiler v8.3 and doesn't appear for any other compiler?

Arech avatar Oct 31 '19 15:10 Arech

Your code changes FPU flags only on calling thread, others remain in default state, you need to change FPCSR code in blas_server_win32.c to propagate your setting to all threads, that is limitation of mingw32. Maybe then it works properly with strange rounding modes too, certainly they produce different results from standard conditions in all cases.

brada4 avatar Oct 31 '19 17:10 brada4

you need to change FPCSR code in blas_server_win32.c to propagate your setting to all threads, that is limitation of mingw32.

Makes me wonder if (the effect of) setting CONSISTENT_FPCSR=1 could/should be done automatically at compile time if defined(__MINGW32__) (Still have to reread #1237, but back then I was probably more respectful of what I assumed to be conscious coding decisions taken by earlier authors)

martin-frbg avatar Oct 31 '19 21:10 martin-frbg

MKL propagates user-set FPU config to all threads. I am puzzled why here even single-threaded case was failing. Slowness is a concern only on older cpus: https://www.cs.uaf.edu/~olawlor/papers/2005/denormal/lawlor_denormal_2005.pdf , the absent FPCSR (and respective other SSE register) propagation is pain on windows. What i found strange is repeater that does not match problem statement and obviously flawed RNG up there.

brada4 avatar Nov 01 '19 04:11 brada4

@brada4

Your code changes FPU flags only on calling thread, others remain in default state, you need to change FPCSR code in blas_server_win32.c to propagate your setting to all threads...

Now I think I got you idea. If I understand it correctly, there is a possibility that:

  • in case of non-standard FPU config (rounding to zero) sgesvd() fails to catch convergence error because of different FP handling in worker threads (and it has a right to do so), and that is the reason why it returns success while actually it's an error with all accompanying junk in output vars
  • in case of standard FPU config sgesvd() is able to catch an error, that is why it returns it. Different sgesvd() behaviour on different compiler versions (no error on 6.3, error on 8.3) may happen not due to some issues with OpenBLAS or with the new compiler, but because new compiler may legitimately reorder some math instructions (if compiled with non-strict -fmath switch /*it is so in reality?*/) and therefore code may produce slightly different result, and that is enough to generate a convergence error with single point precision while double is still ok.

Well... that is fair point. I was going to test it, but

I am puzzled why here even single-threaded case was failing.

Indeed. Setting env variables OMP_NUM_THREADS=1 and OPENBLAS_NUM_THREADS=1 changes nothing... I even pushed an update to the test that confirms the state of this variables and double checked in debugger that OpenBLAS is indeed single threaded.


Now, regarding non standart FPU state and if it's necessary to propagate to worker threads...

Unfortunately, I don't remember exactly, why I chose to use this setting in my main codebase. According to a comment, it seems that it was intended to prevent NaNs from occurring in vectorized form of ::std::exp() function (that is possibly a quirk of either the compiler or my hardware, because denormals were already disabled at that moment, but NaNs kept occurring from ::std::exp() anyway)..

Therefore as long as OpenBLAS functions does not produce NaNs (and it seems that setting CONSISTENT_FPCSR=1 is enough to achieve it) - there's totally no need to push that non-conventional FPU config to it at least for my task. It is actually my bug that I forgot to restore FPU state back to normal before calling OpenBLAS, and I'm going to fix it.

So, to reiterate, for me personally (don't know about other use-cases) there's no need for OpenBLAS to support non-conventional FPU config's as long as it doesn't produce denormal numbers.

What really bothers me is that even in totally standard FPU config sgesvd() always converges when it was compiled with v6.3 and always fails when it was compiled with v8.3.

Any ideas why it happens?

Arech avatar Nov 01 '19 07:11 Arech

This injustice is not fixable. Just like netlib blas here IEEE754 conformant FPU is expected, with all NaN signalling etc. For those caring less denormals can be disabled. You know rounding here or ther changes away from standard behaviour, and gives different results.

brada4 avatar Nov 01 '19 21:11 brada4

@brada4 What "this injustice" exactly you're talking about?

@martin-frbg Did you understand Andrew's point? Do you agree with him?

Arech avatar Nov 02 '19 10:11 Arech

Probably something lost in translation... @brada4 could you rephrase your comment ? But part of the problem would likely be that netlib LAPACK/LAPACKE is not yet fully prepared to deal with NaN, so perhaps deficiency or incapability was meant ?

martin-frbg avatar Nov 02 '19 10:11 martin-frbg

Things like -ffast-float were mentioned, thats much worse than having just denormals wiped away. Then example of random input - obvious increments. Then code programming FPU rounding modes.... Not sure which breaks LAPACK and which OpenBLAS....

There is shortcoming that FPU modes are not distributed pn each call to threads like MKL does.

brada4 avatar Nov 02 '19 16:11 brada4

Have same problem, can I somehow obtain the configuration/flags of the MXE environment that @martin-frbg uses?

Dr-Desty-Nova avatar Feb 20 '21 11:02 Dr-Desty-Nova

Same problem to which?

brada4 avatar Feb 20 '21 14:02 brada4

Same problem as in OP post. Custom compiled 0.3.7 produces weird results in _sgesvd(). Tried 0.3.12, same issue. MinGW version 9.3.

Flag set:

TARGET=NEHALEM
BINARY=64
USE_THREAD=0
USE_LOCKING=1
NUM_THREADS=200
HOSTCC=gcc 
CC=x86_64-w64-mingw32-gcc
FC=x86_64-w64-mingw32-gfortran

Dr-Desty-Nova avatar Feb 20 '21 14:02 Dr-Desty-Nova

My builds use NUM_THREADS=64 DYNAMIC_ARCH=1 DYNAMIC_OLDER=1 TARGET=CORE2 CONSISTENT_FPCSR=1 , of which probably the last is most relevant here. (Though I cannot exclude that the fairly old gcc 5.5 used by default in MXE may play a role as well)

martin-frbg avatar Feb 20 '21 17:02 martin-frbg