OpenBLAS icon indicating copy to clipboard operation
OpenBLAS copied to clipboard

Parallelized dgetrf doesn't detect matrix as singular

Open jmtilli opened this issue 1 year ago • 9 comments

I have the following matrix:

10,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,10,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,10,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,5.6353902507748659,0,0,-5.6353902507748659,0,0,0,0,0,0,0,
0,0,0,0,5.6353902507748659,0,0,-5.6353902507748659,0,0,0,0,0,0,
0,0,0,0,0,5.6353902507748659,0,0,-5.6353902507748659,0,0,0,0,0,
0,0,0,-5.6353902507748659,0,0,15.635391250774864,0,0,-10,-9.9999999999999995e-07,0,0,0,
0,0,0,0,-5.6353902507748659,0,0,25.635391250774866,0,-10,-10,-9.9999999999999995e-07,0,0,
0,0,0,0,0,-5.6353902507748659,0,0,25.635390250774865,-10,0,-10,0,0,
0,0,0,0,0,0,-10,-10,-10,30,0,0,0,0,
0,0,0,0,0,0,-9.9999999999999995e-07,-10,0,0,20.000001999999999,-10,-9.9999999999999995e-07,0,
0,0,0,0,0,0,0,-9.9999999999999995e-07,-10,0,-10,30.001000860019595,0,-10.000999860019597,
0,0,0,0,0,0,0,0,0,0,-9.9999999999999995e-07,0,5.635391250774866,-5.6353902507748659,
0,0,0,0,0,0,0,0,0,0,0,-10.000999860019597,-5.6353902507748659,15.636390110794462,

...which should be singular. At least getting its inverse in GNU Octave shows that it's singular to machine precision.

If I try to do LU decomposition for it using OpenBLAS:

#include <stdio.h>

#define lapack_int int
#define LAPACK_dgetrf dgetrf_
void LAPACK_dgetrf(const lapack_int*, const lapack_int*, double*, const lapack_int*, lapack_int*, lapack_int*);

double M[] = {
10,0,0,0,0,0,0,0,0,0,0,0,0,0,
0,10,0,0,0,0,0,0,0,0,0,0,0,0,
0,0,10,0,0,0,0,0,0,0,0,0,0,0,
0,0,0,5.6353902507748659,0,0,-5.6353902507748659,0,0,0,0,0,0,0,
0,0,0,0,5.6353902507748659,0,0,-5.6353902507748659,0,0,0,0,0,0,
0,0,0,0,0,5.6353902507748659,0,0,-5.6353902507748659,0,0,0,0,0,
0,0,0,-5.6353902507748659,0,0,15.635391250774864,0,0,-10,-9.9999999999999995e-07,0,0,0,
0,0,0,0,-5.6353902507748659,0,0,25.635391250774866,0,-10,-10,-9.9999999999999995e-07,0,0,
0,0,0,0,0,-5.6353902507748659,0,0,25.635390250774865,-10,0,-10,0,0,
0,0,0,0,0,0,-10,-10,-10,30,0,0,0,0,
0,0,0,0,0,0,-9.9999999999999995e-07,-10,0,0,20.000001999999999,-10,-9.9999999999999995e-07,0,
0,0,0,0,0,0,0,-9.9999999999999995e-07,-10,0,-10,30.001000860019595,0,-10.000999860019597,
0,0,0,0,0,0,0,0,0,0,-9.9999999999999995e-07,0,5.635391250774866,-5.6353902507748659,
0,0,0,0,0,0,0,0,0,0,0,-10.000999860019597,-5.6353902507748659,15.636390110794462,
};

int main(int argc, char **argv)
{
        const int n = 14;
        int ipiv[14];
        int info = 0;
        LAPACK_dgetrf(&n, &n, M, &n, ipiv, &info);
        printf("Info %d\n", info);
        return 0;
}

...it prints info 0. However, if I set the environment variable OPENBLAS_NUM_THREADS to 1, it prints 14.

So it incorrectly detects the matrix as decomposable if I run it with many threads, and only detects it correctly as singular if I run it with one thread. The separator seems to be 3 vs 4 threads: with 3 threads it prints info 14 but with 4 threads it prints info 0.

I have OpenBLAS 0.3.8. I understand that in newer versions of OpenBLAS, small matrices are always handled with only one thread so it improves performance, so testing the issue with a new OpenBLAS isn't even possible. However, since the calculation happens differently depending on the number of threads, I think there's an issue somewhere that should be tracked down. I would expect OpenBLAS to give identical results no matter how many threads are in use.

jmtilli avatar Feb 18 '24 08:02 jmtilli

I cannot immediately reproduce this with 0.3.26 (after disabling the code that enforces single-threading of course), what is your hardware please ?

martin-frbg avatar Feb 20 '24 22:02 martin-frbg

The hardware is a VirtualBox virtual machine running Ubuntu 20.04 LTS on top of Windows 11 host. CPU is Intel Core i7-10510U and hardware virtualization is being used.

jmtilli avatar Feb 21 '24 15:02 jmtilli

Thank you - that should be Haswell as far as OpenBLAS is concerned (unless your vbox setup hides the avx2 capability), I cannot reproduce the problem there although I cannot immediately point to a particular code change that would have fixed it

martin-frbg avatar Feb 21 '24 17:02 martin-frbg

Appears to be a loss of precision somewhere in the Sandybridge kernels...

martin-frbg avatar Feb 21 '24 18:02 martin-frbg

Here are the cpuinfo flags:

flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single fsgsbase avx2 invpcid rdseed clflushopt md_clear flush_l1d arch_capabilities

...so it doesn't hide the avx2.

jmtilli avatar Feb 21 '24 18:02 jmtilli

The old version may have used a different GEMM kernel, and GETRF behaviour depends on its unroll factor - which seems to explain the discrepancy I've found with Sandybridge. In one case, the workload gets split up again where it appears the problem comes in.

martin-frbg avatar Feb 22 '24 11:02 martin-frbg

Interestingly, the plain unoptimized Fortran code of the "netlib" Reference-LAPACK implementation does not recognize this particular matrix as singular either. And so far I have not come across any other (trivially) singular matrix that similarly slips through either codes. This is looking more and more like a weird corner case of machine precision (and/or compiler code generation) to me.

martin-frbg avatar Feb 29 '24 18:02 martin-frbg

Maybe this is more like an issue due to limited floating point precision rather than a real problem. I understand that in parallel versions, calculations can happen in different order and the usual mathematical rules of associativity for example do not necessarily apply. I encountered the matrix on a circuit nodal analysis problem that was incorrectly presented, and it should indeed be singular, since the problem presentation was incorrect and important elements were missing from the matrix.

jmtilli avatar Mar 01 '24 05:03 jmtilli

Probably yes - and it is interesting that even the simple single-threaded Fortran code of the Reference LAPACK does not treat this matrix as singular on my machine, so compiler version or the math libraries on your system may play a role here as well. Not sure if I want to calculate the determinant by hand, but I don't think I see any obviously interdependent rows or columns that would make it unequivocally singular. (Octave probably uses either OpenBLAS or the Reference BLAS/LAPACK so is no independent proof)

martin-frbg avatar Mar 01 '24 10:03 martin-frbg