fflas-ffpack

Parallel solve Modular<double> problem on retourdest with OpenBLAS

Open jgdumas opened this issue 5 years ago • 2 comments

On a Xeon(R) Gold 6126, with the current git versions and g++-7.4.0 or g++-8.3.0 (no problem with BLIS, though):

./test-solve
Checking .......................Modular modulo 1531 ... seq: PASSED (0.011941) par: FAILED (0.099488)

or sometimes:

./test-solve
Checking .......................Modular modulo 3307 ... seq: PASSED (0.0123701) par: BLAS : Bad memory unallocation! : 50 0x7fa3ae000000
BLAS : Bad memory unallocation! : 50 0x7fa3b8000000
BLAS : Bad memory unallocation! : 50 0x7fa3a8000000
BLAS : Bad memory unallocation! : 50 0x7fa39e000000
FAILED (0.124153)

or even:

./test-solve Checking ...................Modular modulo 24918701 ... seq: PASSED (0.079308) par: Error in Error in Error in -8.693182536e+15 <= Out[0, 0] = -8.693182536e+15 <= Out[96, 0] = -8.693182536e+15 <= Out[0, 0] = -1.090431442e+22 <= 24918700 Error in -4.856139493e+19 <= 24918700 Error in -2.066662931e+220-8.693182536e+15 <= Out[0, 0] = < = A[ <= 24918700 12, 12] =Error in -8.796068104e+12 <= 24918700 -7.439194272e+35 <= 24918700 Error in -8.693182536e+15 <= Out[4Error in -8.693182536e+15 <= Out[, 120] = , -8.773609098e+20-8.693182536e+15 <= 0Error in 0 < = A[9, 2 <= Out[48, 0] = -1.869075241e+49 <= ] =] = 24918700 Error in -8.693182536e+15 <= Out[48, 0] = -1.099486709e+1224918700 <= 24918700 9.261256994e+19 <= -1.56269226e+22 <= Error in 24918700 -8.693182536e+1524918700 <= Out[9, 0] = 2.733686785e+19 <= 24918700 Error in -8.693182536e+15 <= Out[0, 0] = -3.113550878e+21 <= 24918700 Error in -8.693182536e+15 <= Out[34, 0] = 1.458214448e+33 <= 24918700 Error in -8.693182536e+15 <= Out[20, 0] = 3.091138213e+16 <= 24918700 Error in 0 < = A[38, 40] =-8.796068104e+12 <= 24918700 Error in -8.693182536e+15 <= Out[24, 0] = -1.432320135e+22 <= 24918700 Error in Error in 0 < = A[146Error in Error in 0Error in 0 < = A[0 < = A[146, 12] =-8.796068104e+12 <= 24918700 Error in 0 < = A[146Error in Error in Error in 0 < = A[146Error in < = A[146, 12] =0 < = A[-8.693182536e+15 <= Out[146, 0] = , 12] =Error in 0 < = A[0, 1466.362464818e+19 <= 0 < = A[146, 12, -8.796068104e+12-8.796068104e+12 <= 146, , 12146, 24918700 24918700 <= < = A[146, 12] =-8.796068104e+12 <= Error in 24918700 12] =] =12] =24918700120 < = A[] =146, 1212] =Error in -8.796068104e+12 -8.796068104e+12 <= ] =-8.796068104e+12Error in 0-8.693182536e+15 <= Out[-8.796068104e+12 <= <= Error in <= -8.796068104e+12] =-8.796068104e+12 <= 24918700 < = A[146, 12] =24918700 -8.796068104e+12 <= 24918700 Error in 24918700 -8.796068104e+12 <= 24918700Error in 24918700 146, 0Error in 0-8.693182536e+15-8.693182536e+15 <= 
Out[-8.693182536e+15146, 0] = 24918700 < = A[168, 2Error in Error in -8.693182536e+15 <= Out[146, 0] = -8.693182536e+15 <= Out[146 8.882994316e+19 <= Out[146, 0] = ] = 1.278456275e+20 <= 24918700 ] =Error in 3.367993098e+19 <= 24918700 <= 249187004.212236121e+19 <= Error in -1.099486709e+12-8.693182536e+15 <= Out[146 24918700 <= <= Out[Error in <= 24918700 , 0] = 9.954809299e+19 <= 24918700 Error in -8.693182536e+15 <= Out[146, 0Error in 24918700 -8.693182536e+15 <= Out[146, 0] = 7.886610957e+19 <= 24918700 Error in -8.693182536e+15 <= Out[168Error in 0 < = A[168, 2] =-1.099486709e+12 <= 249187000 < = A[168Error in 0 < = A[, 0] = 4.251490206e+18 <= 24918700Error in 168, 2 Error in -8.693182536e+15 <= Out[0 < = A[168, 2] =-1.099486709e+12-8.693182536e+15 <= Out[146146, 0] = , 0] = 146, 0] =1.577838833e+20 <= <= 24918700 , 2] =-1.099486709e+12] = 9.908229452e+1924918700 <= -1.099486709e+12 <= 1.165877968e+19 ] = Error in Error in 24918700 Error in 24918700 -8.693182536e+15 <= Out[168, 00-8.693182536e+15 <= Out[168, Error in <= <= 24918700 < = A[168, 22.06352835e+20Error in <= 24918700 Error in -8.693182536e+15 <= Out[168, -8.693182536e+15, 0] =Error in 0 < = A[168, 2] =] = 249187000] = 6.780623378e+18 0 < = A[168, 2] = <= Out[-1.099486709e+12-1.099486709e+12 <= 168, 1.456861685e+20 <= <= 24918700 24918700 24918700 <= ] = 0] = 8.255781624e+18 <= 24918700 Error in Error in -1.099486709e+12 <= 24918700 -8.693182536e+15 <= Out[168, 0] = 1.877466594e+19 <= -8.693182536e+15 <= Out[8.734754229e+18 <= Error in Error in 0 < = A[168024918700 Error in Error in Error in 24918700 ] = 0 < = A[1680 < = A[168, 224918700] =0, 0-8.693182536e+15 <= Out[168, 01.229608547e+18 <= , 2] =] = -1.099486709e+1224918700 ] = -1.099486709e+12 <= 24918700 Error in -1.444060111e+22 <= 24918700 8.588775028e+18 <= 24918700 Error in -8.693182536e+15 <= Out[168, 0] = 2.360231114e+19 <= 24918700 , 2] =-8.693182536e+15 <= Out[168, 0] = 1.230140043e+19 <= 24918700 <= 24918700 Error in -8.693182536e+15 <= 
Out[168, 0] = 2.320036417e+19 <= 24918700 -1.099486709e+12 <= 24918700 Error in -8.693182536e+15 <= Out[168, 0] = 7.970668397e+18 <= 24918700 0 < = A[168, 2] =-1.099486709e+12 <= 24918700 Error in -8.693182536e+15 <= Out[168, 0] = 7.75984419e+18 <= 24918700 Error in -8.693182536e+15 <= Out[24, 0] = -2.114967749e+22 <= 24918700 Error in 0 < = A[19, 6] =-1.715495048e+10 <= 24918700 Error in -8.693182536e+15 <= Out[19, 0] = 7.792845232e+16 <= 24918700 Error in -8.693182536e+15 <= Out[72, 0] = -2.999946892e+20 <= 24918700 Error in 0 < = B[11, 24] =-1.014266402e+10 <= 24918700 Error in -8.693182536e+15 <= Out[0, 24] = 1.933749029e+17 <= 24918700 Error in 0 < = A[144, 272] =-1.715495048e+10 <= 24918700 Error in 0 < = B[291, 299] =-1.014266402e+10 <= 24918700 Error in 0 < = B[291, 299] =-1.014266402e+10 <= 24918700 Error in -8.693182536e+15 <= Out[0, 299] = 1.669277974e+17 <= 24918700 Error in -8.693182536e+15 <= Out[0, 299] = 2.151045413e+17 <= 24918700 Error in -8.693182536e+15 <= Out[96, 0] = -1.93637004e+21 <= 24918700 Error in -8.693182536e+15 <= Out[24, 0] = -2.641855819e+21 <= 24918700 Error in -8.693182536e+15 <= Out[24, 0] = -1.916416862e+21 <= 24918700 Error in -8.693182536e+15 <= Out[24, 0] = -2.514759448e+21 <= 24918700 FAILED (0.104459)

jgdumas, Jun 04 '19 07:06

I could reproduce your bug on the same machine when linking against your install of OpenBLAS. I then built a fresh OpenBLAS from the upstream develop branch, and I suspect that your install of OpenBLAS is causing the bug.

ClementPernet, Jun 11 '19 13:06

OK, I think the problem comes from the fact that I compiled OpenBLAS with 'USE_THREAD=0'.

jgdumas, Jun 11 '19 13:06
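
For anyone who wants to check whether an OpenBLAS install was built without internal threading, as with the 'USE_THREAD=0' build suspected above, the library can be queried at runtime. The following is only a minimal sketch and not part of fflas-ffpack: it assumes the shared library can be located under the name `openblas` and that it exports `openblas_get_parallel` from OpenBLAS's C interface. The helper names `describe_parallel_mode` and `query_openblas_parallel_mode` are hypothetical.

```python
import ctypes
import ctypes.util

# Values returned by openblas_get_parallel(), as defined in OpenBLAS's cblas.h.
PARALLEL_MODES = {
    0: "sequential (no internal threading, e.g. built with USE_THREAD=0)",
    1: "pthreads",
    2: "OpenMP",
}


def describe_parallel_mode(mode):
    """Hypothetical helper: turn the integer mode into a readable label."""
    return PARALLEL_MODES.get(mode, "unknown mode %d" % mode)


def query_openblas_parallel_mode(library_name="openblas"):
    """Hypothetical helper: return the threading mode of the OpenBLAS
    found under `library_name`, or None if it cannot be loaded or queried."""
    path = ctypes.util.find_library(library_name)
    if path is None:
        return None
    try:
        lib = ctypes.CDLL(path)
        get_parallel = lib.openblas_get_parallel
    except (OSError, AttributeError):
        # Library failed to load, or the symbol is missing or renamed.
        return None
    get_parallel.restype = ctypes.c_int
    get_parallel.argtypes = []
    return get_parallel()


if __name__ == "__main__":
    mode = query_openblas_parallel_mode()
    if mode is None:
        print("OpenBLAS not found under the assumed library name")
    else:
        print("OpenBLAS threading model:", describe_parallel_mode(mode))
```

Note that this inspects whichever OpenBLAS the system loader finds by default, which may differ from the one `test-solve` was actually linked against; in that case the path to the specific library would have to be passed to `ctypes.CDLL` directly.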