fflas-ffpack
test-charpoly-check failing on sparc64
Hello!
I just packaged fflas-ffpack 2.2.2 for Debian, and it is failing to build on several architectures due to failing tests. (The following are pasted from the mips build log [1].)
FAIL: test-lu
=============
[...]
Checking ..............Modular<Integer> modulo 29 ... rank is wrong (expecting 11 but got 0)
rank is wrong (expected 11 but got 0)
failed at big lda
rank is wrong (expecting 20 but got 0)
rank is wrong (expected 20 but got 0)
failed at big lda max rank
failed at big lda, rank 0
rank is wrong (expecting 10 but got 0)
rank is wrong (expected 10 but got 0)
failed at square
rank is wrong (expecting 15 but got 0)
rank is wrong (expected 15 but got 0)
failed at wide
rank is wrong (expecting 7 but got 0)
rank is wrong (expected 7 but got 0)
failed at narrow
rank is wrong (expecting 11 but got 0)
rank is wrong (expected 11 but got 0)
failed at big lda
rank is wrong (expecting 20 but got 0)
rank is wrong (expected 20 but got 0)
failed at big lda max rank
failed at big lda, rank 0
rank is wrong (expecting 10 but got 0)
rank is wrong (expected 10 but got 0)
failed at square
rank is wrong (expecting 15 but got 0)
rank is wrong (expected 15 but got 0)
failed at wide
rank is wrong (expecting 7 but got 0)
rank is wrong (expected 7 but got 0)
failed at narrow
rank is wrong (expecting 11 but got 0)
rank is wrong (expected 11 but got 0)
failed at big lda
rank is wrong (expecting 20 but got 0)
rank is wrong (expected 20 but got 0)
failed at big lda max rank
failed at big lda, rank 0
rank is wrong (expecting 10 but got 0)
rank is wrong (expected 10 but got 0)
failed at square
rank is wrong (expecting 15 but got 0)
rank is wrong (expected 15 but got 0)
failed at wide
rank is wrong (expecting 7 but got 0)
rank is wrong (expected 7 but got 0)
failed at narrow
rank is wrong (expecting 11 but got 0)
rank is wrong (expected 11 but got 0)
failed at big lda
rank is wrong (expecting 20 but got 0)
rank is wrong (expected 20 but got 0)
failed at big lda max rank
failed at big lda, rank 0
rank is wrong (expecting 10 but got 0)
rank is wrong (expected 10 but got 0)
failed at square
rank is wrong (expecting 15 but got 0)
rank is wrong (expected 15 but got 0)
failed at wide
rank is wrong (expecting 7 but got 0)
rank is wrong (expected 7 but got 0)
failed at narrow
FAILED
FAIL: test-echelon
==================
Checking ...............Modular<double> mod 76667 .........PASSED
Checking ..................Modular<double> mod 23 .........PASSED
Checking ..................Modular<double> mod 89 .........PASSED
Checking ......ModularBalanced<double> mod 561181 .........PASSED
Checking ..........ModularBalanced<double> mod 31 .........PASSED
Checking ........ModularBalanced<double> mod 2503 .........PASSED
Checking ..................Modular<float> mod 223 .........PASSED
Checking ..................Modular<float> mod 151 .........PASSED
Checking ..................Modular<float> mod 359 .........PASSED
Checking .........ModularBalanced<float> mod 1283 .........PASSED
Checking ..........ModularBalanced<float> mod 421 .........PASSED
Checking .........ModularBalanced<float> mod 1259 .........PASSED
Checking .Modular<int32_t, uint32_t> modulo 11003 .........PASSED
Checking ....Modular<int32_t, uint32_t> modulo 29 .........PASSED
Checking .....Modular<int32_t, uint32_t> modulo 3 .........PASSED
Checking ......ModularBalanced<int32_t> mod 13499 .........PASSED
Checking .........ModularBalanced<int32_t> mod 73 .........PASSED
Checking .......ModularBalanced<int32_t> mod 6871 .........PASSED
Checking .....Modular<int64_t, int64_t> modulo 43 .........PASSED
Checking Modular<int64_t, int64_t> modulo 52689971 .........PASSED
Checking Modular<int64_t, int64_t> modulo 820673699 .........PASSED
Checking .......ModularBalanced<int64_t> mod 7433 .........PASSED
Checking .....ModularBalanced<int64_t> mod 359663 .........PASSED
Checking .....ModularBalanced<int64_t> mod 107137 .........PASSED
FAIL test-echelon (exit status: 139)
FAIL: test-rankprofiles
=======================
[...]
Checking Modular<Integer> modulo 272998032472030762247254850999851950143 ... FAILED
FAIL test-rankprofiles (exit status: 1)
FAIL: test-fgemm
================
[...]
Checking Modular<Integer> modulo 8549871607103756297543434634416548303828878605453302128157720522884613235851910725097929051316586481512924427759295421112023965613109239381708161612926577 ... FAIL
a :1, b : 0
m :15, n : 26, k : 26
ldA :32, ldB : 32, ldC : 30
Error C[0,0]=0 D[0,0]=467492732658410772275091190117627655542253651983001388578517645829495781150794733684597850114989973765636043935395580403988600206394352625839543034249789
Error C[0,1]=0 D[0,1]=3745850637687372787032133162786181752923976304361774184687299100090593606189640549339756280744799085424074197777123010114450581511671044936589874969831713
Error C[0,2]=0 D[0,2]=8286404725316917254688714949260944254011537941467078201617202765729213187054127467488830363130801792602756153724938361376488366188203128350292301369873926
Error C[0,3]=0 D[0,3]=7719842182036168601239979864756689654896968016098854378423193365280462943223639927166648047194689950133225952138330062822670878063601456021183476952591029
Error C[0,4]=0 D[0,4]=2024153580475532119587258559457193287346898570762148019971431860242110513024559260011015946501434441705820455529574060348920621050772638304443142900509912
Error C[0,5]=0 D[0,5]=20141380879160588302903723680879607892681391439496049917031962141045240109793641500262038502579839897910979174361960938528280325891868479883106216500846
Error C[0,6]=0 D[0,6]=4391623757211580394981142711365624473283723132004364468578042038987954220345931201010465272737045531053542896869796518680065036563453611907149726297278828
Error C[0,7]=0 D[0,7]=7145044132612825887529996517477527837989145729527347602699814003707490049064275886634152750968051932467580700126399656627669145655997899348019974398331288
Error C[0,8]=0 D[0,8]=1162056415306067048056344432602188553248849536319725726925267530516646030835101050918971464286388162765976112658166266823492179915873703334802593027996927
Error C[0,9]=0 D[0,9]=5391984716822375185454705906327368550846001455505002431488537190586214905645659123376661256794728494606528351826240719416122998491862837842906524825067266
Error C[0,10]=0 D[0,10]=2193532197288498574734861369684182498055535867358417603399364153343391698691030762288301969500710193967693347886190582912066842231271246096182892095605959
Error C[0,11]=0 D[0,11]=2624532911059920179180095706707244358148606542398416052415347939403828134468523920604669737503086596630490776845197767571758276278278191461009633193828347
Error C[0,12]=0 D[0,12]=3824058857929448047878016258319328821213902355230152570114023373874669070193964224840157989142795725361058539846319450404858289844120222578875049028092694
Error C[0,13]=0 D[0,13]=8488926214544649928946158289983257571986404239711143898331161061672932983731253748159571045407716898119928713472262377738179411236640573023199132821771987
Error C[0,14]=0 D[0,14]=8446230141577834220808492590599058489726893360083666083959660498179870853695198490739565290083853052922584335433857016447280120146953798248293970266650541
Error C[0,15]=0 D[0,15]=5458897513768857520249369908248619852903197296497582007240795265534584137713653353959244315841525059439028312283991843795889213682520912258742727419218179
Error C[0,16]=0 D[0,16]=6715744657584867924437252754908658736316361517364899772634395902309515686261743192031073767359337282039814832752313092806684013255350004845231712908981814
Error C[0,17]=0 D[0,17]=1180550868429558577387803899302922946040118199804321268305374976108423786550517468434261788762720736487189789785499144845987141527162494605060510016163013
Error C[0,18]=0 D[0,18]=5068758751300120035784900616768905386216966948934546078944345181003163460523368240846960084376983129885419806701841041251611064084608879583409778801011039
Error C[0,19]=0 D[0,19]=8516674032910371447473511292247098025582897666097326091212928806484548818841348645545236340701615335664105781567594799896950348911481513472413885443418184
XXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX
FAIL
The same tests fail on powerpc [2], s390x [3], and hppa [4].
On sparc64 [5], all of the above tests fail in addition to:
FAIL: test-pluq-check
=====================
terminate called after throwing an instance of 'FailureTrsmCheck'
FAIL test-pluq-check (exit status: 134)
On armel [6], only test-rankprofiles and test-fgemm fail.
Thank you!
[1] https://buildd.debian.org/status/fetch.php?pkg=fflas-ffpack&arch=mips&ver=2.2.2-1&stamp=1472626266
[2] https://buildd.debian.org/status/fetch.php?pkg=fflas-ffpack&arch=powerpc&ver=2.2.2-1&stamp=1472625237
[3] https://buildd.debian.org/status/fetch.php?pkg=fflas-ffpack&arch=s390x&ver=2.2.2-1&stamp=1472625205
[4] https://buildd.debian.org/status/fetch.php?pkg=fflas-ffpack&arch=hppa&ver=2.2.2-1&stamp=1472627635
[5] https://buildd.debian.org/status/fetch.php?pkg=fflas-ffpack&arch=sparc64&ver=2.2.2-1&stamp=1472625542
[6] https://buildd.debian.org/status/fetch.php?pkg=fflas-ffpack&arch=armel&ver=2.2.2-1&stamp=1472627148
From our Fedora builds - the test-lu enters an endless loop even
These test failures have now been reported as a "release critical" bug, i.e., fflas-ffpack won't be included in the next Debian release unless they're fixed.
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=840454
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=840455 (sparc64 test-pluq-check only)
In case of PowerPC, test-lu is stuck at
Program received signal SIGINT, Interrupt.
0x00003fffb6f8fb40 in .__floorf_power5plus () from /lib64/power8/libm.so.6
(gdb) bt
#0 0x00003fffb6f8fb40 in .__floorf_power5plus () from /lib64/power8/libm.so.6
#1 0x000000002002826c in std::floor (__x=
I ran the test again, but this time I compiled the package with debugging enabled. This is the output:
(gdb) bt
#0 0x00003fffb6c5f818 in .raise () from /lib64/power8/libc.so.6
#1 0x00003fffb6c61f64 in .abort () from /lib64/power8/libc.so.6
#2 0x00003fffb70f0b84 in .__gnu_cxx::__verbose_terminate_handler() ()
from /lib64/libstdc++.so.6
#3 0x00003fffb70ed484 in ?? () from /lib64/libstdc++.so.6
#4 0x00003fffb70ed528 in .std::terminate() () from /lib64/libstdc++.so.6
#5 0x00003fffb70ed94c in .__cxa_throw () from /lib64/libstdc++.so.6
#6 0x0000000010062d10 in FFLAS::CheckerImplem_fgemm<Givaro::Modular<float, float> >::check (C=0x101b6a84, ldb=130, B=0x101b687c, lda=130, A=0x101b6a80,
alpha=112, tb=FFLAS::FflasNoTrans, ta=
This is the error when running the tests without optimization on the s390x architecture:
g++ -DHAVE_CONFIG_H -I. -I.. -I.. -g -I../fflas-ffpack/ -I../fflas-ffpack/utils/ -I../fflas-ffpack/fflas/ -I../fflas-ffpack/ffpack -I../fflas-ffpack/field -Wdate-time -D_FORTIFY_SOURCE=2 -O0 -Wall -DNDEBUG -UFFLASFFPACK_DEBUG -std=gnu++11 -D__FFLASFFPACK_HAVE_CBLAS -fopenmp -g -fdebug-prefix-map=/home/thansen/fflas-ffpack-2.2.2=. -fstack-protector-strong -Wformat -Werror=format-security -fabi-version=6 -c -o test-lu.o test-lu.C
../fflas-ffpack/utils/bit_manipulation.h: Assembler messages:
../fflas-ffpack/utils/bit_manipulation.h:114: Error: Unrecognized opcode: `divq'
The PPC64 build is stuck because NaNs have somehow gotten into the matrix. I haven't tracked that part down yet, but execution is stuck in the loop in invext at /usr/include/givaro/modular-general.inl, lines 66 through 87, because computation with NaNs just yields more NaNs, so v3 never converges to zero.
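To illustrate why a NaN keeps such a loop spinning, here is a minimal sketch of a floating-point Euclidean remainder loop. It is not the givaro code: `euclid_steps`, its parameters, and the iteration cap `max_steps` are hypothetical names introduced only for this example. The point is that `v3 != 0.0` is always true when `v3` is NaN, so without a cap the loop would never exit.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical stand-in for a floating-point extended-Euclid style loop.
// Returns the number of iterations needed for v3 to reach zero,
// or -1 if max_steps iterations pass without v3 reaching zero.
int euclid_steps(double a, double p, int max_steps) {
    assert(max_steps > 0);
    double u3 = p;
    double v3 = a;
    int steps = 0;
    while (v3 != 0.0) {               // always true when v3 is NaN
        if (steps == max_steps) {
            return -1;                // cap reached: the loop did not converge
        }
        double q = std::floor(u3 / v3);
        double t3 = u3 - q * v3;      // NaN in, NaN out
        u3 = v3;
        v3 = t3;
        ++steps;
    }
    return steps;
}
```

With ordinary inputs such as `euclid_steps(7.0, 29.0, 1000)` the remainder reaches zero after a couple of steps. With a NaN input the function returns -1 once the cap is hit, which is the capped equivalent of the hang seen in gdb.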
The failing tests all seem to use Modular<Givaro::Integer> as the Field type. I tried seeding the random number generators with identical values on an x86_64 and a PPC64 machine, so I could use binary search to find where they start to differ. But that's not working because this code reseeds the random number generator with the current time in several places. I have found several, but apparently haven't tracked them all down yet. I'm happy to help the developers debug this issue, but could you please give me a list of every place where the random number generator is reseeded in both givaro and fflas-ffpack? I'd like to encourage you to stop doing this. It makes this kind of debugging impossible. Seed the generators once at the very beginning of the execution of a program and then leave them alone.
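As a rough sketch of the "seed once" pattern, assuming standard C++ `<random>` rather than the generators givaro and fflas-ffpack actually use: the generator is constructed once from a single seed and passed by reference to everything that needs random data. The helper name `random_entries` and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical helper: draws n residues below `modulus` from a generator
// owned by the caller. It never reseeds, so the whole run is reproducible
// from the one seed used to construct `gen`.
std::vector<std::uint64_t> random_entries(std::mt19937_64& gen,
                                          std::size_t n,
                                          std::uint64_t modulus) {
    assert(modulus > 0);
    std::vector<std::uint64_t> out(n);
    for (auto& x : out) {
        x = gen() % modulus;  // simple reduction; slightly biased, fine for a sketch
    }
    return out;
}
```

Because the output sequence of `std::mt19937_64` for a given seed is fixed by the C++ standard, two machines that construct the generator with the same seed and make the same calls get identical matrices, which is what makes a cross-architecture binary search for the first divergence possible.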
At least part of the problem appears to be that fflas-ffpack/field/rns-double.h, fflas-ffpack/field/rns-double.inl, and fflas-ffpack/field/rns-double-recint.inl access the limbs of an mpz_t 16 bits at a time. While the limbs of an mpz_t are in little endian order, the bytes in a limb are in host byte order. However, code in those 3 files appears to assume that the bytes are in little endian order. Look for uint16_t declarations in those files and examine how they are used. I tried throwing together a quick patch for the problem but, alas, I'm still seeing NaNs in the matrix, so either I did not fix the problem correctly or there is yet another problem somewhere.
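The byte-order pitfall can be sketched as follows. This is an illustration, not the code from the rns-double files or from any patch: it assumes a 64-bit limb (the real `mp_limb_t` width depends on the GMP build), and `chunk16` and `chunk16_memory` are hypothetical names. Extracting 16-bit pieces with shifts works on the value and gives the same answer on every host, while reading 16 bits at a memory offset depends on host byte order.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Portable: chunk 0 is the least significant 16 bits of the limb,
// regardless of how the host lays out bytes in memory.
inline std::uint16_t chunk16(std::uint64_t limb, unsigned i) {
    assert(i < 4);
    return static_cast<std::uint16_t>(limb >> (16 * i));
}

// Host-dependent: reads the i-th 16-bit slot as stored in memory.
// On a little-endian host slot 0 holds the low bits; on a big-endian
// host slot 0 holds the high bits.
inline std::uint16_t chunk16_memory(std::uint64_t limb, unsigned i) {
    assert(i < 4);
    std::uint16_t out;
    std::memcpy(&out, reinterpret_cast<const unsigned char*>(&limb) + 2 * i,
                sizeof out);
    return out;
}
```

For the limb `0x0123456789ABCDEF`, `chunk16(limb, 0)` is `0xCDEF` everywhere, whereas `chunk16_memory(limb, 0)` is `0xCDEF` on little-endian hosts and `0x0123` on big-endian ones. GMP's `mpz_export` also lets the caller state the word order and endianness explicitly, which may be another way to avoid the assumption.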
Actually, my quick patch DOES fix test-fgemm, but not test-lu. So I did something right. :-) Perhaps somebody else can see what I either did wrong or left out. The patch can be viewed here: http://jamezone.org/pleasure/software/fflas-ffpack-endian.patch
I can confirm that with the latest patch fflas-ffpack passes the test-suite on s390x - https://s390.koji.fedoraproject.org/koji/taskinfo?taskID=2438443
Great! And a scratch build for rawhide shows only the ppc64 task hanging: https://koji.fedoraproject.org/koji/taskinfo?taskID=17166024, so all the other test failures go away with this patch. We're still getting NaNs in test-lu with ppc64, though, and I don't know why. :-(
Pull request created: https://github.com/linbox-team/fflas-ffpack/pull/72
The problem with NaNs on ppc64 appears to be due to a bug in ATLAS. The same sources succeed with openblas. The fflas-ffpack code invokes cblas_sgemm with some very ordinary-looking matrices, and down inside the ATLAS code (specifically, ATL_USERMM in ppc64_base/src/blas/gemm/KERNEL/ATL_sNBmm_b0.c), the NaNs are generated. So I think the maintainers should have a look at the pull request, and that should be the end of this issue.
@jamesjer's patches fixed the build on the big endian architectures in Debian! [1]
We tried using -fno-strict-aliasing for armel as Fedora does, but test-lu is still failing. [2] (It worked for me on a local schroot, but failed on Debian's build machines.)
On sparc64, test-pluq-check is still failing. [3]
[1] https://buildd.debian.org/status/logs.php?pkg=fflas-ffpack&ver=2.2.2-3&suite=sid
[2] https://buildd.debian.org/status/fetch.php?pkg=fflas-ffpack&arch=armel&ver=2.2.2-4&stamp=1483919946
[3] https://buildd.debian.org/status/fetch.php?pkg=fflas-ffpack&arch=sparc64&ver=2.2.2-4&stamp=1483911466
Catching up with this thread. Thanks for catching this bug. I was clueless, as I did not have access to big endian archs. As you all seem to consider that PR #72 fixes it, and the code looks fine to me, I'm happy to merge it. Regarding the random seeding, sorry about this mess. We only recently started to have a systematic option to seed the generators, and have not yet cleaned up all the old pieces of code where the seed is taken from the time. Will do.
@d-torrance What blas implementation do you use on armel and sparc64? Also, I'm not certain that -fno-strict-aliasing actually does anything useful. GCC doesn't warn about any aliasing issues. It might be a fluke that I added that option just when some other factor made the armel build failures go away.
@jamesjer Right now all architectures are using the default Netlib BLAS which comes with LAPACK.
Updating with the still outstanding issues:
armel (build log) Failing tests:
- test-lu
- test-rankprofiles
- test-fgemm
sparc64 (build log) Failing tests:
- test-pluq-check
I've just packaged version 2.3.2 for Debian, and test-pluq-check is still failing on sparc64, along with test-invert-check and test-charpoly-check. Not sure about armel yet.
FAIL: test-pluq-check
=====================
terminate called after throwing an instance of 'FailureTrsmCheck'
FAIL test-pluq-check (exit status: 134)
FAIL: test-invert-check
=======================
-q 131071 -n 0 -i 0 -s 1515633011249149
terminate called after throwing an instance of 'FailureFgemmCheck'
m= 480
FAIL test-invert-check (exit status: 134)
FAIL: test-charpoly-check
=========================
CHARPol server PLUQ : 0.00215793s (0.002157 cpu) [1]
CHARPol client CHECK: 0.00063777s (0.000706 cpu) [4]
CHARPol checked full: 0.0128729s (0.785647 cpu) [1]
72x72 charpoly verification successful
CHARPol server PLUQ : 0.00340104s (0.008869 cpu) [1]
CHARPol client CHECK: 0.000571012s (0.08921 cpu) [4]
CHARPol checked full: 0.0189021s (0.780353 cpu) [1]
89x89 charpoly verification successful
CHARPol server PLUQ : 0.000622034s (0.003978 cpu) [1]
CHARPol client CHECK: 0.000258923s (0.000254 cpu) [4]
CHARPol checked full: 0.00458193s (0.168126 cpu) [1]
FAIL test-charpoly-check (exit status: 138)
Commit d8cd67d is likely to have fixed it.
I renamed the issue since there's only one test on one architecture still failing with the Debian package of version 2.4.3, test-charpoly-check on sparc64.
From https://buildd.debian.org/status/fetch.php?pkg=fflas-ffpack&arch=sparc64&ver=2.4.3-1&stamp=1593511031&raw=0:
libtool: link: g++ -O2 -Wall -g -I.. -g -O2 -fdebug-prefix-map=/<<PKGBUILDDIR>>=. -fstack-protector-strong -Wformat -Werror=format-security -fabi-version=6 -fabi-version=6 -fopenmp -Wl,-z -Wl,relro -o test-charpoly-check test-charpoly-check.o -fopenmp -lgivaro -lgmp -lgmpxx -lblas -llapack -fopenmp
../build-aux/test-driver: line 107: 318084 Bus error "$@" > $log_file 2>&1
FAIL: test-charpoly-check
FAIL: test-charpoly-check
=========================
CHARPol server PLUQ : 9.98974e-05s (0 cpu) [1]
CHARPol client CHECK: 0.000381947s (0 cpu) [4]
CHARPol checked full: 0.00181389s (0 cpu) [1]
10x10 charpoly verification successful
CHARPol server PLUQ : 0.00118399s (0 cpu) [1]
CHARPol client CHECK: 0.000505924s (0 cpu) [4]
CHARPol checked full: 0.018975s (0 cpu) [1]
74x74 charpoly verification successful
CHARPol server PLUQ : 0.000689983s (0 cpu) [1]
CHARPol client CHECK: 0.000377893s (0 cpu) [4]
CHARPol checked full: 0.0131581s (0 cpu) [1]
FAIL test-charpoly-check (exit status: 138)
I also got this test failure in a PPA build of the master branch on s390x in Ubuntu 18.04:
make[5]: Entering directory '/<<PKGBUILDDIR>>/tests'
../build-aux/test-driver: line 107: 15724 Aborted (core dumped) "$@" > $log_file 2>&1
FAIL: test-charpoly-check
...
FAIL: test-charpoly-check
=========================
CHARPol server PLUQ : 0.00052619s (0 cpu) [1]
CHARPol client CHECK: 0.000102282s (0 cpu) [4]
CHARPol checked full: 0.00742912s (0.001392 cpu) [1]
83x83 charpoly verification successful
CHARPol server PLUQ : 0.000433922s (0 cpu) [1]
terminate called after throwing an instance of 'FFPACK::CharpolyFailed'
FAIL test-charpoly-check (exit status: 134)
Just saw the same problem on x86_64 with version 2.5.0. Probably intermittent.
FAIL: test-charpoly-check
=========================
CHARPol server PLUQ : 7.79629e-05s (6.2e-05 cpu) [1]
CHARPol client CHECK: 0.000102997s (4.2e-05 cpu) [4]
CHARPol checked full: 0.00166917s (0.000712 cpu) [1]
28x28 charpoly verification successful
CHARPol server PLUQ : 0.000334024s (0.000334 cpu) [1]
terminate called after throwing an instance of 'FFPACK::CharpolyFailed'
FAIL test-charpoly-check (exit status: 134)