dseupd_ randomly fails on mips64el arch
Expected behavior
dseupd_ always succeeds on any arch
Actual behavior
dseupd_ randomly fails on mips64el arch
Where/how to reproduce the problem
- arpack-ng: release 3.8.0
- OS: debian Sid (20210202)
- compiler: gcc
- environment:
- configure:
- input data:
Steps to reproduce the problem
- compile bugreport-arpack-dseupd.c against the arpack library as distributed in Debain Sid on a mips64el computer.
- enter the command-line
while (./bugreport-arpack-dseupd) ; do : ; done | nland wait until it stops.
Error message
Traces
Callstack
Notes, remarks
The bug is reported in Debian #981646
Patches welcome ! I am not planing to spend time on this, sorry!
Do you have any hint ?
Nope, sorry :(
Can't afford spending time on that neither ?!...
Did you try ISO_C_BINDING : cmake -DICB=ON and replace old fashioned dseupd_ (I've seen some in your bug report) by type compliant (insured by compiler) dseupd_c
Thanks for your suggestion. What it would look like for the test code ? just replace dseupd_ with dseupd_c ?
Thanks for your suggestion. What it would look like for the test code ? just replace dseupd_ with dseupd_c ?
Yes. With ICB enabled, make install will install arpack.h. Include arpack.h, and, replace any of the [sdcz][sn][ae]upd_ by the associated [sdcz][sn][ae]upd_c. dseupd_c is just a "C/F90 gateway" which calls the original dseupd_ behind the scene but with correct types insured by the compiler according to Fortran 2003 standard (an example described here : https://scc.ustc.edu.cn/zlsc/chinagrid/intel/compiler_f/main_for/GUID-C0FA4080-4548-4045-9E6D-EFE57A0DE1A3.htm) so the compiler version you use must support F2003 standard (or configure should fail).
Thanks for the precisions. I transformed my test sample accordingly: bugreport-arpack-dseupd_c.c . Unfortunately the error persists.
@jgmbenoit: howmny seems to be character*1
https://github.com/opencollab/arpack-ng/blob/872df83b9e4f493f4bf68457a96edbcafafd5e4c/SRC/dseupd.f#L54
Did you try that?
- char *all="All";
+ char *all="A";
It shouldn't make a different if the buffer has extra characters though. dseupd can choose to read just a single one.
I do wonder about the appropriate calling convention for this function. Doesn't gfortran require passing the string lengths as additional parameters at the end?
Doesn't gfortran require passing the string lengths as additional parameters at the end?
Don't think so
Could you compile bugreport-arpack-dseupd_c.c in debug mode on mips64el and report here a backtrace? May be helpful to get some clue
I am not so sure what I may read by ``in debug''. Can you specify the sequence of command to follow ?
In bugreport-arpack-dseupd_c.c, why do you set double tol=+0x1p-53;? This is a weird format that I never seen before (not sure to understand it), but, it sounds like 1e-53. If so, why do you need such a strict tolerance? It is OK with just double tol=1.e-10;. Arpack is an iterative search/method so using very-very strict tolerance has no meaning (at least as I understand it: there is no way you get an exact result).
Are you sure you never hit ido == 2? (https://github.com/igraph/igraph/blob/89df4735b20737fdfcaa7457a594713ae9ac3e04/src/linalg/arpack.c#L989). In this case: https://github.com/opencollab/arpack-ng/blob/d280435c25a7665952475d3da6bcd6c05ee2e774/EXAMPLES/MATRIX_MARKET/arpackSolver.hpp#L797
not so sure what I may read by ``in debug'
compile with -g -O0 and report a backtrace if possible
In bugreport-arpack-dseupd_c.c, why do you set
double tol=+0x1p-53;? This is a weird format that I never seen before (not sure to understand it), but, it sounds like1e-53.
That is 2^-53, not 10^-53. Probably related to double storing 53 digits.
- Try to use a regular tolerance (1.e-10 or 1.e-12 - below this has no meaning and may even trigger numerical instabilities).
- Make sure you don't miss cases where
ido == 2in the **aupd-while loop, otherwise, iterations are not over and you may don't even get an eigen vector! :D (in case bugreport-arpack-dseupd_c.c has missedidocases, this may explain crash in dseupd call - I mean the math behind the scene need well-enough conditioned matrices, etc). - Trap unexpected
idoto make sure you don't screw the RCI pattern https://github.com/opencollab/arpack-ng/blob/d280435c25a7665952475d3da6bcd6c05ee2e774/EXAMPLES/MATRIX_MARKET/arpackSolver.hpp#L802 As far as I remember, if you don't use implicit shifts, you never hitido == 3but otherwise you must handle this case too https://github.com/opencollab/arpack-ng/blob/d280435c25a7665952475d3da6bcd6c05ee2e774/SRC/dsaupd.f#L155 To me, the most simple solution is to always use exact shifts https://github.com/opencollab/arpack-ng/blob/d280435c25a7665952475d3da6bcd6c05ee2e774/EXAMPLES/MATRIX_MARKET/arpackSolver.hpp#L725 - Break the **aupd-while loop only if
ido == 99https://github.com/opencollab/arpack-ng/blob/d280435c25a7665952475d3da6bcd6c05ee2e774/EXAMPLES/MATRIX_MARKET/arpackSolver.hpp#L807
I had a hard time with arpack too... Hope this helps
Including @ntamas here regarding the gfortran calling convention.
@szhorvat Re the calling convention: I think it depends on whether you are using gfortran or not, and whether you are using sibling-call optimizations. More details in the R blog, start reading from the "Strings in Fortran and LAPACK" section.
To me, looks more like an arpack miuse that an arpack bug. @jgmbenoit: if so, when OK at your side, don't forget to close this issue at our side. Thanks!
I will a have a closer look this week-end.
2. Make sure you don't miss cases where
ido == 2in the **aupd-while loop, otherwise, iterations are not over and you may don't even get an eigen vector! :D (in case bugreport-arpack-dseupd_c.c has missedidocases, this may explain crash in dseupd call - I mean the math behind the scene need well-enough conditioned matrices, etc).
bmat was set to 'I' for standard problems and there is no B matrix. Doesn't that mean that there will never be an ido == 2 value? That's my understanding from reading the docs here.
Doesn't that mean that there will never be an ido == 2 value?
Sounds likely, but, not sure: doc is not clear to me (when bmat = 'I' you have actually B = Identity so that ido == 2 is not a no-op: you replace Y with X before next iteration which re-use the modified Y... Right?). From here, your biggest problem looks like to be the tolerance: below tol = 1.e-12, tol has no meaning (arpack is all about float and double which max tol is 1.e-12: this means you can hope at best to have 12 significant digits [in debug mode, less in release mode], so that if you ask for example tol = 1.e-15 the 13th, 14th, 15th will anyway have no meaning... But will take more time to compute and may increase risks of numerical instabilities). So if you ask tol = 2^-53 ?!... You are risking even more!
May be wrong but that's my impression (had a hard time with arpack long time ago!)
@szhorvat: BTW, you may want/need to make some statistics with your baseline data/use-cases.
Say you ask for tol = 1.e-12 and maxit = 100, but that, you observe on your baseline data/use-cases that you get poor accuracy (say "bad" eigen vectors/values only 1.e-6 accurate): here, you may want/need to overkill maxit (maxit = 200 or maxit = 1000). tol is not the only stopping criterion for arpack: arpack will stop if tol is achieved or maxit is achieved. From my experience, keeping tol under control (w.r.t the need) but increasing maxit is a better way to get "accurate" results (w.r.t the need).
Note: with tol = 2^-53, it's likely you always stop at it = maxit as there is no way to reach such a tol! Check out how your baseline use-cases run
I believe the 2^-53 tolerance did not come from igraph (doesn't seem to be present in the source code). I check for IDO values within igraph now: if it's not a value that's handled by the code, it throws an error. The test suite did not reveal such a case.
If you don't have ido == 2, make sure you use a regular tolerance (1.e-12), check maxit and the quality of the results (are your vectors eigen up-to the tolerance you pass to arpack?).
arpack-ng provides arpackmm (https://github.com/opencollab/arpack-ng/tree/master/EXAMPLES/MATRIX_MARKET - cmake -DICBEXMM=ON ..) and pyarpack(https://github.com/opencollab/arpack-ng/tree/master/EXAMPLES/PYARPACK - cmake -DPYTHON3=ON .. based on Boost.python3) that may help you to check/compare results (if you have iparam[6] = 2/3, shifting and/or the kind of solver you use may have big impacts on results - this is easily checked/compared with arpackmm or pyarpack).
Hope this helps
BTW, you need to set at least nbCV = 2 * nbEV + 1, but, if you set nbCV > 2 * nbEV + 1 you may get better/faster CV: arpackmm / pyarpack let you play with these parameters too.
BTW, you need to set at least
nbCV = 2 * nbEV + 1, but, if you setnbCV > 2 * nbEV + 1you may get better/faster CV
I think these are already satisfied in our code in igraph; igraph_arpack_rnsolve() is "just" a generic ARPACK wrapper, but it tries to set nbCV automatically if the caller specifies 0 there (which is always the case in all places where we use the solver from igraph). This is done by this function.
Anyhow, we've made some changes in igraph and now try to handle all cases of ido (even ido == 2); not sure if this will help with the original issue, but I think we should definitely test it before pursuing this further in arpack-ng. @jgmbenoit igraph 0.10 is just around the corner, so I was wondering what your preference would be:
- wait until igraph 0.10 comes out and then test whether this bug still appears on
mips64el, or - patch igraph 0.9.x according to https://github.com/igraph/igraph/commit/a17105beb3e02cedd5f12dc4b4390041ba28ec58 and see if it solves the problem
@ntamas It appears that the last igraph versions built well on all architecture (so on mips64el arch) that Debian officially supports (except on the s390x architecture because of a bug of mine in plfit). So I guess you can go ahead with igraph version 0.10 . I may package it this weekend, and I may have a closer look on this particular issue alongside. This bug is quite old, so a lot of things has changed in Debian since then. In particular, the GCC has been upgraded.
I went back to a mips64el arch box to check the current status of the issue. The check-code bugreport-arpack-dseupd_c.c is still relevant as it fails on the mips64el arch box but it works on my amd64 box. Now gcc is gcc version 11.3.0. I ran several times the tests (make -C _BUILD_SHARED check): the tests 174 (igraph_pagerank) and 340 (example::eigenvector_centrality) failed each time on the mips64el arch box but succeeded on my amd64 box. This was for igraph-0.9.9+ds-1 without extra patch.
It appears that patch a17105beb3e cannot be applied to version0.9.9.
I guess that number of changes to getting too big. So I suggest to come back as soon as version 0.10 is out (and packaged).
That is 2^-53, not 10^-53. Probably related to
doublestoring 53 digits.
Exactly. I used this format to be sure that the code plays with exactly the same doubles. We are debugging a weird issue here.