arpack-ng icon indicating copy to clipboard operation
arpack-ng copied to clipboard

dseupd_ randomly fails on mips64el arch

Open jgmbenoit opened this issue 5 years ago • 40 comments

Expected behavior

dseupd_ always succeeds on any arch

Actual behavior

dseupd_ randomly fails on mips64el arch

Where/how to reproduce the problem

  • arpack-ng: release 3.8.0
  • OS: debian Sid (20210202)
  • compiler: gcc
  • environment:
  • configure:
  • input data:

Steps to reproduce the problem

  1. compile bugreport-arpack-dseupd.c against the arpack library as distributed in Debain Sid on a mips64el computer.
  2. enter the command-line while (./bugreport-arpack-dseupd) ; do : ; done | nl and wait until it stops.

Error message

Traces

Callstack

Notes, remarks

The bug is reported in Debian #981646

jgmbenoit avatar Feb 03 '21 15:02 jgmbenoit

Patches welcome ! I am not planing to spend time on this, sorry!

sylvestre avatar Feb 03 '21 15:02 sylvestre

Do you have any hint ?

jgmbenoit avatar Feb 03 '21 16:02 jgmbenoit

Nope, sorry :(

sylvestre avatar Feb 03 '21 16:02 sylvestre

Can't afford spending time on that neither ?!... Did you try ISO_C_BINDING : cmake -DICB=ON and replace old fashioned dseupd_ (I've seen some in your bug report) by type compliant (insured by compiler) dseupd_c

fghoussen avatar Feb 04 '21 18:02 fghoussen

Thanks for your suggestion. What it would look like for the test code ? just replace dseupd_ with dseupd_c ?

jgmbenoit avatar Feb 04 '21 19:02 jgmbenoit

Thanks for your suggestion. What it would look like for the test code ? just replace dseupd_ with dseupd_c ?

Yes. With ICB enabled, make install will install arpack.h. Include arpack.h, and, replace any of the [sdcz][sn][ae]upd_ by the associated [sdcz][sn][ae]upd_c. dseupd_c is just a "C/F90 gateway" which calls the original dseupd_ behind the scene but with correct types insured by the compiler according to Fortran 2003 standard (an example described here : https://scc.ustc.edu.cn/zlsc/chinagrid/intel/compiler_f/main_for/GUID-C0FA4080-4548-4045-9E6D-EFE57A0DE1A3.htm) so the compiler version you use must support F2003 standard (or configure should fail).

fghoussen avatar Feb 05 '21 17:02 fghoussen

Thanks for the precisions. I transformed my test sample accordingly: bugreport-arpack-dseupd_c.c . Unfortunately the error persists.

jgmbenoit avatar Feb 08 '21 13:02 jgmbenoit

@jgmbenoit: howmny seems to be character*1 https://github.com/opencollab/arpack-ng/blob/872df83b9e4f493f4bf68457a96edbcafafd5e4c/SRC/dseupd.f#L54

Did you try that?

- char *all="All";
+ char *all="A";

fghoussen avatar Jul 09 '22 10:07 fghoussen

It shouldn't make a different if the buffer has extra characters though. dseupd can choose to read just a single one.

I do wonder about the appropriate calling convention for this function. Doesn't gfortran require passing the string lengths as additional parameters at the end?

szhorvat avatar Jul 09 '22 10:07 szhorvat

Doesn't gfortran require passing the string lengths as additional parameters at the end?

Don't think so

fghoussen avatar Jul 09 '22 11:07 fghoussen

Could you compile bugreport-arpack-dseupd_c.c in debug mode on mips64el and report here a backtrace? May be helpful to get some clue

fghoussen avatar Jul 09 '22 15:07 fghoussen

I am not so sure what I may read by ``in debug''. Can you specify the sequence of command to follow ?

jgmbenoit avatar Jul 10 '22 13:07 jgmbenoit

In bugreport-arpack-dseupd_c.c, why do you set double tol=+0x1p-53;? This is a weird format that I never seen before (not sure to understand it), but, it sounds like 1e-53. If so, why do you need such a strict tolerance? It is OK with just double tol=1.e-10;. Arpack is an iterative search/method so using very-very strict tolerance has no meaning (at least as I understand it: there is no way you get an exact result).

Are you sure you never hit ido == 2? (https://github.com/igraph/igraph/blob/89df4735b20737fdfcaa7457a594713ae9ac3e04/src/linalg/arpack.c#L989). In this case: https://github.com/opencollab/arpack-ng/blob/d280435c25a7665952475d3da6bcd6c05ee2e774/EXAMPLES/MATRIX_MARKET/arpackSolver.hpp#L797

not so sure what I may read by ``in debug'

compile with -g -O0 and report a backtrace if possible

fghoussen avatar Jul 10 '22 14:07 fghoussen

In bugreport-arpack-dseupd_c.c, why do you set double tol=+0x1p-53;? This is a weird format that I never seen before (not sure to understand it), but, it sounds like 1e-53.

That is 2^-53, not 10^-53. Probably related to double storing 53 digits.

szhorvat avatar Jul 10 '22 15:07 szhorvat

  1. Try to use a regular tolerance (1.e-10 or 1.e-12 - below this has no meaning and may even trigger numerical instabilities).
  2. Make sure you don't miss cases where ido == 2 in the **aupd-while loop, otherwise, iterations are not over and you may don't even get an eigen vector! :D (in case bugreport-arpack-dseupd_c.c has missed ido cases, this may explain crash in dseupd call - I mean the math behind the scene need well-enough conditioned matrices, etc).
  3. Trap unexpected ido to make sure you don't screw the RCI pattern https://github.com/opencollab/arpack-ng/blob/d280435c25a7665952475d3da6bcd6c05ee2e774/EXAMPLES/MATRIX_MARKET/arpackSolver.hpp#L802 As far as I remember, if you don't use implicit shifts, you never hit ido == 3 but otherwise you must handle this case too https://github.com/opencollab/arpack-ng/blob/d280435c25a7665952475d3da6bcd6c05ee2e774/SRC/dsaupd.f#L155 To me, the most simple solution is to always use exact shifts https://github.com/opencollab/arpack-ng/blob/d280435c25a7665952475d3da6bcd6c05ee2e774/EXAMPLES/MATRIX_MARKET/arpackSolver.hpp#L725
  4. Break the **aupd-while loop only if ido == 99 https://github.com/opencollab/arpack-ng/blob/d280435c25a7665952475d3da6bcd6c05ee2e774/EXAMPLES/MATRIX_MARKET/arpackSolver.hpp#L807

I had a hard time with arpack too... Hope this helps

fghoussen avatar Jul 10 '22 16:07 fghoussen

Including @ntamas here regarding the gfortran calling convention.

szhorvat avatar Jul 10 '22 18:07 szhorvat

@szhorvat Re the calling convention: I think it depends on whether you are using gfortran or not, and whether you are using sibling-call optimizations. More details in the R blog, start reading from the "Strings in Fortran and LAPACK" section.

ntamas avatar Jul 11 '22 10:07 ntamas

To me, looks more like an arpack miuse that an arpack bug. @jgmbenoit: if so, when OK at your side, don't forget to close this issue at our side. Thanks!

fghoussen avatar Jul 14 '22 10:07 fghoussen

I will a have a closer look this week-end.

jgmbenoit avatar Jul 14 '22 10:07 jgmbenoit

2. Make sure you don't miss cases where ido == 2 in the **aupd-while loop, otherwise, iterations are not over and you may don't even get an eigen vector! :D (in case bugreport-arpack-dseupd_c.c has missed ido cases, this may explain crash in dseupd call - I mean the math behind the scene need well-enough conditioned matrices, etc).

bmat was set to 'I' for standard problems and there is no B matrix. Doesn't that mean that there will never be an ido == 2 value? That's my understanding from reading the docs here.

szhorvat avatar Jul 14 '22 11:07 szhorvat

Doesn't that mean that there will never be an ido == 2 value?

Sounds likely, but, not sure: doc is not clear to me (when bmat = 'I' you have actually B = Identity so that ido == 2 is not a no-op: you replace Y with X before next iteration which re-use the modified Y... Right?). From here, your biggest problem looks like to be the tolerance: below tol = 1.e-12, tol has no meaning (arpack is all about float and double which max tol is 1.e-12: this means you can hope at best to have 12 significant digits [in debug mode, less in release mode], so that if you ask for example tol = 1.e-15 the 13th, 14th, 15th will anyway have no meaning... But will take more time to compute and may increase risks of numerical instabilities). So if you ask tol = 2^-53 ?!... You are risking even more!

May be wrong but that's my impression (had a hard time with arpack long time ago!)

fghoussen avatar Jul 14 '22 12:07 fghoussen

@szhorvat: BTW, you may want/need to make some statistics with your baseline data/use-cases.

Say you ask for tol = 1.e-12 and maxit = 100, but that, you observe on your baseline data/use-cases that you get poor accuracy (say "bad" eigen vectors/values only 1.e-6 accurate): here, you may want/need to overkill maxit (maxit = 200 or maxit = 1000). tol is not the only stopping criterion for arpack: arpack will stop if tol is achieved or maxit is achieved. From my experience, keeping tol under control (w.r.t the need) but increasing maxit is a better way to get "accurate" results (w.r.t the need).

Note: with tol = 2^-53, it's likely you always stop at it = maxit as there is no way to reach such a tol! Check out how your baseline use-cases run

fghoussen avatar Jul 14 '22 12:07 fghoussen

I believe the 2^-53 tolerance did not come from igraph (doesn't seem to be present in the source code). I check for IDO values within igraph now: if it's not a value that's handled by the code, it throws an error. The test suite did not reveal such a case.

szhorvat avatar Jul 14 '22 12:07 szhorvat

If you don't have ido == 2, make sure you use a regular tolerance (1.e-12), check maxit and the quality of the results (are your vectors eigen up-to the tolerance you pass to arpack?).

arpack-ng provides arpackmm (https://github.com/opencollab/arpack-ng/tree/master/EXAMPLES/MATRIX_MARKET - cmake -DICBEXMM=ON ..) and pyarpack(https://github.com/opencollab/arpack-ng/tree/master/EXAMPLES/PYARPACK - cmake -DPYTHON3=ON .. based on Boost.python3) that may help you to check/compare results (if you have iparam[6] = 2/3, shifting and/or the kind of solver you use may have big impacts on results - this is easily checked/compared with arpackmm or pyarpack).

Hope this helps

fghoussen avatar Jul 14 '22 16:07 fghoussen

BTW, you need to set at least nbCV = 2 * nbEV + 1, but, if you set nbCV > 2 * nbEV + 1 you may get better/faster CV: arpackmm / pyarpack let you play with these parameters too.

fghoussen avatar Jul 14 '22 20:07 fghoussen

BTW, you need to set at least nbCV = 2 * nbEV + 1, but, if you set nbCV > 2 * nbEV + 1 you may get better/faster CV

I think these are already satisfied in our code in igraph; igraph_arpack_rnsolve() is "just" a generic ARPACK wrapper, but it tries to set nbCV automatically if the caller specifies 0 there (which is always the case in all places where we use the solver from igraph). This is done by this function.

Anyhow, we've made some changes in igraph and now try to handle all cases of ido (even ido == 2); not sure if this will help with the original issue, but I think we should definitely test it before pursuing this further in arpack-ng. @jgmbenoit igraph 0.10 is just around the corner, so I was wondering what your preference would be:

  • wait until igraph 0.10 comes out and then test whether this bug still appears on mips64el, or
  • patch igraph 0.9.x according to https://github.com/igraph/igraph/commit/a17105beb3e02cedd5f12dc4b4390041ba28ec58 and see if it solves the problem

ntamas avatar Jul 15 '22 11:07 ntamas

@ntamas It appears that the last igraph versions built well on all architecture (so on mips64el arch) that Debian officially supports (except on the s390x architecture because of a bug of mine in plfit). So I guess you can go ahead with igraph version 0.10 . I may package it this weekend, and I may have a closer look on this particular issue alongside. This bug is quite old, so a lot of things has changed in Debian since then. In particular, the GCC has been upgraded.

jgmbenoit avatar Jul 15 '22 13:07 jgmbenoit

I went back to a mips64el arch box to check the current status of the issue. The check-code bugreport-arpack-dseupd_c.c is still relevant as it fails on the mips64el arch box but it works on my amd64 box. Now gcc is gcc version 11.3.0. I ran several times the tests (make -C _BUILD_SHARED check): the tests 174 (igraph_pagerank) and 340 (example::eigenvector_centrality) failed each time on the mips64el arch box but succeeded on my amd64 box. This was for igraph-0.9.9+ds-1 without extra patch.

jgmbenoit avatar Jul 17 '22 21:07 jgmbenoit

It appears that patch a17105beb3e cannot be applied to version0.9.9.

I guess that number of changes to getting too big. So I suggest to come back as soon as version 0.10 is out (and packaged).

jgmbenoit avatar Jul 17 '22 22:07 jgmbenoit

That is 2^-53, not 10^-53. Probably related to double storing 53 digits.

Exactly. I used this format to be sure that the code plays with exactly the same doubles. We are debugging a weird issue here.

jgmbenoit avatar Jul 17 '22 22:07 jgmbenoit