
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 6. NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.

WHEREISSHE opened this issue 2 years ago • 4 comments

Hi there. When I run Exec/RegTests/EB_FlamePastCylinder, the make process goes well, but something does not work properly when running ./PeleLM3d.gnu.MPI.ex inputs.3d-regt . The run aborts with:

"amrex::Abort::0::MLMG failed !!! SIGABRT See Backtrace.0 file for details MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD with errorcode 6.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them."

WHEREISSHE · May 22 '22 03:05

This indicates that one of the linear solvers did not manage to converge, probably because the tolerances are too tight. Could you increase the linear solver verbosity:

mac_proj.verbose  = 2
nodal_proj.verbose  = 2

and re-try. The solver most likely hangs slightly above the required tolerance. Once you've identified which solver is responsible for the problem, it is possible to relax the tolerance slightly.
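
If you'd rather not edit the inputs file, the same keys can usually be appended to the run command (a quick sketch, relying on AMReX's usual behavior of letting command-line arguments override inputs-file parameters):

./PeleLM3d.gnu.MPI.ex inputs.3d-regt mac_proj.verbose=2 nodal_proj.verbose=2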

esclapez · May 23 '22 04:05

Thank you! I followed your instructions, but it still did not work, now with the message: MLMG: Failed to converge after 100 iterations. resid, resid/bnorm = 3.084014821e-09, 1.680522099e-12 amrex::Abort::0::MLMG failed !!! SIGABRT. Should I tune other parameters? More specifically, how can I find suitable parameters to adjust?

WHEREISSHE · May 23 '22 07:05

It seemed to work properly when I increased the tolerance to 1.0e-8, but I still have no idea whether this value is suitable. Actually, I am wondering how to choose good values for the tolerance and verbosity. Thanks.

WHEREISSHE · May 23 '22 08:05

So, if you keep the verbosity at 2, the standard output will get significantly longer, but you will be able to keep track of the linear solvers' behavior. When it comes to tolerances, the one you mostly want to control is the relative one:

mac_proj.rtol = 1e-10
nodal_proj.rtol = 1e-10

In my experience, needing to relax it above 1e-9 might indicate that something is wrong in the setup, unless you have added multiple levels and have very fine grids. From the message you pasted above,

MLMG: Failed to converge after 100 iterations. resid, resid/bnorm = 3.084014821e-09, 1.680522099e-12

the relative residual hung around ~1e-12, so going to 1e-10 should relax the constraint enough for the solver to move forward.
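
For concreteness, the relevant lines in inputs.3d-regt would then look something like this (a sketch of the relaxed settings discussed above, keeping the verbosity on so you can confirm the solvers now converge; the exact values are a judgment call for your setup):

mac_proj.verbose    = 2
nodal_proj.verbose  = 2
mac_proj.rtol       = 1.0e-10
nodal_proj.rtol     = 1.0e-10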

esclapez · Jun 03 '22 18:06