
MPI errors

garth-wells opened this issue 4 years ago · 9 comments

I've been testing ParMmg and it looks promising. I managed to get it to run successfully for a small-ish problem using OpenMPI, but I haven't succeeded with MPICH for any cases, or with OpenMPI for larger cases.

With OpenMPI I get a lot of warnings like

Read -1, expected 162457973, errno = 1

and

## Error: PMMG_check_extEdgeComm: rank 23:
       2 different points (dist 1.292470e-26:0.000000e+00,-1.136868e-13,0.000000e+00) in the same position (51435) of the external communicator 23 3 (6 th item):
       - point : 4.472009e+03 -5.431501e+02 1.254939e+02

For a large problem with OpenMPI I get segfaults at "-- PHASE 3 : MERGE MESHES OVER PROCESSORS".

With MPICH I get a crash after a lot of "## Error: PMMG_check_extEdgeComm: rank 2:" messages:

Fatal error in PMPI_Allreduce: Other MPI error, error stack:
PMPI_Allreduce(450)...........: MPI_Allreduce(sbuf=0x7ffc50901a88, rbuf=0x7ffc50901a8c, count=1, datatype=MPI_INT, op=MPI_MAX, comm=MPI_COMM_WORLD) failed
PMPI_Allreduce(436)...........: 
MPIR_Allreduce_impl(293)......: 
MPIR_Allreduce_intra_auto(178): 
MPIR_Allreduce_intra_auto(84).: 

Are these known issues?

garth-wells (Feb 28 '21)

Hello, what versions of OpenMPI and MPICH are you using? That would help to reproduce the problem (could you also share a test case, if it is not confidential?).

  1. The OpenMPI warning is internal to the OpenMPI library and I don't know its meaning.
  2. The PMMG_check_extEdgeComm function checks that the parallel communicators are consistent: if it finds two different points at the same position, it normally means that something is wrong in the problem setup. In your case it looks like just a tolerance problem (I will have a look at this), and it is possible that ParMmg crashes only because the error is not handled properly.
  3. When called from the command line, ParMmg by default tries to merge the mesh on rank 0 and to save it to file. If the mesh is big, it can run out of memory and crash. You can ask for a distributed output in VTK by giving a .vtu extension to the output mesh name.

I will have a look at point 2. Yours, Luca

lcirrottola (Mar 01 '21)

> Hello, what versions of OpenMPI and MPICH are you using? That would help to reproduce the problem (could you also share a test case, if it is not confidential?).

OpenMPI v4.0.3, MPICH v3.3.2.

I'm checking if I can share the mesh file and I'll get back to you.

> 1. The OpenMPI warning is internal to the OpenMPI library and I don't know its meaning.
> 2. The PMMG_check_extEdgeComm function checks that the parallel communicators are consistent: if it finds two different points at the same position, it normally means that something is wrong in the problem setup. In your case it looks like just a tolerance problem (I will have a look at this), and it is possible that ParMmg crashes only because the error is not handled properly.
> 3. When called from the command line, ParMmg by default tries to merge the mesh on rank 0 and to save it to file. If the mesh is big, it can run out of memory and crash. You can ask for a distributed output in VTK by giving a .vtu extension to the output mesh name.

I tried using -out foo.vtu and I get:

## Error: Output format not yet implemented

My executable is linked to VTK (7).

> I will have a look at point 2. Yours, Luca

garth-wells (Mar 01 '21)

Just on the file format: having looked at the source code, it looks like it needs to be -out foo.pvtu.

garth-wells (Mar 01 '21)

Thanks! Yes, sorry, my bad: only .pvtu is supported, and ParMmg needs the VTK library to be specified when configuring with CMake.
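For reference, a minimal example of what the distributed-output invocation can look like (the executable name parmmg_O3, the process count, and the file names below are my assumptions, not taken from this thread; -out is the option already used above):

    mpirun -np 4 parmmg_O3 input.mesh -out output.pvtu

On the build side, VTK is usually found at configure time by pointing CMake at the VTK installation, for instance with the standard -DVTK_DIR=<path-to-VTK> variable used by find_package; the exact option exposed by ParMmg's CMake may differ.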

lcirrottola (Mar 01 '21)

@lcirrottola unfortunately I can't share the meshes that are giving trouble.

For complicated meshes I'm getting VTK crashes in the output phase (to .pvtu).

I've tried the https://github.com/MmgTools/ParMmg/tree/feature/analysis branch (not knowing whether it has relevant changes or not); with it I don't see the PMMG_check_extEdgeComm error and the VTK output works, but I do get a lot of warnings:

## Warning: PMMG_locatePoint_errorCheck (rank 2, grp 0): at least one exhaustive search for point 20575 (tag 0), coords 3.856296e+03 6.336268e+02 -6.163393e+02

The branch does, however, seem to be a lot slower. It would take time to quantify precisely.

garth-wells (Mar 01 '21)

OK, so I will try to reproduce the problem and come back to you, but it will be a bit harder.

https://github.com/MmgTools/ParMmg/tree/feature/analysis contains modifications for parallel remeshing of general surfaces (not supported for now) and it is not meant to be production-ready yet. Please stick to master for the stable releases, or to develop for the latest stable modifications.

lcirrottola (Mar 02 '21)

Hello, I haven't solved the other two issues yet, but problem 2 (## Error: PMMG_check_extEdgeComm) is related to this tolerance: https://github.com/MmgTools/ParMmg/blob/92663414bd77143a056bb25bd04e15f23d6a639c/src/coorcell_pmmg.h#L41

It is used to check that nodes on the two sides of a parallel interface are the same (by checking that their squared distance is lower than the tolerance), and it is possible that it is too strict depending on the precision of the input data. Are you already using data read from distributed files in double precision, or is the mesh partitioned by the program? If the former, we should definitely update this tolerance!
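To make the check concrete, here is a minimal sketch of that kind of comparison; the constant and function names are invented for illustration and this is not the actual ParMmg code:

    /* Hypothetical stand-in for the tolerance defined in src/coorcell_pmmg.h;
     * the real name and value may differ. */
    #define EPSCOOR2 1.0e-30

    /* Two copies of an interface node are considered the same point when
     * their squared distance is below the tolerance. */
    static int same_parallel_point(const double a[3], const double b[3]) {
      double dx = a[0] - b[0];
      double dy = a[1] - b[1];
      double dz = a[2] - b[2];
      return (dx*dx + dy*dy + dz*dz) < EPSCOOR2;
    }

Note that in the error reported above the squared distance is 1.292470e-26, i.e. the square of the -1.136868e-13 coordinate difference, which is roughly what a last-bit rounding difference produces on coordinates of order 1e2 to 1e3 in double precision; that is consistent with an absolute tolerance being too strict for this data.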

Yours, Luca

lcirrottola (Apr 22 '21)

Input file wasn't partitioned.

garth-wells (Apr 22 '21)

The latest develop should fix the coordinate check. Could you try it and see if the MPI errors are still there? (They could have been caused by improper error handling after the failed coordinate check.) Thanks, Luca
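For context on why a failed check can turn into the PMPI_Allreduce crash reported above: the error stack shows an MPI_Allreduce of a single MPI_INT with MPI_MAX, which is the usual pattern for agreeing on an error flag across ranks. The sketch below is not ParMmg code, only an illustration of that pattern; if one rank returns early after a failed check instead of reaching the collective, the remaining ranks fail or hang inside MPI_Allreduce.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);

      int local_err  = 0;  /* set to 1 if this rank's consistency check failed */
      int global_err = 0;

      /* ... local check goes here; every rank must reach the collective below,
       * otherwise the other ranks crash or hang inside MPI_Allreduce ... */

      MPI_Allreduce(&local_err, &global_err, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);

      if (global_err) {
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) fprintf(stderr, "consistency check failed on some rank\n");
      }

      MPI_Finalize();
      return global_err;
    }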

lcirrottola (Apr 27 '21)