Error phase 3 merging
Hello,
When using ParMmg with very large meshes I get the following error:
-- PHASE 3 : MERGE MESHES OVER PROCESSORS
rcv_buffer
Exceeded max memory allowed: function: PMMG_gather_parmesh, file: /gpfs/projects/pr1eny00/PARMMG/ParMmg/src/mergemesh_pmmg.c, line: 1152
Fatal error in PMPI_Gatherv: Invalid buffer pointer, error stack:
PMPI_Gatherv(1001): MPI_Gatherv failed(sbuf=0x46b31fc8, scount=156390683, MPI_CHAR, rbuf=(nil), rcnts=0x1c89408, displs=0x212c808, MPI_CHAR, root=0, MPI_COMM_WORLD) failed
PMPI_Gatherv(887).: Null buffer pointer
Can anyone provide me with an explanation of why this happens?
Sincerely,
- PA M
Hello,
This error is rude but honest: the mesh is probably too large to be gathered on a single MPI process (that is, in the memory of one node only). This step is needed only when a centralized output is required (i.e. the whole mesh is saved in a single file). Some hints to solve the issue:
- Have you tried the -m parameter? It allows you to authorize ParMmg to really take all the memory of a computing node by setting a value in MB that is higher than (or equal to) the memory available on a node (by default, ParMmg tries to leave some memory free for other applications).
- If the previous step does not work, the mesh is really too big to be handled on a single computing node. You can try a different output format (example commands follow below):
  - parallel VTU, if you specify an output file (with -o) to which you give the .pvtu extension, or
  - one file per process in Medit format (.mesh extension), if you specify the -distributed-output option.
(We are also working on an HDF5 output for the future.)
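For example (a sketch only: the executable name, input mesh and memory value are placeholders, here assuming nodes with roughly 96 GB of RAM):

mpirun -n 48 parmmg_O3 input.mesh -m 96000 -o output.pvtu
mpirun -n 48 parmmg_O3 input.mesh -m 96000 -distributed-output -o output.mesh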
Hope this helps, Luca
Dear Luca,
Thanks a lot for your swift reply. Since yesterday I have tried out some of your solutions. First, I knew about the -m option, since I had posted a similar thread on the MMG forum a few months ago, but I was told MMG could not handle meshes larger than 100M elements (I will go up to 1B elements so...). When I discovered ParMmg a few weeks ago, I thought "wow, now in parallel there may be a way out of this". Unfortunately the -m option still does not change the outcome. I have also played around with my SLURM submission script and the available large-memory nodes on the MareNostrum cluster, which somehow increased the available memory for the rank 0 process, but it still crashes at the end of the day (see my script below). Maybe you know how to make more memory available to the first process; I have limited knowledge in HPC.
Concerning your point 2, I could save the solution as several .vtu files + a .pvtu, but I need to convert it back to .mesh or .msh v2 since I need to load it back into HIP (then AVBP, you may know about it). Moreover, I need to preserve the patch numbers, and so far I don't know whether specific patches are created at the interfaces between the pieces, nor whether I can glue them back together while staying consistent with the original patches. Since I am meshing a messy, very complex geometry, I don't want to set the patches back by hand; it would be too painful and I want a reproducible workflow. I have tried meshio but it does not support .pvtu. I have also tried Gmsh with the command gmsh name_*.vtu -o output.msh -save but the output does not contain the elements. Anyway, it would be nice if you knew a tool that preserves the original patches, merges the different .vtu files, etc.
Thanks a lot,
- PA M
#!/bin/bash
#SBATCH --job-name=parmmg        # job name
#SBATCH --ntasks=96              # total number of MPI processes
#SBATCH --ntasks-per-node=1      # number of MPI processes per node
#SBATCH --ntasks-per-core=1      # 1 MPI process per physical core (no hyperthreading)
#SBATCH --time=01:00:00          # maximum requested wall time (HH:MM:SS)
#SBATCH --output=%j.out          # output file name
#SBATCH --error=%j.out           # error file name (here shared with the output)
#SBATCH --constraint=highmem

ulimit -s unlimited

module load vtk

cat > ./job.conf << 'EOF'
0-47 /gpfs/projects/pr1eny00/PARMMG/ParMmg/build/bin/parmmg_O3 -hsiz 8e-5 -mmg-v 2 fluid_gm.mesh -m 1000000000
EOF

export I_MPI_SHM_LMT=shm

time srun --kill-on-bad-exit --mpi=pmi2 -m block --resv-ports -n 48 --cpu_bind=rank --multi-prog ./job.conf >& output.out
Hello,
I am not sure I understand what you mean by "patches": do you need to keep the id of the rank to which each element belongs, or do you need to keep the "material" id?
I have another question: how many nodes (or elements) should the merged mesh have?
It is possible that ParMmg and/or Mmg keeps a few arrays at a size larger than needed at the end of the process (it is planned to clean that up... but it is still not a priority compared to other tasks).
In this case, maybe you can try to save your mesh in a distributed output format (either .pvtu or distributed .mesh) and to merge it with ParMmg but without adaptation (the -noinsert -noswap -nomove -niter 0 command line arguments).
For example, if I use only the Medit file format, the command line for the adaptation would be the following:
mpirun -n 48 /gpfs/projects/pr1eny00/PARMMG/ParMmg/build/bin/parmmg_O3 -hsiz 8e-5 -mmg-v 2 fluid_gm.mesh -m 1000000000 -distributed-output -o fluid_gm.o
And the command line to merge the mesh:
mpirun -n 48 /gpfs/projects/pr1eny00/PARMMG/ParMmg/build/bin/parmmg_O3 -noinsert -noswap -nomove -niter 0 fluid_gm.o -m 1000000000 -centralized-output -o fluid_gm.o_centralized.mesh
I am really not sure that it will work, though...
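As a side note on the HIP/AVBP conversion: if the centralized merge succeeds, one possible route from the merged Medit file to MSH v2 could be meshio's Python API. This is only a sketch; the cell-data keys for the patch references ("medit:ref", "gmsh:physical") are my assumption about meshio's conventions, so please check on a small mesh that the patches survive:

# Sketch: convert the merged Medit mesh to MSH 2.2 (meshio assumed installed).
import meshio

m = meshio.read("fluid_gm.o_centralized.mesh")  # format inferred from the extension
# Carry the Medit patch references over as Gmsh physical tags
# (the cell-data key names below are my assumption):
if "medit:ref" in m.cell_data:
    refs = m.cell_data.pop("medit:ref")
    m.cell_data["gmsh:physical"] = refs
    m.cell_data["gmsh:geometrical"] = refs
meshio.write("fluid_gm_centralized.msh", m, file_format="gmsh22")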
Best Regards, Algiane
What I call a "patch" is a physical boundary (say, e.g., inlet, outlet, sides...). It appears that when I merge the meshes, the interface between the distributed meshes becomes a new (internal) patch itself, but I want to remove it. It should not exist in physical terms, and I have no idea what GMSH does with the connectivity at that point. So for now the mesh is not suitable for CFD. FYI, the mesh weighs 100M elements, so in theory it should fit in the memory of the machines I am using. I suppose something is going wrong with the memory management, maybe?
Also, I have tried your solution but it did not work (same error).
Hello,
Parallel nodes are replicated in the distributed format (i.e. each partition saves the nodes of its parallel interface), while each tetrahedron is saved only once (as only one partition stores it in its data), giving you an "open" mesh on the interfaces if you simply glue the parallel pieces together.
While it is easy to get rid of these duplicated nodes when collecting them through the API functions, it is currently cumbersome to do with output files outside of ParMmg.
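For illustration, here is roughly what a manual merge would involve (and why it is cumbersome): a Python sketch assuming meshio and hypothetical per-rank file names, which glues the .vtu pieces and collapses the duplicated interface nodes, but ignores the boundary triangles and the patch references entirely:

import glob
import numpy as np
import meshio

points_list, tet_list = [], []
offset = 0
for fname in sorted(glob.glob("fluid_gm.o.*.vtu")):  # hypothetical per-rank file names
    piece = meshio.read(fname)
    points_list.append(piece.points)
    for block in piece.cells:
        if block.type == "tetra":
            tet_list.append(block.data + offset)  # shift into a global numbering
    offset += len(piece.points)

points = np.vstack(points_list)
tets = np.vstack(tet_list)

# Interface nodes are stored once per adjacent partition: collapse coinciding
# points (rounding guards against floating-point noise in the coordinates).
keys = np.round(points, decimals=10)
_, first_idx, inverse = np.unique(keys, axis=0, return_index=True, return_inverse=True)
merged_points = points[first_idx]
merged_tets = inverse.reshape(-1)[tets]  # remap the tetrahedra connectivity

meshio.write("fluid_gm_merged.mesh", meshio.Mesh(merged_points, [("tetra", merged_tets)]))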
If you work with files, the best advice is to let ParMmg do the merging for you, as Algiane proposed, by calling it on the distributed *.mesh files with the -noinsert -noswap -nomove -niter 0 -centralized-output command line arguments. In this case ParMmg also uses the minimum possible amount of memory, so if the program still fails to merge the mesh with a high value of the -m parameter, then the mesh is probably really impossible to merge on just one computing node. How big is your output mesh?
Hope it helps, Luca
Hi,
The error is probably linked to a bug in the memory count of Mmg (I will try to send you a patch within the next 10 days).
By the way, I know that an in-house version of parallel mesh adaptation (that uses Mmg) has been developed at Cerfacs by Gabriel Staffelbach: maybe using this version could solve your initial issue with Mmg failing on too-large meshes?
Best Regards, Algiane
My mesh may be quite big: I plan to reach approximately 1B elements. I have been using the tool developed by Gabriel; my lab collaborates closely with CERFACS. Long story short, it is currently only supported on Jean-Zay, and strangely the tool generates weird errors on my mesh, so I cannot use it for now. That is exactly why I moved to ParMMG! ^^