
Code crashes loading partitioned mesh with certain numbers of mesh partitions

Open zhangchonglin opened this issue 5 months ago • 13 comments

While loading a gmsh mesh file with a partitioned mesh:

  • with certain numbers of mesh partitions, the mesh loading crashes.
  • with other numbers of mesh partitions, the mesh loading works fine.
  • this is confirmed with the Comet code loading several different mesh files and partitions: some work fine, some crash.
  • it is also confirmed when loading the same mesh and partition file using the ptn_loading unit test: the crash behavior is the same as above.

I will include a test mesh file separately.

Stack trace from the core dump file using ptn_loading, which is very similar to the stack trace generated with the Comet code:

#0  0x000014beee11f6c8 in PMPI_Irecv () from /opt/cray/pe/lib64/libmpi_gnu_123.so.12
#1  0x000014bef05c511a in MPI_Irecv (buf=0x4271e544, count=592923, datatype=-1946157051, source=<optimized out>, tag=<optimized out>, comm=<optimized out>, request=<optimized out>) at darshan-apmpi.c:842
#2  0x0000000000881fdc in pumipic::ParticleBalancer::ParticleBalancer(pumipic::Mesh&) ()
#3  0x000000000083a77e in pumipic::Mesh::constructPICPart(Omega_h::Mesh&, std::shared_ptr<Omega_h::Comm>, Omega_h::Read<int>, Omega_h::Write<int>, Omega_h::Write<int>, bool) ()
#4  0x000000000083cb1c in pumipic::Mesh::Mesh(Omega_h::Mesh&, Omega_h::Read<int>, int, int) ()
#5  0x00000000004279f5 in main ()

Job submission script on Polaris using 4 mesh partitions:

mpiexec -n ${NTOTRANKS} --ppn ${NRANKS_PER_NODE} --depth=${NDEPTH} --cpu-bind depth \
--env OMP_NUM_THREADS=${NTHREADS} -env OMP_PLACES=threads ./set_affinity_gpu_polaris.sh \
./ptn_loading 2d_cylinder.msh 2d_cylinder_4.ptn 1 3   

zhangchonglin avatar Sep 25 '25 16:09 zhangchonglin

The test mesh is in this repository, which also includes the Polaris run scripts:

  • with 4 mesh partitions, the unit test was running fine;
  • with 8 mesh partitions, the unit test crashed.

zhangchonglin avatar Sep 25 '25 19:09 zhangchonglin

Hello, thank you for reporting this. Is this issue observed only on Polaris, or have you encountered it in other environments as well?

Sichao25 avatar Sep 29 '25 18:09 Sichao25

@Sichao25: Hi Sichao, this is only observed on Polaris due to the large mesh size. I tried to reproduce this issue on my CentOS workstation, but the GPU memory is too small to hold such a large mesh.

zhangchonglin avatar Sep 29 '25 19:09 zhangchonglin

@Sichao25: Quick update: I rebuilt PUMIPic for CPU on a CentOS desktop computer and was able to run the test case above fine using either 4 or 8 mesh partitions. So it seems this issue only happens on GPU.

Update: I tried several different meshes and numbers of partitions; all the partitions that were previously crashing on the Polaris GPU work fine on CentOS with the CPU build.

zhangchonglin avatar Oct 01 '25 14:10 zhangchonglin

I tried this on SCOREC and haven't been able to reproduce the error with CUDA so far. This is likely specific to the environment.

Sichao25 avatar Oct 01 '25 17:10 Sichao25

@Sichao25:

  • So with both 4 and 8 mesh partitions, the case ran fine, correct?
  • If yes, this may suggest the issue is specific to Polaris rather than CPU vs. GPU, since Polaris uses a different MPI, cray-mpich/8.1.32, whereas on CentOS I am using mpich 3.4.3.

zhangchonglin avatar Oct 01 '25 18:10 zhangchonglin

Yes, both cases run without error on SCOREC with ptn_loading.

Sichao25 avatar Oct 01 '25 18:10 Sichao25

I reproduced the bug on Polaris. The issue appears to be related to MPI_Irecv failing when using a strided datatype created by MPI_Type_vector(core_nents, 1, nbuffers, MPI_INT, &bufferStride). Using non-strided data can avoid the problem, but this workaround may affect performance given the mesh size.

Since I did not encounter the same issue in other environments, it may be specific to the Cray-MPICH on Polaris. Alternative MPI implementations might resolve the issue, but I have not tested them yet.

Sichao25 avatar Oct 17 '25 22:10 Sichao25

@Sichao25: thank you for the update Sichao!

  • Could you elaborate on the issue a little, referencing the specific code section?
  • When you say that non-strided data can avoid the problem, could you push the workaround to the PUMIPic repository so I can test its performance impact?
  • And finally, do you think this issue is a bug in Cray-MPICH? Or is it simply some missing environment variables that we should set on Polaris when using Cray-MPICH? In either case, it would be beneficial to raise this issue with ALCF and ask for their help.

What do you guys think? @cwsmith @jacobmerson

zhangchonglin avatar Oct 19 '25 21:10 zhangchonglin

For the record, @onkarsahni suggested looking at cray-mpich docs and variables controlling buffer sizes:

This seems old but check Slide/Page 10: http://www.archer.ac.uk/training/courses/craytools/pdf/mpi-variables.pdf

cwsmith avatar Oct 20 '25 17:10 cwsmith

@cwsmith, @Sichao25, @jacobmerson: copying email response here for better record:

  • I tried adjusting the environment variable from page 10, MPICH_GNI_NUM_BUFS, to several larger values, and the code still crashed.
  • Regarding the strided datatype in MPI_Irecv that Sichao mentioned: do you see whether we are still missing something with the above environment variables, or whether some other environment variables need to be set?

zhangchonglin avatar Oct 20 '25 17:10 zhangchonglin

@Sichao25: thank you for the workaround:

  • with your new branch, the partitioned mesh creation works fine.
  • with the workaround, there is some performance drop in mesh initialization (including PICparts creation, etc.); since this is done only once at the beginning of the simulation, it may be acceptable.
  • the results below are for the 2D cylinder mesh with 16 mesh partitions and a total of ~4.74 million triangle elements.
  • I think it makes sense to find out whether this is a cray-mpich bug or a missing environment variable issue.

non-strided data type

initialization mesh    8.947135    8.946040    8.946931

strided data type

initialization mesh    8.158348    8.157090    8.158041

zhangchonglin avatar Oct 20 '25 21:10 zhangchonglin

Thanks for the feedback; the performance drop is expected. Ideally, we should make MPI_Irecv work with MPI_Type_vector on Polaris. We could probably ask the ALCF team for suggestions, since no similar issues have been observed on other machines.

Sichao25 avatar Oct 20 '25 21:10 Sichao25