Error in the periodic boundary condition in the parallel case
Greetings to everyone!
I found a bug in the periodic boundary condition. The error occurs only in the parallel case.
Here is a simple test case that reproduces the error: https://gist.github.com/Alchem334/3c871adca89573d411c1da56988d84e2
The transient Laplace equation is solved on a 2D square mesh. The initial data are non-uniform: the function equals 1 in the lower-left corner of the domain and 0 everywhere else. A periodic boundary condition is imposed on the faces of the square x = 0 and x = 1.
On one core, a plausible result is obtained. On three cores, the violation of the periodic condition is clearly visible.
The error is caused by the following call to MPI_Neighbor_alltoallv:
https://github.com/jorgensd/dolfinx_mpc/blob/f7809a69699c4aaf2f602ed097a1ed4cf36e5968/cpp/utils.h#L235-L240
In the parallel case the vectors num_remote_slaves, remote_slave_disp_out, num_incoming_slaves, and slave_disp_in can be empty on some ranks. For an empty std::vector, data() may return a null pointer, and the MPI call then silently fails with error 13 (MPI_ERR_UNKNOWN).
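For reference, here is a minimal sketch (hypothetical and untested, not taken from dolfinx_mpc) of the pattern that seems to trigger the failure: a rank whose neighbourhood communicator has no neighbours passes empty vectors, so every pointer argument may be null:

```cpp
#include <mpi.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  // A neighbourhood communicator with zero in- and out-neighbours on this
  // rank, mirroring a rank that owns no slave dofs in dolfinx_mpc.
  MPI_Comm graph_comm;
  MPI_Dist_graph_create_adjacent(MPI_COMM_SELF, 0, nullptr, MPI_UNWEIGHTED,
                                 0, nullptr, MPI_UNWEIGHTED, MPI_INFO_NULL,
                                 0, &graph_comm);
  MPI_Comm_set_errhandler(graph_comm, MPI_ERRORS_RETURN);

  // All buffers are empty, so data() may legally return nullptr.
  std::vector<std::int32_t> send_buf, recv_buf;
  std::vector<int> send_counts, send_disp, recv_counts, recv_disp;

  // With zero neighbours this should be a no-op; the report above suggests
  // OpenMPI 3.1.3 instead returns error 13 (MPI_ERR_UNKNOWN) on the null
  // pointers.
  int err = MPI_Neighbor_alltoallv(
      send_buf.data(), send_counts.data(), send_disp.data(), MPI_INT32_T,
      recv_buf.data(), recv_counts.data(), recv_disp.data(), MPI_INT32_T,
      graph_comm);
  std::printf("MPI_Neighbor_alltoallv returned %d\n", err);

  MPI_Comm_free(&graph_comm);
  MPI_Finalize();
  return 0;
}
```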
Here is a workaround:

```cpp
// Copy the count/displacement arrays and pad empty ones with a single
// (unused) element so that data() never returns a null pointer.
std::vector<int> num_remote_slaves_new = num_remote_slaves;
std::vector<int> remote_slave_disp_out_new = remote_slave_disp_out;
std::vector<int> slave_disp_in_new = slave_disp_in;
std::vector<int> num_incoming_slaves_new = num_incoming_slaves;
if (num_remote_slaves_new.empty())
  num_remote_slaves_new.resize(1);
if (remote_slave_disp_out_new.empty())
  remote_slave_disp_out_new.resize(1);
if (slave_disp_in_new.empty())
  slave_disp_in_new.resize(1);
if (num_incoming_slaves_new.empty())
  num_incoming_slaves_new.resize(1);
MPI_Neighbor_alltoallv(
    num_masters_per_slave.data(), num_remote_slaves_new.data(),
    remote_slave_disp_out_new.data(), dolfinx::MPI::mpi_type<std::int32_t>(),
    recv_num_masters_per_slave.data(), num_incoming_slaves_new.data(),
    slave_disp_in_new.data(), dolfinx::MPI::mpi_type<std::int32_t>(),
    master_to_slave);
```
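The padding should be harmless: these vectors are only empty on ranks where the corresponding neighbourhood degree is zero, and MPI then reads none of the count or displacement entries, so the extra element is never touched. An alternative that avoids the copies, a sketch only and not tested against the dolfinx_mpc code base, would be a small helper (here named data_or_dummy, a hypothetical name) that substitutes a valid dummy pointer for empty vectors:

```cpp
// Hypothetical helper: return a pointer that is safe to pass to MPI even
// when the vector is empty. The dummy element is never read or written,
// because every count is zero in exactly the cases where the vector is
// empty.
template <typename T>
T* data_or_dummy(std::vector<T>& v)
{
  static T dummy{};
  return v.empty() ? &dummy : v.data();
}
```

The call site would then pass data_or_dummy(num_remote_slaves) and so on, instead of making padded copies.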
I'm not sure whether this kind of error occurs on every version of OpenMPI; I found the bug with OpenMPI 3.1.3.