
Error in the periodic boundary condition in the parallel case

Open · Alchem334 opened this issue 2 years ago · 1 comment

Greetings to everyone!

I found a bug in the periodic boundary condition. The error occurs only in the parallel case.

Here is a simple test case that triggers the error: https://gist.github.com/Alchem334/3c871adca89573d411c1da56988d84e2

The transient Laplace equation is solved on a 2D square grid. Non-uniform initial data are prescribed: the function equals 1 in the lower-left corner and 0 in the rest of the domain. A periodic boundary condition is imposed between the faces x = 0 and x = 1 of the square.

On one core, a plausible result is obtained.

[Image: 1_core]

On three cores, it is noticeable that the periodic condition is not satisfied.

[Image: 3_cores]

The error originates in this call to MPI_Neighbor_alltoallv:

https://github.com/jorgensd/dolfinx_mpc/blob/f7809a69699c4aaf2f602ed097a1ed4cf36e5968/cpp/utils.h#L235-L240

In the parallel case, the vectors num_remote_slaves, remote_slave_disp_out, num_incoming_slaves, and slave_disp_in can be empty on some ranks. For an empty std::vector, data() may return a null pointer, and with the affected OpenMPI version the call then fails silently with error 13 (MPI_ERR_UNKNOWN).
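
For reference, here is a minimal standalone sketch of the failing pattern (my own construction, not code from dolfinx_mpc, and assuming the root cause is the null data() pointers of the empty vectors). It builds a distributed-graph communicator in which rank 0 has no destinations, so its send-side count and displacement vectors are empty, and then calls MPI_Neighbor_alltoallv the same way utils.h does:

  #include <cstdio>
  #include <vector>
  #include <mpi.h>

  int main(int argc, char** argv)
  {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Rank 0 receives one int from every other rank but sends nothing,
    // so all of its send-side vectors below are empty.
    std::vector<int> sources, destinations;
    if (rank == 0)
      for (int r = 1; r < size; ++r)
        sources.push_back(r);
    else
      destinations.push_back(0);

    MPI_Comm graph_comm;
    MPI_Dist_graph_create_adjacent(
        MPI_COMM_WORLD, static_cast<int>(sources.size()), sources.data(),
        MPI_UNWEIGHTED, static_cast<int>(destinations.size()),
        destinations.data(), MPI_UNWEIGHTED, MPI_INFO_NULL, 0, &graph_comm);
    // Return errors instead of aborting, so the error code is observable.
    MPI_Comm_set_errhandler(graph_comm, MPI_ERRORS_RETURN);

    std::vector<int> send_buf(destinations.size(), rank);
    std::vector<int> send_counts(destinations.size(), 1);
    std::vector<int> send_disp(destinations.size(), 0);
    std::vector<int> recv_buf(sources.size());
    std::vector<int> recv_counts(sources.size(), 1);
    std::vector<int> recv_disp(sources.size());
    for (std::size_t i = 1; i < recv_disp.size(); ++i)
      recv_disp[i] = recv_disp[i - 1] + recv_counts[i - 1];

    // On rank 0 the send-side .data() pointers are null; an affected
    // OpenMPI build should fail here with error 13 (MPI_ERR_UNKNOWN).
    int ierr = MPI_Neighbor_alltoallv(
        send_buf.data(), send_counts.data(), send_disp.data(), MPI_INT,
        recv_buf.data(), recv_counts.data(), recv_disp.data(), MPI_INT,
        graph_comm);
    std::printf("rank %d: ierr = %d\n", rank, ierr);

    MPI_Comm_free(&graph_comm);
    MPI_Finalize();
    return 0;
  }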

Here is a workaround:

  // Copy the count/displacement vectors and pad any empty one with a single
  // zero-initialized entry, so that .data() never returns a null pointer.
  std::vector<int> num_remote_slaves_new = num_remote_slaves;
  std::vector<int> remote_slave_disp_out_new = remote_slave_disp_out;
  std::vector<int> slave_disp_in_new = slave_disp_in;
  std::vector<int> num_incoming_slaves_new = num_incoming_slaves;

  if (num_remote_slaves_new.empty())
    num_remote_slaves_new.resize(1);
  if (remote_slave_disp_out_new.empty())
    remote_slave_disp_out_new.resize(1);
  if (slave_disp_in_new.empty())
    slave_disp_in_new.resize(1);
  if (num_incoming_slaves_new.empty())
    num_incoming_slaves_new.resize(1);

  MPI_Neighbor_alltoallv(
      num_masters_per_slave.data(), num_remote_slaves_new.data(),
      remote_slave_disp_out_new.data(), dolfinx::MPI::mpi_type<std::int32_t>(),
      recv_num_masters_per_slave.data(), num_incoming_slaves_new.data(),
      slave_disp_in_new.data(), dolfinx::MPI::mpi_type<std::int32_t>(),
      master_to_slave);
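
As a side note, the failure can be made visible rather than silent by switching the communicator to MPI_ERRORS_RETURN and checking the return code. This is just a debugging suggestion on my part, not something dolfinx_mpc currently does (variable names as in the snippet above):

  // Debugging aid: surface the MPI error instead of failing silently.
  MPI_Comm_set_errhandler(master_to_slave, MPI_ERRORS_RETURN);
  int ierr = MPI_Neighbor_alltoallv(
      num_masters_per_slave.data(), num_remote_slaves_new.data(),
      remote_slave_disp_out_new.data(), dolfinx::MPI::mpi_type<std::int32_t>(),
      recv_num_masters_per_slave.data(), num_incoming_slaves_new.data(),
      slave_disp_in_new.data(), dolfinx::MPI::mpi_type<std::int32_t>(),
      master_to_slave);
  if (ierr != MPI_SUCCESS)
  {
    char msg[MPI_MAX_ERROR_STRING];
    int len = 0;
    MPI_Error_string(ierr, msg, &len);
    std::fprintf(stderr, "MPI_Neighbor_alltoallv failed: %s\n", msg);
  }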

I'm not sure whether this error occurs with every version of OpenMPI; the bug was found with OpenMPI 3.1.3.

Alchem334 · Nov 04 '22 07:11