
DART calls influencing each other [MPICH 3.2]

Open: stiefn opened this issue Nov 17 '17 • 17 comments

I have some problems with DART calls. Here is a minimal example in which the last DART call is non-blocking on Unit 1 but blocking on Unit 0, which makes the program never terminate:

#include <libdash.h>

int main(int argc, char* argv[]) {
  dash::init(&argc, &argv);

  std::vector<int> send1;
  std::vector<std::size_t> send1_count;
  std::vector<std::size_t> send1_displs;
  std::vector<int> recv1;
  std::vector<std::size_t> recv1_count;
  std::vector<std::size_t> recv1_displs;
  if(dash::myid() == 0) {
    send1.resize(5);
    send1_count = { 0, 20 };
    send1_displs = { 0, 0 };
    recv1 = { };
    recv1_count = { 0, 0 };
    recv1_displs = { 0, 0 };
  } else {
    send1 = { };
    send1_count = { 0, 0 };
    send1_displs = { 0, 0 };
    recv1.resize(5);
    recv1_count = { 20, 0 };
    recv1_displs = { 0, 20 };
  }
  dart_alltoallv(send1.data(), send1_count.data(), send1_displs.data(), 
      DART_TYPE_BYTE, recv1.data(), recv1_count.data(), recv1_displs.data(),
      dash::Team::All().dart_id());

  int send2 = 5;
  std::vector<std::size_t> recv2(2);
  dart_allgather(&send2, recv2.data(), sizeof(std::size_t), DART_TYPE_BYTE, 
      dash::Team::All().dart_id());

  std::vector<int> send3;
  std::vector<std::size_t> send3_count;
  std::vector<std::size_t> send3_displs;
  std::vector<int> recv3;
  std::vector<std::size_t> recv3_count;
  std::vector<std::size_t> recv3_displs;
  if(dash::myid() == 0) {
    send3 = { 2, 2, 4, 3, 4, 1, 2, 0 };
    send3_count = { 8, 0 };
    send3_displs = { 0, 8 };
    recv3.resize(9);
    recv3_count = { 8, 1 };
    recv3_displs = { 0, 8 };
  } else {
    send3 = { 6, 6, 7, 9, 5, 9, 6, 8 };
    send3_count = { 1, 7 };
    send3_displs = { 0, 1 };
    recv3.resize(7);
    recv3_count = { 0, 7 };
    recv3_displs = { 0, 0 };
  }
  dart_alltoallv(send3.data(), send3_count.data(), send3_displs.data(),
      DART_TYPE_INT, recv3.data(), recv3_count.data(), recv3_displs.data(), 
      dash::Team::All().dart_id());

  dash::finalize();
  return 0;
} 

Each of the calls runs perfectly fine on its own, but in combination I get this weird behaviour. Commenting out the second DART call also results in the following error for the third DART call:

Message from rank 1 to tag 10 truncated; 20 bytes received but buffer size is 4

This seems clearly wrong, as unit 1 should only be sending 4 bytes according to the code.

I am no MPI expert, so maybe my code is buggy, but it seems odd that each of the calls runs successfully on its own and only the combination creates these problems.

MPI version: MPICH 3.2

stiefn commented Nov 17 '17

Mhh, that code looks fishy in multiple places (for example: why do send1_count and send3_count have different values on both processes?). Have you tried doing the same thing in MPI without DART in between? (the collectives in DART are mostly just forwarding things to the MPI collectives)

I suggest you look at the MPI standard, and if you have a correct MPI example that fails when ported to DART, I'm happy to help :)
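For the first exchange that would be roughly the following (just a sketch for two processes; note that MPI_Alltoallv takes int counts and displacements, so the values are written as int here):

#include <mpi.h>
#include <vector>

// Rough MPI-only equivalent of the first exchange, for two processes:
// rank 0 sends 20 bytes (5 ints) to rank 1 and receives nothing.
int main(int argc, char* argv[]) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  std::vector<int> sendbuf, recvbuf;
  std::vector<int> sendcounts(2, 0), senddispls(2, 0);
  std::vector<int> recvcounts(2, 0), recvdispls(2, 0);
  if (rank == 0) {
    sendbuf.resize(5);        // 20 bytes to send to rank 1
    sendcounts = { 0, 20 };
  } else {
    recvbuf.resize(5);        // 20 bytes to receive from rank 0
    recvcounts = { 20, 0 };
  }

  MPI_Alltoallv(sendbuf.data(), sendcounts.data(), senddispls.data(), MPI_BYTE,
      recvbuf.data(), recvcounts.data(), recvdispls.data(), MPI_BYTE,
      MPI_COMM_WORLD);

  MPI_Finalize();
  return 0;
}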

devreal commented Nov 17 '17

I have actually just created a minimal example from the values my real code generates. send1_count and send3_count have different values because the processes have to exchange different amounts of data (hence the alltoall*v*). I will try doing the same thing in MPI, but if I get the same results, can I assume my MPI version is broken?

stiefn commented Nov 17 '17

I apologize, it was late last night. You're right, for ~~allgatherv~~ alltoallv it makes sense. However, I cannot find the implementation of dart_alltoallv. Is that implemented in a private branch?

Also, are you running on a 32-bit or a 64-bit machine? In the case of the latter, the allgather is illegal because sizeof(int) != sizeof(std::size_t). This might not be the reason why the processes hang, though.

devreal commented Nov 18 '17

I just checked the commit log and it seems I have implemented both dart_alltoall and dart_alltoallv myself on my branch. That is a bit embarrassing, as I don't even remember it at all... Anyway, I just had a quick look at the code and found a possible memory leak in dart_alltoallv, but I don't think it is related to my problem at all. Couldn't find any other problems right now, but will check it in more depth later.

Do you think you could look at it too? It's located in the feat-graph branch.

stiefn commented Nov 18 '17

As for the alltoallv being illegal: The processes exchange ints and only the counts and displacements are std::size_t. dart_alltoallv converts those size_t arrays to int arrays internally, so that should not be a problem. I think when I created the code, I took dart_allgatherv as an example.
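Roughly, the conversion just copies the values into int arrays before handing them to MPI; something along these lines (counts_to_int here is only an illustration, not the exact code):

#include <stddef.h>
#include <stdlib.h>

/* Illustration only: copy the size_t counts (or displacements) into a
 * freshly allocated int array for MPI. Assumes every value fits into an int. */
static int* counts_to_int(const size_t* counts, size_t nunits) {
  int* icounts = (int*) malloc(nunits * sizeof(int));
  for (size_t i = 0; i < nunits; ++i) {
    icounts[i] = (int) counts[i];
  }
  return icounts;  /* caller frees it after the MPI call */
}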

stiefn commented Nov 18 '17

I meant this line:

  int send2 = 5;
  std::vector<std::size_t> recv2(2);
  dart_allgather(&send2, recv2.data(), sizeof(std::size_t), DART_TYPE_BYTE, 
      dash::Team::All().dart_id());

send2 is an int (4 bytes) but you are treating it as a size_t (potentially 8 bytes).
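A consistent version would send and receive the same element size, e.g.:

std::size_t send2 = 5;              // now really sizeof(std::size_t) bytes
std::vector<std::size_t> recv2(2);  // one element per unit
dart_allgather(&send2, recv2.data(), sizeof(std::size_t), DART_TYPE_BYTE,
    dash::Team::All().dart_id());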

Anyway, I looked at the implementation of dart_alltoallv and couldn't find anything immediately suspicious. I get an error about truncated messages from Open MPI 3.0.0, even without the allgather in between. I converted your example to MPI and the error does not manifest there. It might be worth checking the values that come out to see whether the alltoallv calls behave as expected.

devreal commented Nov 18 '17

Ah, sorry, yes. That was a mistake. The original code handles this correctly though.

stiefn commented Nov 18 '17

The values the first dart_alltoallv call returns should be correct. I just added some random data to the examples to make sure and they are indeed correct.

stiefn commented Nov 18 '17

if (sendbuf == recvbuf || NULL == sendbuf) {
  sendbuf = MPI_IN_PLACE;
}

Seems like these lines are causing the problem. However, I don't get how this affects the second dart_alltoallv call...

stiefn commented Nov 21 '17

Because the first dart_alltoallv call is in-place on unit 1 but not on unit 0, the call blocks on unit 0 and then couples with the second dart_alltoallv call on unit 1.

What is the reason for the NULL == sendbuf check? Can I safely remove it?

stiefn commented Nov 21 '17

MPI 3.1 standard, p. 169:

The “in place” option for intracommunicators is specified by passing MPI_IN_PLACE to the argument sendbuf at all processes.

This is a sensible optimization and we should keep that option. Unfortunately, the developer who added this neat feature forgot to document it. Patches are welcome ;) (please open a pull request if you want to add the documentation). A debug message that in-place is being used might be helpful as well.

In your case, I suspect that the empty std::vector returns NULL in the call to data(), which then triggers the use of MPI_IN_PLACE, just not on all processes. Things go south from there for MPI... Good catch!

devreal commented Nov 21 '17

Thanks! My understanding is that in_place is used when sendbuf == recvbuf, so I think it should be safe to just remove the NULL == sendbuf part. This should probably be done for dart_allgatherv in the main branch as well.
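i.e. only this would remain:

/* use in-place only if the caller explicitly passes the same buffer */
if (sendbuf == recvbuf) {
  sendbuf = MPI_IN_PLACE;
}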

stiefn commented Nov 21 '17

Please check whether anyone actually calls collectives with NULL == sendbuf before you remove this feature. I'm fine with changing the convention to sendbuf == recvbuf triggering the use of MPI_IN_PLACE. It just has to be documented :)

devreal commented Nov 21 '17

I am actually not sure how to do that, because whether it is used depends on the data that is passed to the functions, and that depends on the user.

For *v* collectives, in_place only makes sense if both sendbuf AND recvbuf are NULL, and that is already covered by the sendbuf == recvbuf condition.

stiefn commented Nov 21 '17

For *v* collectives, in_place only makes sense if both sendbuf AND recvbuf are NULL, and that is already covered by the sendbuf == recvbuf condition.

Again, the MPI standard states:

The “in place” option for intracommunicators is specified by passing MPI_IN_PLACE to the argument sendbuf at all processes. In such a case, sendcount and sendtype are ignored. The data to be sent is taken from the recvbuf and replaced by the received data. Data sent and received must have the same type map as specified by recvcount and recvtype.

So, even for v variants it makes sense to use it. I think if the user provides sendbuf == recvbuf it is clear what his intention is: send data from sendbuf and replace it with the received data. It just has to be made clear in the DART interface that if the user wants to trigger this behavior he has to do it on all units. Passing the same buffer for sendbuf and recvbuf without replacing sendbuf with MPI_IN_PLACE will likely lead to wrong results.

devreal commented Nov 21 '17

Yes, in_place also makes sense here. Just not the sendbuf == NULL part, which is why I don't think that checking whether it is used somewhere is actually needed.

How would you go about making things clear in the interface? Isn't it enough to state it in the doc comments of the function?

stiefn commented Nov 23 '17

Yes, in_place also makes sense here. Just not the sendbuf == NULL part, which is why I don't think that checking whether it is used somewhere is actually needed.

You're right. And passing sendbuf==NULL can be perfectly reasonable if there is nothing to send.

How would you go about making things clear in the interface? Isn't it enough to state it in the doc comments of the function?

Yes, it should be added to the documentation. Something along the lines of

The user may pass the same buffer for \c sendbuf and \c recvbuf on all units, in which case data will be sent from and received into \c recvbuf. The contents of \c sendcnt and \c senddispls will be ignored in that case.

That's for dart_alltoallv; it should be adapted for the other relevant collectives.
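From the caller's side the in-place case would then look roughly like this (buf, recv_count etc. are just placeholder names; every unit has to pass the same buffer for sendbuf and recvbuf):

// In-place sketch: buf already holds this unit's outgoing data and will be
// overwritten with the received data; send counts/displacements are ignored.
dart_alltoallv(buf.data(), send_count.data(), send_displs.data(), DART_TYPE_INT,
    buf.data(), recv_count.data(), recv_displs.data(),
    dash::Team::All().dart_id());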

devreal commented Nov 24 '17