Coll/base: FIX bruck when sdtype and rdtype are different
Coll/base/coll_base_alltoall.c: ompi_coll_base_alltoall_intra_bruck FIX: switch tmpbuf from the sdtype to the rdtype representation, to avoid an illegal use of sdtype on rbuf and wrong behaviour when sdtype and rdtype are different.
OPTIM: use ompi_datatype_create_indexed_block instead of ompi_datatype_create_indexed to reduce the number of temporary buffers used (the separate block-length array is no longer needed).
Signed-off-by: Florent Germain [email protected]
Can one of the admins verify this patch?
ok to test
@FlorentGermain-Bull Thanks for the patch! Can you please share a test program that demonstrates the issue? I will then add it to our test suite.
There is a bug in the modified bruck alltoall algorithm in the coll/base module. It is algorithm 3 when using tuned dynamic rules, and tuned chooses it by default on some configurations.
Here are the symptoms when sdtype and rdtype have different type maps:
- The result is wrong.
- Some data not covered by the described receive buffer can be altered during the collective communication.
Here is a reproducer that triggers this bug:
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

    /* For each destination i, the send buffer holds [my_rank, -my_rank, i]. */
    int my_values[size*3];
    for(int i = 0; i < size; i++)
    {
        my_values[3*i]   = my_rank;
        my_values[3*i+1] = -my_rank;
        my_values[3*i+2] = i;
    }

    if (my_rank == 1) {
        for(int i = 0; i < size; i++) {
            printf("Process %d, SEND = [%d %d %d].\n", my_rank, my_values[3*i], my_values[3*i+1], my_values[3*i+2]);
        }
    }

    int buffer_recv[size*2];

    /* Send type: 2 blocks of 1 int with stride 2, i.e. it picks my_rank and i
     * and skips the -my_rank slot in between. */
    MPI_Datatype vector;
    MPI_Type_vector(2, 1, 2, MPI_INT, &vector);
    MPI_Type_commit(&vector);

    /* Receive type: 2 contiguous ints per peer. */
    MPI_Alltoall(my_values, 1, vector, buffer_recv, 2, MPI_INT, MPI_COMM_WORLD);

    if (my_rank == 1) {
        for(int i = 0; i < size; i++) {
            printf("Process %d, RECV = [%d %d].\n", my_rank, buffer_recv[2*i], buffer_recv[2*i+1]);
        }
    }

    MPI_Type_free(&vector);
    MPI_Finalize();
    return EXIT_SUCCESS;
}
Per destination, the send buffer holds [sender, -sender, receiver] in memory and the send type is a vector of 2 blocks of 1 int with stride 2, so only [sender, receiver] is transmitted (type map {(MPI_INT, 0), (MPI_INT, 8)}, extent 12 bytes for 4-byte ints). The receive side is 2 contiguous ints per peer, [sender, receiver] (type map {(MPI_INT, 0), (MPI_INT, 4)}, extent 8 bytes). Both describe two ints, but the type maps differ.
Before the fix:
$ for i in 1 2 3 4
do
echo -e "\nAlgo $i"
OMPI_MCA_coll_tuned_use_dynamic_rules=true OMPI_MCA_coll_tuned_alltoall_algorithm=$i srun -N2 -n4 -prome7402 ./a.out
done
Algo 1
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].
Process 1, RECV = [1 1].
Process 1, RECV = [2 1].
Process 1, RECV = [3 1].
Algo 2
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].
Process 1, RECV = [1 1].
Process 1, RECV = [2 1].
Process 1, RECV = [3 1].
Algo 3
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [1 0].
Process 1, RECV = [1 0].
Process 1, RECV = [3 32765].
Process 1, RECV = [0 1].
Algo 4
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].
Process 1, RECV = [1 1].
Process 1, RECV = [2 1].
Process 1, RECV = [3 1].
After the fix:
$ for i in 1 2 3 4
do
echo -e "\nAlgo $i"
OMPI_MCA_coll_tuned_use_dynamic_rules=true OMPI_MCA_coll_tuned_alltoall_algorithm=$i srun -N2 -n4 -prome7402 ./a.out
done
Algo 1
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].
Process 1, RECV = [1 1].
Process 1, RECV = [2 1].
Process 1, RECV = [3 1].
Algo 2
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].
Process 1, RECV = [1 1].
Process 1, RECV = [2 1].
Process 1, RECV = [3 1].
Algo 3
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].
Process 1, RECV = [1 1].
Process 1, RECV = [2 1].
Process 1, RECV = [3 1].
Algo 4
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].
Process 1, RECV = [1 1].
Process 1, RECV = [2 1].
Process 1, RECV = [3 1].
Detailed bug behaviour: to save memory, the classic bruck exchanges are done between tmpbuf and rbuf. tmpbuf used the sdtype representation, and the exchanges were done through an indexed datatype (new_ddt) derived from sdtype. This new_ddt was used both on tmpbuf and on rbuf, which is an illegal use of sdtype on rbuf (symptom 1). At the end of the collective communication, the copy from tmpbuf to rbuf was done using rdtype on both sides, which is an illegal use of rdtype on tmpbuf (symptom 2).
Correction:
- Use the rdtype representation for tmpbuf, so all the communications are based on rdtype.
- Use ompi_datatype_sndrcv instead of ompi_datatype_copy_content_same_ddt to perform the copy from sbuf to tmpbuf (see the sketch below).
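To illustrate why that copy has to go through the datatype engine, here is a minimal, self-contained MPI-level sketch. It uses MPI_Sendrecv to self as the user-level analogue of what ompi_datatype_sndrcv does internally; the variable names are illustrative and this is not the patch itself. A raw byte copy between the two layouts would keep the gap element, while the send/receive path converts between the two type maps correctly:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Same layouts as the reproducer: [sender, gap, receiver] on the send
     * side, two contiguous ints on the receive side. */
    int src[3] = { 7, -7, 42 };
    int dst[2] = { 0, 0 };

    MPI_Datatype vector;                      /* picks elements 0 and 2 */
    MPI_Type_vector(2, 1, 2, MPI_INT, &vector);
    MPI_Type_commit(&vector);

    /* Send to self: 1 "vector" element is converted into 2 contiguous ints.
     * A memcpy of the first 8 bytes would instead produce [7, -7]. */
    MPI_Sendrecv(src, 1, vector, rank, 0,
                 dst, 2, MPI_INT, rank, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (0 == rank) {
        printf("dst = [%d %d]  (expected [7 42])\n", dst[0], dst[1]);
    }

    MPI_Type_free(&vector);
    MPI_Finalize();
    return 0;
}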
bot:aws:retest
This is almost certainly an issue on v5.0.x. What about v4.1.x and v4.0.x?
The bug is already present in v2.0.4. I think this fix can be applied to all supported versions.
There are 2 things addressed in this patch, one correctness issue and one optimization:
- correctness: the incorrect use of rdtype in Step 3. A simpler fix would have been to replace the use in Step 3 of ompi_datatype_copy_content_same_ddt with ompi_datatype_sndrcv (similar to what is proposed here, but without changing the logic of the implementation from using the sender type to using the receiver type on the temporary buffer).
- optimization: the move from ompi_datatype_create_indexed to ompi_datatype_create_indexed_block, which reduces the number of temporary buffers used in the operation.

Please change the text of the PR and the log of the commit to reflect the proposed changes.
For point 1, the use of an sdtype-based datatype on rbuf in Step 2 is also erroneous. If we keep tmpbuf in sdtype representation, we also need to create a second new_ddt based on rdtype and replace ompi_datatype_copy_content_same_ddt with ompi_datatype_sndrcv in the loop, which would lead to a performance drop.
I have another optimization that basically removes the need for the displs array. Instead of using MPI-level functions to create the intermediary datatypes, it uses OMPI capabilities to create specialized datatypes with a gap at the beginning. Unfortunately I do not have write access to your repo, so I cannot push.
Here is a quick manual patch: replace the create_indexed_block call with the following code and remove the displs array.
int vlen = (distance * 2) < size ? distance : size - distance;
struct ompi_datatype_t *ddt;

/* Base pattern: vlen consecutive sdtype elements. */
ompi_datatype_create_vector(1, vlen, vlen * 2, sdtype, &ddt);

/* Repeat the pattern, starting after a gap of `distance` elements. */
new_ddt = ompi_datatype_create(ddt->super.desc.used);
ompi_datatype_add(new_ddt, ddt, (size + distance - 1) / (distance * 2),
                  distance * sext, vlen * 2 * sext);

/* Commit the new datatype */
err = ompi_datatype_commit(&new_ddt);
if (err != MPI_SUCCESS) { line = __LINE__; goto err_hndl; }
Using these routines we can create datatypes in a more flexible way than with the MPI API, which means we can build the datatype representing the bruck exchange buffer on the fly and skip Step 1 (the local rotation).
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
@FlorentGermain-Bull @bosilca @ggouaillardet Is this PR ready to merge?
Yes, it is ok for me
@bosilca is the change in the datatype building ok for you?
@bosilca ^^^
I like sndrcv() for the data copying, and I wish an analogue of ompi_datatype_add() were part of the MPI standard; that's a nice API option to have. It looks good to me with one caveat: shouldn't new_ddt get freed (or deleted/destroyed, but I'm thinking freed in this context)?
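For reference, a minimal sketch of where that cleanup could go, assuming ompi_datatype_destroy() is the appropriate teardown call for these internally built datatypes (the intermediate ddt from ompi_datatype_create_vector would presumably need the same treatment). This is illustrative only, not the actual diff:

/* Hypothetical cleanup sketch -- not taken from the patch itself. */
ompi_datatype_destroy(&ddt);        /* the intermediate vector pattern is only
                                       needed while building new_ddt */

/* ... exchanges for this distance, using new_ddt ... */

ompi_datatype_destroy(&new_ddt);    /* release the committed per-step datatype
                                       before the next iteration */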