Coll/base: FIX bruck when sdtype and rdtype are different

Open FlorentGermain-Bull opened this issue 2 years ago • 18 comments

Coll/base/coll_base_alltoall.c:ompi_coll_base_alltoall_intra_bruck FIX: switch tmpbuf from sdtype to rdtype representation to avoid illegal use of sdtype on rbuf and wrong behaviour when sdtype and rdtype are different

OPTIM: use ompi_datatype_create_indexed instead of ompi_datatype_create_indexed_block to reduce the number of temporary buffers used

Signed-off-by: Florent Germain [email protected]

FlorentGermain-Bull avatar Jun 20 '22 10:06 FlorentGermain-Bull

Can one of the admins verify this patch?

ompiteam-bot avatar Jun 20 '22 10:06 ompiteam-bot

ok to test

ggouaillardet avatar Jun 20 '22 10:06 ggouaillardet

@FlorentGermain-Bull Thanks for the patch! can you please share a test program that evidences the issue? I will then add it to our test suite.

ggouaillardet avatar Jun 20 '22 10:06 ggouaillardet

There is a bug in the modified Bruck alltoall algorithm in the coll/base module. It is algorithm 3 when using tuned dynamic rules, and tuned chooses it by default on some configurations.

Here are the symptoms when sdtype and rdtype have different type maps:

  • The result is wrong
  • Memory outside the region described by rbuf can be altered during the collective communication

Here is a reproducer that triggers this bug:

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
 
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);
 
    int my_rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    int my_values[size*3];
    for(int i = 0; i < size; i++)
    {   
        my_values[3*i] =  my_rank;
        my_values[3*i+1] = -my_rank;
        my_values[3*i+2] = i;
    }   
    if (my_rank == 1) {
        for(int i=0; i<size; i++) {
            printf("Process %d, SEND = [%d %d %d].\n", my_rank, my_values[3*i], my_values[3*i+1], my_values[3*i+2]);
        }   
    }   
    int buffer_recv[size*2];
    MPI_Datatype vector;
    MPI_Type_vector(2, 1, 2, MPI_INT, &vector);
    MPI_Type_commit(&vector);
    MPI_Alltoall(&my_values, 1, vector, buffer_recv, 2, MPI_INT, MPI_COMM_WORLD);
    if (my_rank == 1) {
        for(int i=0; i<size; i++) {
            printf("Process %d, RECV = [%d %d].\n", my_rank, buffer_recv[2*i], buffer_recv[2*i +1]);
        }   
    }   
    MPI_Type_free(&vector);
    MPI_Finalize();
 
    return EXIT_SUCCESS;
}

Each send block is a vector of 2 ints separated by a 1-int gap: sender, -1, receiver. Each receive block is 2 contiguous ints: sender, receiver.

Before the fix:

$ for i in 1 2 3 4
do
    echo -e "\nAlgo $i"
    OMPI_MCA_coll_tuned_use_dynamic_rules=true OMPI_MCA_coll_tuned_alltoall_algorithm=$i srun -N2 -n4 -prome7402 ./a.out
done

Algo 1
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].
Process 1, RECV = [1 1].
Process 1, RECV = [2 1].
Process 1, RECV = [3 1].

Algo 2
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].
Process 1, RECV = [1 1].   
Process 1, RECV = [2 1].   
Process 1, RECV = [3 1].   

Algo 3
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [1 0].   
Process 1, RECV = [1 0].   
Process 1, RECV = [3 32765].
Process 1, RECV = [0 1].   

Algo 4
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].   
Process 1, RECV = [1 1].   
Process 1, RECV = [2 1].   
Process 1, RECV = [3 1].   

After the fix:

$ for i in 1 2 3 4
do
    echo -e "\nAlgo $i"
    OMPI_MCA_coll_tuned_use_dynamic_rules=true OMPI_MCA_coll_tuned_alltoall_algorithm=$i srun -N2 -n4 -prome7402 ./a.out
done

Algo 1
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].
Process 1, RECV = [1 1].
Process 1, RECV = [2 1].
Process 1, RECV = [3 1].

Algo 2
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].
Process 1, RECV = [1 1].   
Process 1, RECV = [2 1].   
Process 1, RECV = [3 1].   

Algo 3
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].   
Process 1, RECV = [1 1].   
Process 1, RECV = [2 1].   
Process 1, RECV = [3 1].   

Algo 4
Process 1, SEND = [1 -1 0].
Process 1, SEND = [1 -1 1].
Process 1, SEND = [1 -1 2].
Process 1, SEND = [1 -1 3].
Process 1, RECV = [0 1].   
Process 1, RECV = [1 1].   
Process 1, RECV = [2 1].   
Process 1, RECV = [3 1]. 

Detailed bug behaviour: To save memory, the classic Bruck exchanges are done between tmpbuf and rbuf. tmpbuf used the sdtype representation, and exchanges were done through an indexed datatype derived from sdtype. This indexed datatype (new_ddt) was used on both tmpbuf and rbuf, an illegal use of sdtype on rbuf (symptom 1). At the end of the collective communication, the copy from tmpbuf to rbuf was done using rdtype on both sides, an illegal use of rdtype on tmpbuf (symptom 2).

Correction: Use the rdtype representation for tmpbuf, so that all communications are based on rdtype, and use ompi_datatype_sndrcv instead of ompi_datatype_copy_content_same_ddt to perform the copy from sbuf to tmpbuf.

FlorentGermain-Bull avatar Jun 20 '22 10:06 FlorentGermain-Bull

bot:aws:retest

awlauria avatar Jun 21 '22 16:06 awlauria

This is almost certainly an issue on v5.0.x. What about v4.1.x and v4.0.x?

awlauria avatar Jun 21 '22 16:06 awlauria

This is almost certainly an issue on v5.0.x. What about v4.1.x and v4.0.x?

The bug is already present in v2.0.4, so I think this fix can be applied to all supported versions.

FlorentGermain-Bull avatar Jun 22 '22 07:06 FlorentGermain-Bull

There are 2 things addressed in this patch, one correctness fix and one optimization.

  1. correctness: the incorrect use of rdtype in step 3. A simpler fix would have been to replace the use in step 3 of ompi_datatype_copy_content_same_ddt with ompi_datatype_sndrcv (similar to what is proposed here, but without changing the logic of the implementation from using the sender type to using the receiver type on the temporary buffer).
  2. optimization: the move from ompi_datatype_create_indexed to ompi_datatype_create_indexed_block, reducing the number of temporary buffers used in the operation.

Please change the text of the PR and the log of the commit to reflect the proposed changes.

bosilca avatar Jun 22 '22 22:06 bosilca

There are 2 things addressed in this patch, one correctness fix and one optimization.

  1. correctness: the incorrect use of rdtype in step 3. A simpler fix would have been to replace the use in step 3 of ompi_datatype_copy_content_same_ddt with ompi_datatype_sndrcv (similar to what is proposed here, but without changing the logic of the implementation from using the sender type to using the receiver type on the temporary buffer).
  2. optimization: the move from ompi_datatype_create_indexed to ompi_datatype_create_indexed_block, reducing the number of temporary buffers used in the operation.

Please change the text of the PR and the log of the commit to reflect the proposed changes.

Regarding point 1, the use of an sdtype-based datatype on rbuf in step 2 is also erroneous. If we keep tmpbuf in the sdtype representation, we also need to create a second new_ddt based on rdtype and replace ompi_datatype_copy_content_same_ddt with ompi_datatype_sndrcv inside the loop, which would lead to a performance drop.

FlorentGermain-Bull avatar Jun 23 '22 06:06 FlorentGermain-Bull

I have another optimization that basically removes the need for the displs array. Instead of using MPI-level functions to create the intermediary datatypes, it uses OMPI's internal capabilities to create specialized datatypes with a gap at the beginning. Unfortunately, I do not have write access to your repo, so I cannot push.

Here is a quick manual patch: replace the create_indexed_block call with the following code and remove the displs array.

        int vlen = (distance * 2) < size ? distance : size - distance;
        struct ompi_datatype_t *ddt;
        ompi_datatype_create_vector( 1,
                                     vlen,
                                     vlen * 2,
                                     sdtype, &ddt);
        new_ddt = ompi_datatype_create( ddt->super.desc.used );
        ompi_datatype_add(new_ddt, ddt, (size + distance - 1) / (distance * 2), distance * sext, vlen * 2 * sext);

        /* Commit the new datatype */
        err = ompi_datatype_commit(&new_ddt);
        if (err != MPI_SUCCESS) { line = __LINE__; goto err_hndl;  }

Using these routines we can create datatypes in a more flexible way than through the MPI API, which means we can build the datatype representing the Bruck exchange buffer on the fly and skip step 1 (the local rotation).

bosilca avatar Jun 24 '22 05:06 bosilca

/azp run

awlauria avatar Jul 05 '22 15:07 awlauria

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Jul 05 '22 15:07 azure-pipelines[bot]

/azp run

awlauria avatar Jul 05 '22 22:07 awlauria

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Jul 05 '22 22:07 azure-pipelines[bot]

/azp run

awlauria avatar Jul 15 '22 18:07 awlauria

Azure Pipelines successfully started running 1 pipeline(s).

azure-pipelines[bot] avatar Jul 15 '22 18:07 azure-pipelines[bot]

@FlorentGermain-Bull @bosilca @ggouaillardet Is this PR ready to merge?

jsquyres avatar Jul 17 '22 12:07 jsquyres

@FlorentGermain-Bull @bosilca @ggouaillardet Is this PR ready to merge?

Yes, it is ok for me

@bosilca is the change in the datatype building ok for you?

FlorentGermain-Bull avatar Jul 28 '22 07:07 FlorentGermain-Bull

@bosilca ^^^

gpaulsen avatar Aug 18 '22 21:08 gpaulsen

I like sndrcv() for the data copying, and I wish an analogue of ompi_datatype_add() were part of the MPI standard; that's a nice API option to have. It looks good to me with one caveat: shouldn't new_ddt get freed (or deleted/destroyed, but I'm thinking freed in this context)?

markalle avatar Aug 23 '22 18:08 markalle