
MPI_Test is blocking and MPI_Start does not start communication

Open • psteinbrecher opened this issue • 6 comments

QUDA relies on the non-blocking behavior of MPI_Test, but the current MPICH implementation of MPI_Test is blocking: tracing shows that MPI_Test only returns when the communication is complete. MPI_Test should just test whether the communication is complete, not wait for completion. The QUDA app sends multiple messages per halo exchange, so this issue serializes all communication within a halo exchange.
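
For context, here is a minimal sketch of the usage pattern that breaks (not the actual QUDA code; halo_exchange, sendbuf, recvbuf, counts, neighbors, and ndirs are illustrative): several halo messages are posted up front and then polled with MPI_Test, which is expected to return immediately whether or not the transfer has finished.

#include <mpi.h>
#include <vector>

/* Post all halo messages up front, then poll them with MPI_Test.
 * With a truly non-blocking MPI_Test all transfers progress in
 * parallel; if MPI_Test waits for completion, each call drains one
 * message before the next one is even looked at. */
void halo_exchange(void *sendbuf[], void *recvbuf[], const int counts[],
                   const int neighbors[], int ndirs, MPI_Comm comm)
{
    std::vector<MPI_Request> reqs(2 * ndirs);
    for (int d = 0; d < ndirs; d++) {
        MPI_Irecv(recvbuf[d], counts[d], MPI_BYTE, neighbors[d], d, comm, &reqs[2 * d]);
        MPI_Isend(sendbuf[d], counts[d], MPI_BYTE, neighbors[d], d, comm, &reqs[2 * d + 1]);
    }
    int remaining = 2 * ndirs;
    while (remaining > 0) {
        for (int i = 0; i < 2 * ndirs; i++) {
            if (reqs[i] == MPI_REQUEST_NULL)
                continue;
            int flag = 0;
            MPI_Test(&reqs[i], &flag, MPI_STATUS_IGNORE);  /* should return immediately */
            if (flag)
                remaining--;
        }
        /* ... interior compute that should overlap the transfers ... */
    }
}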

QUDA also benefits from MPI_Start actually starting the communication, but the current MPICH implementation does not do this. Whatever change we put in to fix MPI_Test could also be used in MPI_Start, e.g. by calling the MPI_Test logic at the end of MPI_Start.
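
For the MPI_Start part, a minimal sketch of the persistent-request pattern in question (persistent_halo, the buffer names, and the peer rank are placeholders): if MPI_Start only queues the operation instead of starting the transfer, nothing moves until the MPI_Waitall, and the compute region gets no overlap.

#include <mpi.h>

/* Set up persistent halo requests, start them, and overlap compute.
 * Ideally the transfer begins inside MPI_Start. */
void persistent_halo(double *sendbuf, double *recvbuf, int count,
                     int peer, MPI_Comm comm)
{
    MPI_Request reqs[2];
    MPI_Recv_init(recvbuf, count, MPI_DOUBLE, peer, 0, comm, &reqs[0]);
    MPI_Send_init(sendbuf, count, MPI_DOUBLE, peer, 0, comm, &reqs[1]);

    MPI_Start(&reqs[0]);   /* communication should start here */
    MPI_Start(&reqs[1]);
    /* ... independent compute that should overlap the transfer ... */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
}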

Here are the other MPI implementations that have the needed behavior: OpenMPI, HPCX, Cray MPI, MVAPICH.

You can test the behavior on Aurora with the attached reproducer t.cpp (t.zip).

Run on 1 Aurora node via:

mpicxx -O3 -fiopenmp -fopenmp-targets=spir64 ./t.cpp && mpiexec -np 2 -ppn 2 -envall --cpu-bind=verbose,list:2-8:12-18 ./run_aout.sh

where run_aout.sh is:

#!/bin/bash

export MPIR_CVAR_GPU_USE_IMMEDIATE_COMMAND_LIST=1
export MPIR_CVAR_GPU_ROUND_ROBIN_COMMAND_QUEUES=1
export MPIR_CVAR_CH4_GPU_RMA_ENGINE_TYPE=auto

if [ "$PALS_LOCAL_RANKID" -eq 0 ]; then
    ZE_AFFINITY_MASK=0.0 ./a.out
elif [ "$PALS_LOCAL_RANKID" -eq 1 ]; then
    ZE_AFFINITY_MASK=0.1 ./a.out
fi

Then use your favorite tracing tool, e.g. unitrace or iprof, to see whether all comms run in parallel and also overlap with the compute kernel.

Here is a screenshot showing that we get neither overlap nor parallel comms, due to the blocking nature of MPI_Test in MPICH. (screenshot: mpi_test)

psteinbrecher avatar Dec 06 '23 20:12 psteinbrecher

I see. So it is the IPC path. We need to make the IPC GPU copy non-blocking.

hzhou avatar Dec 08 '23 15:12 hzhou

Let us know if you need any kind of help.

jxy avatar Dec 08 '23 22:12 jxy

@jxy Thanks. Could you confirm that the blocking nature of MPI_Test is observed for both intra-node and inter-node? I'll focus on the intra-node first and ping you for testing when we have a patch.

hzhou avatar Dec 08 '23 22:12 hzhou

Yes, both have the issue. Getting intra-node first sounds like a good plan.

psteinbrecher avatar Dec 08 '23 22:12 psteinbrecher

@psteinbrecher @jxy This PR, https://github.com/pmodels/mpich/pull/6841, should fix the blocking issue of MPI_Test, at least for contiguous datatypes intra-node. Could you test?

hzhou avatar Jan 23 '24 04:01 hzhou

Yes, let me try to build and run it!

psteinbrecher avatar Jan 24 '24 23:01 psteinbrecher

Overlap not happening with this change. Let's discuss directly. Will reach out to you.

psteinbrecher avatar Jul 01 '24 14:07 psteinbrecher

Tested the reproducer using the main branch on Sunspot, running two processes on a single node, with some printf debugging:

[1] num_elements = 128000000 (max 256000000)
[0] num_elements = 128000000 (max 256000000)
[0] . MPIDI_IPCI_handle_lmt_recv: src_data_sz=512000000, data_sz=512000000, dev_id=0, map_dev=0
[0]   . engine=1 MPIR_Ilocalcopy_gpu...
[0]   . MPIR_Async_things_add: is MPIR_GPU_REQUEST: 1, async_count = 0
[0] . MPIDI_IPCI_handle_lmt_recv: src_data_sz=512000000, data_sz=512000000, dev_id=0, map_dev=0
[0]   . engine=1 MPIR_Ilocalcopy_gpu...
[0]   . MPIR_Async_things_add: is MPIR_GPU_REQUEST: 1, async_count = 1
[1] . MPIDI_IPCI_handle_lmt_recv: src_data_sz=512000000, data_sz=512000000, dev_id=0, map_dev=0
[1]   . engine=1 MPIR_Ilocalcopy_gpu...
[1]   . MPIR_Async_things_add: is MPIR_GPU_REQUEST: 1, async_count = 0
[1] . MPIDI_IPCI_handle_lmt_recv: src_data_sz=512000000, data_sz=512000000, dev_id=0, map_dev=0
[1]   . engine=1 MPIR_Ilocalcopy_gpu...
[1]   . MPIR_Async_things_add: is MPIR_GPU_REQUEST: 1, async_count = 1
[1]   . gpu_ipc_async_poll succeeded, async_count = 6258
[1] recv_request completed after MPI_Test 3131 times
[1]   . gpu_ipc_async_poll succeeded, async_count = 11184
[0]   . gpu_ipc_async_poll succeeded, async_count = 38532
[0] recv_request completed after MPI_Test 19268 times
[1] recv_request2 completed after MPI_Test 1 times
[0]   . gpu_ipc_async_poll succeeded, async_count = 64342
[0] recv_request2 completed after MPI_Test 25809 times
[0] local_time = 38.414 ms
[0] ================ rank 0 =====================
[0] STREAM array size = 1024 MB
[0] MPI message size = 512 MB
[0] total time = 38414 usec
[0] MPI bandwidth = 26.6569 GB/s
[1] local_time = 38.404 ms
[1] ================ rank [1] 1 =====================
[1] STREAM array size = [1] 1024 MB
[1] MPI message size = 512 MB
[1] total time = 38404 usec
[1] MPI bandwidth = 26.6639 GB/s

I think this shows that MPI_Test is not blocking; otherwise, each request would complete after a single MPI_Test call.
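
For reference, the counters above come from a loop of roughly this shape (a sketch, not the actual t.cpp code; recv_request matches the name in the output, while rank and test_until_done are illustrative): a blocking MPI_Test would always report 1, while a large count means the call returned many times before the transfer finished.

#include <mpi.h>
#include <cstdio>

/* Count how many MPI_Test calls it takes for a pending receive to
 * complete; a blocking MPI_Test would always report 1 here. */
static int test_until_done(MPI_Request *recv_request, int rank)
{
    int ntest = 0, flag = 0;
    while (!flag) {
        MPI_Test(recv_request, &flag, MPI_STATUS_IGNORE);
        ntest++;
    }
    std::printf("[%d] recv_request completed after MPI_Test %d times\n", rank, ntest);
    return ntest;
}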

hzhou avatar Jul 02 '24 20:07 hzhou

Yes, sorry, that was a mistake on my side with an older test I provided here. It works fine now: overlap of comm and compute is seen, as well as multiple comms running in parallel. Will test it with QUDA next.

psteinbrecher avatar Jul 03 '24 02:07 psteinbrecher

The second part of this issue is -- "MPI_Start does not start communication".

This is an issue of MPI not providing a strong progress guarantee. Strong progress means MPI is able to make progress and complete the communication after it is started (MPI_Start, MPI_Irecv, etc.) without the user making any MPI calls in between. This is achievable for simple protocols when the NIC can carry out the transfer on its own. For more "fancy" algorithms, such as pipelining or large-message rendezvous (RNDV), CPU progression is still required, and when it is, it may take multiple progress pokes depending on the communication stage. Thus, to ensure full overlap, the application generally needs a progression scheme that regularly calls MPI_Test while performing computation.
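
A minimal sketch of such a progression scheme, assuming the computation can be split into chunks (compute_with_progress, compute_chunk, and nchunks are placeholders): each MPI_Test call gives MPI a chance to advance pipelined or rendezvous stages while the compute proceeds.

#include <mpi.h>

/* Interleave compute chunks with MPI_Test so the library can make
 * progress on multi-stage protocols while the application works. */
void compute_with_progress(MPI_Request *req, int nchunks,
                           void (*compute_chunk)(int))
{
    int flag = 0;
    for (int c = 0; c < nchunks; c++) {
        compute_chunk(c);                                  /* a slice of the real work */
        if (!flag)
            MPI_Test(req, &flag, MPI_STATUS_IGNORE);       /* poke MPI progress */
    }
    if (!flag)
        MPI_Wait(req, MPI_STATUS_IGNORE);                  /* finish remaining stages */
}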

hzhou avatar Jul 08 '24 15:07 hzhou

What is the status for the MPI_Test blocking issue with inter-node comms?

jxy avatar Jul 08 '24 16:07 jxy

The inter-node comms depend on the actual path taken. Ideally it is routed to the native path -- CXI -- which is supposed to perform RDMA asynchronously. But I understand there are currently issues, so it falls back to making host copies or to a pipeline algorithm (contributed by Intel, enabled by setting a CVAR). The latter is asynchronous; the former, I believe, is blocking. Could you (maybe @psteinbrecher) create a separate issue to track the inter-node case?

hzhou avatar Jul 08 '24 16:07 hzhou