
Inter-node MPI_Get on GPU buffer hangs

Open dycz0fx opened this issue 1 year ago • 5 comments

When a large number of MPI_Get are called before an MPI_Win_fence on the GPU buffer across nodes, the program seems to hang. I will share the location of the reproducer by email.
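
A minimal sketch of the failure pattern (not the actual reproducer, which was shared by email): many small MPI_Get calls batched before a single MPI_Win_fence, targeting a rank on another node. The sizes, counts, and rank placement below are illustrative assumptions, and in the real test the buffers live on the GPU.

/* Sketch only: illustrative sizes and placement, not the actual reproducer.
 * To exercise the GPU path, replace the malloc calls with a GPU allocation
 * (cudaMalloc, sycl::malloc_device, zeMemAllocDevice, ...). */
#include <mpi.h>
#include <stdlib.h>

#define MSG_BYTES 8192      /* small message -> eager/AM path */
#define NUM_GETS  20000     /* many gets batched before one fence */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *win_buf = malloc(MSG_BYTES);
    char *get_buf = malloc((size_t) NUM_GETS * MSG_BYTES);

    MPI_Win win;
    MPI_Win_create(win_buf, MSG_BYTES, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0) {
        int target = size - 1;  /* assumed to be on a different node */
        for (int i = 0; i < NUM_GETS; i++)
            MPI_Get(get_buf + (size_t) i * MSG_BYTES, MSG_BYTES, MPI_BYTE,
                    target, 0, MSG_BYTES, MPI_BYTE, win);
    }
    MPI_Win_fence(0, win);      /* hang reported here with GPU buffers */

    MPI_Win_free(&win);
    free(win_buf);
    free(get_buf);
    MPI_Finalize();
    return 0;
}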

In MPIDIG_mpi_win_fence, some ranks are stuck at

MPIDIU_PROGRESS_DO_WHILE(MPIR_cc_get(MPIDIG_WIN(win, local_cmpl_cnts)) != 0 ||
                         MPIR_cc_get(MPIDIG_WIN(win, remote_acc_cmpl_cnts)) != 0, vci);

with local_cmpl_cnts still greater than 0.

There are some observations from previous experiments:

  1. The reproducer works for large messages but fails for small messages: changing all the messages to 40KB works, while changing them to 8KB fails, which points at the eager protocol.
  2. The reproducer works for the CPU buffer but fails for the GPU buffer.
  3. The reproducer works if MPI_Win_fence is called more often; for example, calling MPI_Win_fence every 1000 messages works (even for small messages on the GPU); see the sketch after this list.
  4. The reproducer works if all the ranks are on the same node but fails if the ranks are distributed across multiple nodes.
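
As a rough illustration of observation 3 (names reused from the sketch above, and assuming every rank calls this with the same num_gets so the collective fences match up), interleaving a fence every 1000 gets keeps the number of outstanding operations bounded:

#include <mpi.h>

/* Workaround sketch: flush outstanding gets with a collective fence every
 * 1000 operations instead of batching them all before a single fence. */
static void get_with_periodic_fence(char *get_buf, int msg_bytes, int num_gets,
                                    int target, MPI_Win win)
{
    MPI_Win_fence(0, win);
    for (int i = 0; i < num_gets; i++) {
        MPI_Get(get_buf + (size_t) i * msg_bytes, msg_bytes, MPI_BYTE,
                target, 0, msg_bytes, MPI_BYTE, win);
        if ((i + 1) % 1000 == 0)
            MPI_Win_fence(0, win);  /* every rank must reach this fence */
    }
    MPI_Win_fence(0, win);
}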

dycz0fx avatar Jan 30 '24 22:01 dycz0fx

@raffenet @hzhou Hi Ken and Hui, I have sent the reproducer by email. Would you please point me to the CVARs you mentioned in the meeting so I can give them a try?

dycz0fx avatar Jan 30 '24 23:01 dycz0fx

I was wondering if there is an issue when you exhaust a "chunk" of GenQ buffers while issuing the gets. You could try increasing the number of buffers per chunk with MPIR_CVAR_CH4_NUM_PACK_BUFFERS_PER_CHUNK.
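
For reference, MPICH CVARs are set as environment variables at launch time, so trying a larger chunk would look something like

MPIR_CVAR_CH4_NUM_PACK_BUFFERS_PER_CHUNK=128 mpiexec -n 2 ./reproducer

where the value 128 and the binary name are only illustrative, not a recommendation.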

raffenet avatar Jan 31 '24 21:01 raffenet

Reading more closely, I see the ranks are stuck waiting for active message RMA completions, not the native libfabric ops. I'll have to take another look at the AM code since it differs from the netmod implementation.

raffenet avatar Jan 31 '24 22:01 raffenet

I reproduced this with MPICH main on Sunspot. Just jotting down some notes after adding printfs to the code. All MPI_Get operations are going through the active message path. For the process that issues the get, a request is created and the local window completion counter is incremented. At request completion (i.e., once the data has been sent back from the target and placed in the user buffer), the local counter is decremented via the completion_notification pointer.

When the code hangs, I observe that the completion notification mechanism is never triggered, so the window counter is never decremented and the fence never returns. When the fence is called more frequently, I can see requests completing and completion_notification being triggered as expected.
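
To make the mechanism above concrete, here is a schematic of the completion-counter pattern in plain C11 atomics; the names and structure are illustrative, not the actual MPIDIG code, which uses the MPIR_cc_* helpers and the per-window local_cmpl_cnts shown earlier.

#include <stdatomic.h>

typedef struct {
    atomic_int local_cmpl_cnt;  /* outstanding AM RMA ops issued by this rank */
} win_state_t;

/* Origin side: issuing an AM get bumps the counter; the request carries a
 * pointer back to it as its completion notification. */
static void issue_am_get(win_state_t *w)
{
    atomic_fetch_add(&w->local_cmpl_cnt, 1);
    /* ... send the get request to the target ... */
}

/* Called when the response data has been placed in the user buffer. In the
 * hang, this is never reached for the outstanding gets. */
static void am_get_complete(win_state_t *w)
{
    atomic_fetch_sub(&w->local_cmpl_cnt, 1);
}

/* Fence: spin in the progress loop until all outstanding ops complete.
 * If the counter is never decremented, this never exits. */
static void win_fence_wait(win_state_t *w)
{
    while (atomic_load(&w->local_cmpl_cnt) != 0) {
        /* make communication progress */
    }
}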

raffenet avatar Feb 07 '24 17:02 raffenet

Processes are stuck in an infinite loop here https://github.com/pmodels/mpich/blob/8af3921493b7961baf4103ed5e1e2a2738df363e/src/mpid/ch4/netmod/ofi/ofi_am_impl.h#L379.

The target is trying to send the responses back to the origin, but it keeps getting EAGAIN. I need to understand why progress is not being made so those messages can get through.
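
For context, the loop at that line is essentially a send-retry-with-progress pattern; a simplified, self-contained schematic (not the actual ofi_am_impl.h code) looks like this:

#include <rdma/fabric.h>
#include <rdma/fi_errno.h>
#include <rdma/fi_endpoint.h>
#include <rdma/fi_eq.h>

/* Keep retrying the AM send while the provider returns -FI_EAGAIN, polling
 * the completion queue in between so it can retire pending operations. If no
 * completions ever arrive (as observed here), the loop spins forever. */
static ssize_t send_with_retry(struct fid_ep *ep, struct fid_cq *cq,
                               const void *buf, size_t len, void *desc,
                               fi_addr_t dest, void *context)
{
    ssize_t ret;
    do {
        ret = fi_send(ep, buf, len, desc, dest, context);
        if (ret == -FI_EAGAIN) {
            struct fi_cq_tagged_entry e;
            (void) fi_cq_read(cq, &e, 1);   /* drive provider progress */
        }
    } while (ret == -FI_EAGAIN);
    return ret;
}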

raffenet avatar Feb 07 '24 22:02 raffenet