
Hang in one-sided communications across multiple nodes

Open victor-anisimov opened this issue 1 year ago • 60 comments

reproducer.tgz

One-sided communications between GPU pointers intermittently hang on Aurora. The attached reproducer runs on 96 nodes using six 16-node subcommunicators. One needs to run about 10-50 jobs to catch the hang. A successful job prints "All Done" in the output file; if that string is absent after the time runs out, the run hung. The job requires 2-3 minutes of walltime to complete; a job that does not finish in 5 minutes is most likely hung and will not progress.

victor-anisimov avatar Aug 27 '24 18:08 victor-anisimov
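
For readers who don't open the attachment, here is a minimal C sketch of the kind of fence-synchronized MPI_Get traffic described above. It is illustrative only: the attached reproducer is Fortran, uses GPU pointers, and runs over six 16-node subcommunicators, and the neighbor pattern and counts below are assumptions.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    const int count = 1024;           /* assumed message size */
    double *winbuf, *local;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* expose a window; the real reproducer exposes GPU (device) memory */
    MPI_Win_allocate((MPI_Aint) (count * sizeof(double)), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &winbuf, &win);
    local = malloc(count * sizeof(double));

    for (int iter = 0; iter < 1000; iter++) {
        int target = (rank + 1) % size;   /* hypothetical neighbor pattern */
        MPI_Win_fence(0, win);
        MPI_Get(local, count, MPI_DOUBLE, target, 0, count, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);   /* the reported hang surfaces in MPI_Win_fence */
    }

    MPI_Win_free(&win);
    free(local);
    if (rank == 0) printf("All Done\n");
    MPI_Finalize();
    return 0;
}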

@victor-anisimov I built a debug version of mpich on Aurora at /home/hzhou/pull_requests/mpich-debug/_inst. Could you try linking with that and running the reproducer with MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=1 set? The goal is to generate a backtrace as shown in https://github.com/pmodels/mpich/pull/7084

I am not sure of the easiest way to switch libmpi.so. Try

export LD_LIBRARY_PATH=/home/hzhou/pull_requests/mpich-debug/_inst/lib:$LD_LIBRARY_PATH

and verify with ldd -- [EDIT: that likely won't work due to the .so version mismatch]. If that doesn't work, try

LD_PRELOAD=/home/hzhou/pull_requests/mpich-debug/_inst/lib/libmpi.so.0

hzhou avatar Aug 28 '24 20:08 hzhou

Thank you for the suggestion, @hzhou ! I linked the reproducer code against your version of MPICH by doing

module unload mpich
export PATH=${PATH}:/home/hzhou/pull_requests/mpich-debug/_inst/bin
mpif90 -fc=ifx -O0 -g -i4 -r8 -what -fiopenmp -fopenmp-targets=spir64 test.F90
ldd ./a.out
    libmpifort.so.0 => /home/hzhou/pull_requests/mpich-debug/_inst/lib/libmpifort.so.0 (0x00007f017ddf7000)
    libmpi.so.0 => /home/hzhou/pull_requests/mpich-debug/_inst/lib/libmpi.so.0 (0x00007f017b486000)
./bulk.sh

The test job crashed at initialization with the error message:

forrtl: severe (71): integer divide by zero
Image              PC                Routine            Line      Source
libpthread-2.31.s  00001460507998C0  Unknown             Unknown   Unknown
a.out              000000000040C7E2  Unknown             Unknown   Unknown
a.out              000000000040C47D  Unknown             Unknown   Unknown
libc-2.31.so       00001460505C324D  __libc_start_main   Unknown   Unknown
a.out              000000000040C3AA  Unknown             Unknown   Unknown

I guess the MPICH build configuration causes this crash. Could you take the reproducer, which is very simple, and try to run the test?

reproducer6.tgz

victor-anisimov avatar Aug 28 '24 21:08 victor-anisimov

module unload mpich

Make sure to set PALS_PMI=pmix if you unload the mpich module.

while setting MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=1

Make it MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=10. A 1-second timeout was too short; 10 seconds seems to be sufficient for normal progress without flagging a false hang.

Could you take the reproducer,

Running 100 loops now (or until my 2 hour walltime kicks me out), will update.

hzhou avatar Aug 29 '24 04:08 hzhou

I think I got some timeouts; I need to confirm with a larger MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT -

...
Rank 0 conducted      1148200 one-sided messages
Rank 0 conducted      1148300 one-sided messages
3 pending requests in pool 0
    ac000004: MPIDI_OFI_do_irecv: source=4, tag=14, data_sz=0
/home/hzhou/MPI/lib/libmpi.so.0(+0xc1bad4) [0x14ab09776ad4]
/home/hzhou/MPI/lib/libmpi.so.0(+0x954a0b) [0x14ab094afa0b]
/home/hzhou/MPI/lib/libmpi.so.0(+0x9557cd) [0x14ab094b07cd]
/home/hzhou/MPI/lib/libmpi.so.0(+0x955167) [0x14ab094b0167]
/home/hzhou/MPI/lib/libmpi.so.0(+0x802a58) [0x14ab0935da58]
/home/hzhou/MPI/lib/libmpi.so.0(+0x6bf95b) [0x14ab0921a95b]
/home/hzhou/MPI/lib/libmpi.so.0(+0x6d1866) [0x14ab0922c866]
/home/hzhou/MPI/lib/libmpi.so.0(+0x7977d8) [0x14ab092f27d8]
/home/hzhou/MPI/lib/libmpi.so.0(+0x797aa3) [0x14ab092f2aa3]
/home/hzhou/MPI/lib/libmpi.so.0(+0x7caee0) [0x14ab09325ee0]
/home/hzhou/MPI/lib/libmpi.so.0(+0x7ca90e) [0x14ab0932590e]
/home/hzhou/MPI/lib/libmpi.so.0(+0x7cac96) [0x14ab09325c96]
/home/hzhou/MPI/lib/libmpi.so.0(+0x797f97) [0x14ab092f2f97]
/home/hzhou/MPI/lib/libmpi.so.0(+0x797c72) [0x14ab092f2c72]
/home/hzhou/MPI/lib/libmpi.so.0(+0x4db4dc) [0x14ab090364dc]
/home/hzhou/MPI/lib/libmpi.so.0(+0x4da580) [0x14ab09035580]
/home/hzhou/MPI/lib/libmpi.so.0(+0x4da3f9) [0x14ab090353f9]
/home/hzhou/MPI/lib/libmpi.so.0(+0x4da1cc) [0x14ab090351cc]
/home/hzhou/MPI/lib/libmpi.so.0(MPI_Win_fence+0x19) [0x14ab09034e69]
/home/hzhou/MPI/lib/libmpifort.so.0(mpi_win_fence_f08_+0x2d) [0x14ab0b5898ad]
./a.out() [0x412013]
./a.out() [0x40c4fd]
/lib64/libc.so.6(__libc_start_main+0xef) [0x14ab0782d24d]
./a.out() [0x40c42a]
...
Abort(546498063) on node 864: Fatal error in internal_Win_fence: Other MPI error, error stack:
internal_Win_fence(83).....................: MPI_Win_fence(assert=0, win=0xa0000000) failed
MPID_Win_fence(124)........................:
MPID_Barrier(167)..........................:
MPIDI_Barrier_allcomm_composition_json(132):
MPIDI_Barrier_intra_composition_alpha(184).:
MPIDI_NM_mpi_barrier(41)...................:
MPIR_Barrier_impl(105).....................:
MPIR_Barrier_allcomm_auto(53)..............:
MPIR_Barrier_intra_recexch(19).............:
MPIR_Allreduce_intra_recexch(238)..........:
MPIC_Waitall(570)..........................:
MPIR_Waitall(907)..........................:
MPIR_Waitall_state(809)....................: Operation timed out
Abort(546498063) on node 876: Fatal error in internal_Win_fence: Other MPI error, error stack:
internal_Win_fence(83).....................: MPI_Win_fence(assert=0, win=0xa0000000) failed
MPID_Win_fence(124)........................:
MPID_Barrier(167)..........................:
MPIDI_Barrier_allcomm_composition_json(132):
MPIDI_Barrier_intra_composition_alpha(184).:
MPIDI_NM_mpi_barrier(41)...................:
MPIR_Barrier_impl(105).....................:
MPIR_Barrier_allcomm_auto(53)..............:
MPIR_Barrier_intra_recexch(19).............:
MPIR_Allreduce_intra_recexch(238)..........:
MPIC_Waitall(570)..........................:
MPIR_Waitall(907)..........................:
MPIR_Waitall_state(809)....................: Operation timed out
...

It is stuck in an internal Barrier.

hzhou avatar Aug 29 '24 12:08 hzhou

 grep 'All Done' job*.out
job10.out:All Done
job11.out:All Done
job12.out:All Done
job13.out:All Done
job14.out:All Done
job16.out:All Done
job17.out:All Done
job18.out:All Done
job19.out:All Done
job1.out:All Done
job21.out:All Done
job22.out:All Done
job25.out:All Done
job26.out:All Done
job27.out:All Done
job28.out:All Done
job29.out:All Done
job2.out:All Done
job30.out:All Done
job31.out:All Done
job32.out:All Done
job33.out:All Done
job34.out:All Done
job35.out:All Done
job36.out:All Done
job37.out:All Done
job38.out:All Done
job39.out:All Done
job3.out:All Done
job40.out:All Done
job41.out:All Done
job42.out:All Done
job4.out:All Done
job6.out:All Done
job7.out:All Done
job8.out:All Done
job9.out:All Done
job.out:All Done

Jobs 5, 23, and 24 timed out.

hzhou avatar Aug 29 '24 12:08 hzhou

I can run the test and am able to reproduce the backtrace after setting

module unload mpich
export PALS_PMI=pmix
export MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=120

Any idea what might have caused the hang in internal_Win_fence?

victor-anisimov avatar Aug 29 '24 19:08 victor-anisimov

I can run the test and am able to reproduce the backtrace after setting

module unload mpich
export PALS_PMI=pmix
export MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=120

Any idea what might have caused the hang in internal_Win_fence?

The previous investigation found that one of the target processes somehow gets overwhelmed by MPI_GET requests and basically stops being able to send responses to the origin. We suspected a libfabric/cxi issue, but it was not clear why it would hang rather than report an error and crash. It's possible that #7117 may improve the error reporting in this case.

raffenet avatar Aug 29 '24 20:08 raffenet

Thanks @raffenet for the potential clue. I can see how that (a single process getting overwhelmed due to a lower-layer failure or congestion) could cause the issue.

hzhou avatar Aug 29 '24 21:08 hzhou

I tried the latest commit b3480ddfec1d9e98b06783aec97c082eadeca1a7, which includes #7117, built by @colleeneb on Aurora. The behavior of the test is unchanged: I still get 8 hangs per 50 runs of the test. All 8 jobs were killed by the scheduler for exceeding the walltime limit, and none of the 8 hanging jobs contains any error or warning message from MPICH. My conclusion is that either I do not know how to run MPICH properly to get those forward-progress errors printed out, or the fixes do not address the hang in one-sided communications. Could @hzhou or @raffenet try using the latest MPICH with the one-sided bug reproducer to see if the forward-progress fixes work for the present case?

victor-anisimov avatar Aug 30 '24 13:08 victor-anisimov

The one built by Colleen doesn't have --enable-g=progress enabled, I believe. We suspect your case is due to the underlying network being unable to handle certain congestion situations while not raising an error; at this point, that is just a guess. I need to think about how to further narrow down the reproducer. Currently, a 1/20 failure rate, and thus hours of runtime for a single reproduction, is a bit difficult to work with.

hzhou avatar Aug 30 '24 13:08 hzhou

Ah, thanks, it does not! I just used ./configure --prefix=$PWD/../install_main --with-device=ch4:ofi --with-libfabric=/opt/cray/libfabric/1.15.2.0 --with-ze --with-pm=no --with-pmi=pmix --with-pmix=/usr CC=icx CXX=icpx FC=ifx. I will add --enable-g=progress to the configure line to try! Is there anything else to change in the build to test?

colleeneb avatar Aug 30 '24 13:08 colleeneb

Ah, thanks, it does not! I just used ./configure --prefix=$PWD/../install_main --with-device=ch4:ofi --with-libfabric=/opt/cray/libfabric/1.15.2.0 --with-ze --with-pm=no --with-pmi=pmix --with-pmix=/usr CC=icx CXX=icpx FC=ifx. I will add --enable-g=progress to the configure line to try! Is there anything else to change in the build to test?

@colleeneb Only do that for a debug build (vs. a performance build). For a debug build, you may also want to use --enable-g=dbg,log,progress

hzhou avatar Aug 30 '24 13:08 hzhou

I ran the ANL-116 reproducer on Aurora with an instrumented libfabric. I verified that the remote key MPICH is passing into libfabric RMA operations is not a key generated by the libfabric CXI provider. There appears to be an MPICH bug in which MPICH provides an invalid remote key, which can result in hangs and failed RMA operations.

iziemba avatar Nov 06 '24 15:11 iziemba

I ran the ANL-116 reproducer on Aurora with an instrumented libfabric. I verified that the remote key MPICH is passing into libfabric RMA operations is not a key generated by the libfabric CXI provider. There appears to be an MPICH bug in which MPICH provides an invalid remote key, which can result in hangs and failed RMA operations.

@iziemba Invalid in what way? I see that user-provided keys are limited to 4 bytes in the fi_cxi manpage. It does look like MPICH could generate a key larger than that, the way the code is set up today. But looking at the error checks, I would expect MPICH to get -FI_EKEYREJECTED from fi_mr_reg and return it to the user rather than proceed and hang.

raffenet avatar Nov 06 '24 15:11 raffenet

I tried adding a dummy registration to my MPICH build just to see what happens. Unless I am missing something, it looks like libfabric will not complain about the invalid key, so we need to implement some verification inside MPICH itself. Here is the code for the registration I tried:

    int ret;
    uint64_t key = UINT64_MAX;    /* deliberately out-of-range requested key */
    void *buf = MPL_malloc(16, MPL_MEM_OTHER);
    struct fid_mr *mr;
    ret = fi_mr_reg(MPIDI_OFI_global.ctx[0].domain,
                    buf, 16,
                    FI_REMOTE_READ, 0,
                    key, 0, &mr, NULL);
    printf("ret = %d, %s\n", ret, fi_strerror(ret));

and here is the output 😕

ret = 0, Success

raffenet avatar Nov 06 '24 16:11 raffenet
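
As a hedged illustration of the "verification inside MPICH itself" suggested above, assuming the 4-byte user-key limit quoted from the fi_cxi manpage earlier in the thread; the wrapper name and placement are hypothetical, not MPICH's actual code:

#include <stdint.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_errno.h>

/* Reject requested keys wider than 4 bytes up front instead of relying on
 * the provider, which (per the experiment above) accepts them silently. */
static int mr_reg_checked(struct fid_domain *domain, void *buf, size_t len,
                          uint64_t requested_key, struct fid_mr **mr)
{
    if (requested_key > UINT32_MAX)
        return -FI_EKEYREJECTED;    /* what the provider would ideally return */
    return fi_mr_reg(domain, buf, len, FI_REMOTE_READ, 0,
                     requested_key, 0, mr, NULL);
}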

More info: after building my own libfabric and adding debug printfs, I see that the provider thinks the requested key is a provider key for some reason and therefore considers it valid. If I enable MPIR_CVAR_CH4_OFI_ENABLE_MR_PROV_KEY=1, the provider actually skips the validation step altogether, because the domain is set up to use provider keys exclusively, I believe.

raffenet avatar Nov 06 '24 17:11 raffenet

So when the user passes in an invalid requested key, it may accidentally cause cxi to think it is a provider key, take the cxip_key.is_prov branch, and somehow pass the validation there...

bool cxip_generic_is_valid_mr_key(uint64_t key)
{
        struct cxip_mr_key cxip_key = {
                .raw = key,
        };

        if (cxip_key.is_prov)
                return cxip_is_valid_prov_mr_key(key);

        return cxip_is_valid_mr_key(key);
}

raffenet avatar Nov 06 '24 17:11 raffenet

After looking closer, I no longer think MPICH will generate a key larger than 4 bytes in this scenario. It must be something else.

raffenet avatar Nov 06 '24 19:11 raffenet

For user keys, I can update the CXI provider to check the RKEY size and ensure it is <= 4 bytes.

The FI_MR_PROV_KEY key size is 8 bytes. When running with MPIR_CVAR_CH4_OFI_ENABLE_MR_PROV_KEY=1, I verified that the libfabric CXI provider was returning the expected generated RKEYs, and yet I then see MPICH use an invalid key. Example:

libfabric:130635:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c0s5b0n0: MR enabled: 0x4f40f70 (key: 0xA000DD743707002F)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x4c37c10 (key: 0xA000EF2705D8004D)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x5894a10 (key: 0xA000A22111480016)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x58947b0 (key: 0xA000A2211149001B)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x367ec40 (key: 0xA000EF2705D9003C)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x367e870 (key: 0xA000EF2705DA0028)
libfabric:125040:1730402622::cxi:core:cxip_rma_common():609<warn> x4414c7s5b0n0: TXC (0x35f41:0):: Invalid remote key: 0x455346464f5f4441
libfabric:130635:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c0s5b0n0: MR enabled: 0x4e62670 (key: 0xA000DD7437080029)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x5894c70 (key: 0xA000A221114A0024)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x5894ed0 (key: 0xA000A221114B0037)
libfabric:130635:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c0s5b0n0: MR enabled: 0x50c8b30 (key: 0xA000DD7437090027)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x4c8df10 (key: 0xA000EF2705DB0040)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x436bdf0 (key: 0xA000EF2705DC0036)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x5895390 (key: 0xA000A221114C0009)

iziemba avatar Nov 06 '24 20:11 iziemba

libfabric:125040:1730402622::cxi:core:cxip_rma_common():609<warn> x4414c7s5b0n0: TXC (0x35f41:0):: Invalid remote key: 0x455346464f5f4441

Hm, OK. The warning is from the RMA origin side. So we need to trace keys from creation at the target -> usage at the origin and see if there's corruption or something.

raffenet avatar Nov 06 '24 20:11 raffenet
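
A rough sketch of the tracing described above, comparing the key attached at the target with the key used at the origin; the helper names and the rank/target arguments are hypothetical, and real instrumentation would live inside MPICH's window-creation and RMA paths:

#include <inttypes.h>
#include <stdio.h>
#include <sys/types.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_rma.h>

/* Target side: log the key actually attached to the registered region. */
static void trace_target_key(int rank, struct fid_mr *mr)
{
    fprintf(stderr, "[target %d] exposed rkey=0x%016" PRIx64 "\n",
            rank, fi_mr_key(mr));
}

/* Origin side: log the key just before it goes on the wire with fi_read. */
static ssize_t traced_read(int rank, int target, struct fid_ep *ep,
                           void *buf, size_t len, void *desc, fi_addr_t dest,
                           uint64_t raddr, uint64_t rkey, void *ctx)
{
    fprintf(stderr, "[origin %d -> target %d] using rkey=0x%016" PRIx64 "\n",
            rank, target, rkey);
    return fi_read(ep, buf, len, desc, dest, raddr, rkey, ctx);
}

Diffing the two logs per rank would show whether the key is corrupted between creation at the target and use at the origin.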

Potential workaround in #7202

hzhou avatar Nov 10 '24 15:11 hzhou

Update: #7202 didn't help, but turning HMEM on did, although it led to the device running out of memory.

colleeneb avatar Dec 18 '24 20:12 colleeneb

Need to investigate the out of memory issue and see if it can be resolved in MPICH or libfabric.

raffenet avatar Dec 18 '24 20:12 raffenet

@colleeneb are you able to re-test with 7202 (newly rebased on main) and with HMEM on? Now that some other GPU-related fixes are in main, we want to see the new status of this.

abrooks98 avatar Mar 25 '25 13:03 abrooks98

Yes, I will put a build of 7202 on the system soon. Thanks for following up!

colleeneb avatar Mar 25 '25 22:03 colleeneb

Thanks, @colleeneb and @abrooks98 ! I ran one hundred 144-node tests on Aurora using the 7202 build with HMEM on, and I got about 1 hang per 10-15 successful runs. That's progress, but the fix is incomplete. The error messages are of the following type:

libfabric:103644:1742955972::cxi:ep_data:cxip_txc_can_emit_op():515 x4519c2s3b0n0: TXC (0x42aa0:1): TXC attr size saturated
libfabric:103644:1742955972::cxi:ep_data:cxip_rma_emit_dma():359 x4519c2s3b0n0: TXC (0x42aa0:1): Failed to emit dma command: -11:Resource temporarily unavailable
libfabric:103644:1742955972::cxi:ep_data:cxip_rma_common():663 x4519c2s3b0n0: TXC (0x42aa0:1): DMA RMA read failed: buf=0xff000003c25a9060 len=26880 rkey=0x800001d0 roffset=0xf08160 nic=0x42ab0 pid=0 pid_idx=228

victor-anisimov avatar Mar 26 '25 02:03 victor-anisimov

libfabric:103644:1742955972::cxi:ep_data:cxip_txc_can_emit_op():515 x4519c2s3b0n0: TXC (0x42aa0:1): TXC attr size saturated
libfabric:103644:1742955972::cxi:ep_data:cxip_rma_emit_dma():359 x4519c2s3b0n0: TXC (0x42aa0:1): Failed to emit dma command: -11:Resource temporarily unavailable
libfabric:103644:1742955972::cxi:ep_data:cxip_rma_common():663 x4519c2s3b0n0: TXC (0x42aa0:1): DMA RMA read failed: buf=0xff000003c25a9060 len=26880 rkey=0x800001d0 roffset=0xf08160 nic=0x42ab0 pid=0 pid_idx=228

I'm surprised these aren't being caught by MPICH and raised as an error. If libfabric is truly in an unrecoverable state after these warnings are issued, we need to be able to capture that and abort the application rather than hang. We should build a development version of libfabric/cxi, try to force an abort in this path to capture the full backtrace, and see whether we are missing an error check or not performing the necessary recovery correctly.

raffenet avatar Mar 27 '25 14:03 raffenet

@victor-anisimov do you think you could try the 7202 build without HMEM on and see how that is?

colleeneb avatar Mar 27 '25 18:03 colleeneb

I tried the 7202 build without HMEM on, running 20 independent tests. None of those 144-node jobs succeeded; all of them crashed. It looks like one-sided communications do not work without HMEM being on.

victor-anisimov avatar Mar 27 '25 21:03 victor-anisimov

libfabric:103644:1742955972::cxi:ep_data:cxip_txc_can_emit_op():515 x4519c2s3b0n0: TXC (0x42aa0:1): TXC attr size saturated
libfabric:103644:1742955972::cxi:ep_data:cxip_rma_emit_dma():359 x4519c2s3b0n0: TXC (0x42aa0:1): Failed to emit dma command: -11:Resource temporarily unavailable
libfabric:103644:1742955972::cxi:ep_data:cxip_rma_common():663 x4519c2s3b0n0: TXC (0x42aa0:1): DMA RMA read failed: buf=0xff000003c25a9060 len=26880 rkey=0x800001d0 roffset=0xf08160 nic=0x42ab0 pid=0 pid_idx=228

This is happening because the TX size attribute is being exceeded. When this happens, the CXI provider should be returning -FI_EAGAIN. What does MPICH do when it sees -FI_EAGAIN?

iziemba avatar Apr 09 '25 19:04 iziemba
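
For context on the question above: libfabric's documented contract for -FI_EAGAIN is that the caller should drive progress (read the relevant completion queue) and retry the operation. Below is a generic sketch of that pattern around a plain fi_read; it is not MPICH's actual code path, which is exactly what the question is probing.

#include <sys/types.h>
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_rma.h>

/* Retry an RMA read that may return -FI_EAGAIN when the provider's TX
 * queue is saturated, as in the cxip_txc_can_emit_op() warning above. */
static ssize_t rma_read_retry(struct fid_ep *ep, struct fid_cq *cq,
                              void *buf, size_t len, void *desc,
                              fi_addr_t dest, uint64_t raddr, uint64_t rkey,
                              void *ctx)
{
    ssize_t ret;
    do {
        ret = fi_read(ep, buf, len, desc, dest, raddr, rkey, ctx);
        if (ret == -FI_EAGAIN) {
            /* drain completions so the provider can free TX credits */
            struct fi_cq_tagged_entry wc;
            (void) fi_cq_read(cq, &wc, 1);
        }
    } while (ret == -FI_EAGAIN);
    return ret;
}

If the caller instead drops the operation when -FI_EAGAIN comes back, the origin silently stops reading from that target, which could present as the kind of hang described in this issue.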