
Hang in one-sided communications across multiple nodes

Open victor-anisimov opened this issue 1 year ago • 60 comments

reproducer.tgz

One-sided communications between GPU pointers intermittently hang on Aurora. The attached reproducer runs on 96 nodes using six 16-node subcommunicators. One needs to run about 10-50 jobs to catch the hang. A successful job prints "All Done" in the output file; if that string is absent after the time runs out, the run hung. The job requires 2-3 minutes of walltime to complete; a job that does not finish in 5 minutes is most likely hung and will not progress.

victor-anisimov avatar Aug 27 '24 18:08 victor-anisimov
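
For readers who don't open the attachment, here is a minimal C sketch of the kind of fence-synchronized MPI_Get traffic described above. It is illustrative only: the attached reproducer is Fortran, uses GPU pointers, and runs over six 16-node subcommunicators, and the neighbor pattern and counts below are assumptions.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    const int count = 1024;           /* assumed message size */
    double *winbuf, *local;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* expose a window; the real reproducer exposes GPU (device) memory */
    MPI_Win_allocate((MPI_Aint) (count * sizeof(double)), sizeof(double),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &winbuf, &win);
    local = malloc(count * sizeof(double));

    for (int iter = 0; iter < 1000; iter++) {
        int target = (rank + 1) % size;   /* hypothetical neighbor pattern */
        MPI_Win_fence(0, win);
        MPI_Get(local, count, MPI_DOUBLE, target, 0, count, MPI_DOUBLE, win);
        MPI_Win_fence(0, win);   /* the reported hang surfaces in MPI_Win_fence */
    }

    MPI_Win_free(&win);
    free(local);
    if (rank == 0) printf("All Done\n");
    MPI_Finalize();
    return 0;
}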

@victor-anisimov I built a debug version of mpich on Aurora at /home/hzhou/pull_requests/mpich-debug/_inst. Could you try linking with that and running the reproducer with MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=1 set? The goal is to generate a backtrace as shown in https://github.com/pmodels/mpich/pull/7084

I am not sure of the easiest way to switch libmpi.so. Try

export LD_LIBRARY_PATH=/home/hzhou/pull_requests/mpich-debug/_inst/lib:$LD_LIBRARY_PATH

and verify with ldd -- [EDIT: that likely won't work due to the .so version mismatch]. If that doesn't work, try

LD_PRELOAD=/home/hzhou/pull_requests/mpich-debug/_inst/lib/libmpi.so.0

hzhou avatar Aug 28 '24 20:08 hzhou

Thank you for the suggestion, @hzhou ! I linked the reproducer code against your version of MPICH by doing

module unload mpich
export PATH=${PATH}:/home/hzhou/pull_requests/mpich-debug/_inst/bin
mpif90 -fc=ifx -O0 -g -i4 -r8 -what -fiopenmp -fopenmp-targets=spir64 test.F90
ldd ./a.out
    libmpifort.so.0 => /home/hzhou/pull_requests/mpich-debug/_inst/lib/libmpifort.so.0 (0x00007f017ddf7000)
    libmpi.so.0 => /home/hzhou/pull_requests/mpich-debug/_inst/lib/libmpi.so.0 (0x00007f017b486000)
./bulk.sh

The test job crashed at initialization with the error message:

forrtl: severe (71): integer divide by zero
Image              PC                Routine            Line      Source
libpthread-2.31.s  00001460507998C0  Unknown             Unknown   Unknown
a.out              000000000040C7E2  Unknown             Unknown   Unknown
a.out              000000000040C47D  Unknown             Unknown   Unknown
libc-2.31.so       00001460505C324D  __libc_start_main   Unknown   Unknown
a.out              000000000040C3AA  Unknown             Unknown   Unknown

I guess the MPICH build configuration causes this crash. Could you take the reproducer, which is very simple, and try to run the test?

reproducer6.tgz

victor-anisimov avatar Aug 28 '24 21:08 victor-anisimov

module unload mpich

Make sure to set PALS_PMI=pmix if you unload the mpich module.

while setting MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=1

Make it MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=10. A 1-second timeout was too short; 10 seconds seems to be sufficient for normal progress without flagging a false hang.

Could you take the reproducer,

Running 100 loops now (or until my 2 hour walltime kicks me out), will update.

hzhou avatar Aug 29 '24 04:08 hzhou

I think I got some timeouts; I need to confirm with a larger MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT -

...
Rank 0 conducted      1148200 one-sided messages
Rank 0 conducted      1148300 one-sided messages
3 pending requests in pool 0
    ac000004: MPIDI_OFI_do_irecv: source=4, tag=14, data_sz=0
/home/hzhou/MPI/lib/libmpi.so.0(+0xc1bad4) [0x14ab09776ad4]
/home/hzhou/MPI/lib/libmpi.so.0(+0x954a0b) [0x14ab094afa0b]
/home/hzhou/MPI/lib/libmpi.so.0(+0x9557cd) [0x14ab094b07cd]
/home/hzhou/MPI/lib/libmpi.so.0(+0x955167) [0x14ab094b0167]
/home/hzhou/MPI/lib/libmpi.so.0(+0x802a58) [0x14ab0935da58]
/home/hzhou/MPI/lib/libmpi.so.0(+0x6bf95b) [0x14ab0921a95b]
/home/hzhou/MPI/lib/libmpi.so.0(+0x6d1866) [0x14ab0922c866]
/home/hzhou/MPI/lib/libmpi.so.0(+0x7977d8) [0x14ab092f27d8]
/home/hzhou/MPI/lib/libmpi.so.0(+0x797aa3) [0x14ab092f2aa3]
/home/hzhou/MPI/lib/libmpi.so.0(+0x7caee0) [0x14ab09325ee0]
/home/hzhou/MPI/lib/libmpi.so.0(+0x7ca90e) [0x14ab0932590e]
/home/hzhou/MPI/lib/libmpi.so.0(+0x7cac96) [0x14ab09325c96]
/home/hzhou/MPI/lib/libmpi.so.0(+0x797f97) [0x14ab092f2f97]
/home/hzhou/MPI/lib/libmpi.so.0(+0x797c72) [0x14ab092f2c72]
/home/hzhou/MPI/lib/libmpi.so.0(+0x4db4dc) [0x14ab090364dc]
/home/hzhou/MPI/lib/libmpi.so.0(+0x4da580) [0x14ab09035580]
/home/hzhou/MPI/lib/libmpi.so.0(+0x4da3f9) [0x14ab090353f9]
/home/hzhou/MPI/lib/libmpi.so.0(+0x4da1cc) [0x14ab090351cc]
/home/hzhou/MPI/lib/libmpi.so.0(MPI_Win_fence+0x19) [0x14ab09034e69]
/home/hzhou/MPI/lib/libmpifort.so.0(mpi_win_fence_f08_+0x2d) [0x14ab0b5898ad]
./a.out() [0x412013]
./a.out() [0x40c4fd]
/lib64/libc.so.6(__libc_start_main+0xef) [0x14ab0782d24d]
./a.out() [0x40c42a]
...
Abort(546498063) on node 864: Fatal error in internal_Win_fence: Other MPI error, error stack:
internal_Win_fence(83).....................: MPI_Win_fence(assert=0, win=0xa0000000) failed
MPID_Win_fence(124)........................:
MPID_Barrier(167)..........................:
MPIDI_Barrier_allcomm_composition_json(132):
MPIDI_Barrier_intra_composition_alpha(184).:
MPIDI_NM_mpi_barrier(41)...................:
MPIR_Barrier_impl(105).....................:
MPIR_Barrier_allcomm_auto(53)..............:
MPIR_Barrier_intra_recexch(19).............:
MPIR_Allreduce_intra_recexch(238)..........:
MPIC_Waitall(570)..........................:
MPIR_Waitall(907)..........................:
MPIR_Waitall_state(809)....................: Operation timed out
Abort(546498063) on node 876: Fatal error in internal_Win_fence: Other MPI error, error stack:
internal_Win_fence(83).....................: MPI_Win_fence(assert=0, win=0xa0000000) failed
MPID_Win_fence(124)........................:
MPID_Barrier(167)..........................:
MPIDI_Barrier_allcomm_composition_json(132):
MPIDI_Barrier_intra_composition_alpha(184).:
MPIDI_NM_mpi_barrier(41)...................:
MPIR_Barrier_impl(105).....................:
MPIR_Barrier_allcomm_auto(53)..............:
MPIR_Barrier_intra_recexch(19).............:
MPIR_Allreduce_intra_recexch(238)..........:
MPIC_Waitall(570)..........................:
MPIR_Waitall(907)..........................:
MPIR_Waitall_state(809)....................: Operation timed out
...

It is stuck in an internal Barrier.

hzhou avatar Aug 29 '24 12:08 hzhou

 grep 'All Done' job*.out
job10.out:All Done
job11.out:All Done
job12.out:All Done
job13.out:All Done
job14.out:All Done
job16.out:All Done
job17.out:All Done
job18.out:All Done
job19.out:All Done
job1.out:All Done
job21.out:All Done
job22.out:All Done
job25.out:All Done
job26.out:All Done
job27.out:All Done
job28.out:All Done
job29.out:All Done
job2.out:All Done
job30.out:All Done
job31.out:All Done
job32.out:All Done
job33.out:All Done
job34.out:All Done
job35.out:All Done
job36.out:All Done
job37.out:All Done
job38.out:All Done
job39.out:All Done
job3.out:All Done
job40.out:All Done
job41.out:All Done
job42.out:All Done
job4.out:All Done
job6.out:All Done
job7.out:All Done
job8.out:All Done
job9.out:All Done
job.out:All Done

Jobs 5, 23, and 24 timed out.

hzhou avatar Aug 29 '24 12:08 hzhou

I can run the test and am able to reproduce the backtrace after setting

module unload mpich
export PALS_PMI=pmix
export MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=120

Any idea what might have caused the hang in internal_Win_fence?

victor-anisimov avatar Aug 29 '24 19:08 victor-anisimov

I can run the test and am able to reproduce the backtrace after setting

module unload mpich
export PALS_PMI=pmix
export MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=120

Any idea what might have caused the hang in internal_Win_fence?

The previous investigation found that one of the target processes somehow gets overwhelmed by MPI_GET requests and basically stops being able to send responses to the origin. We suspected a libfabric/cxi issue, but it was not clear why it would hang rather than report an error and crash. It's possible that #7117 may improve the error reporting in this case.

raffenet avatar Aug 29 '24 20:08 raffenet

Thanks @raffenet for the potential clue. I can see how that (a single process getting overwhelmed due to a lower-layer failure or congestion) could cause the issue.

hzhou avatar Aug 29 '24 21:08 hzhou

I tried the latest commit b3480ddfec1d9e98b06783aec97c082eadeca1a7, which includes #7117, built by @colleeneb on Aurora. The behavior of the test is unchanged: I still get 8 hangs per 50 runs of the test. All 8 jobs were killed by the scheduler for exceeding the walltime limit, and none of the 8 hanging jobs contains any error or warning message from MPICH. My conclusion is that either I do not know how to run MPICH properly to get those forward-progress errors printed out, or the fixes do not address the hang in one-sided communications. Could @hzhou or @raffenet try using the latest MPICH with the one-sided bug reproducer to see if the forward-progress fixes work for the present case?

victor-anisimov avatar Aug 30 '24 13:08 victor-anisimov

The one built by Colleen doesn't have --enable-g=progress enabled, I believe. We suspect your case is due to the underlying network being unable to handle certain congestion situations while not raising an error; at this point, that is just a guess. I need to think about how to further narrow down the reproducer. Currently, a 1/20 failure rate, and thus hours of runtime for a single reproduction, is a bit difficult to work with.

hzhou avatar Aug 30 '24 13:08 hzhou

Ah, thanks, it does not! I just used ./configure --prefix=$PWD/../install_main --with-device=ch4:ofi --with-libfabric=/opt/cray/libfabric/1.15.2.0 --with-ze --with-pm=no --with-pmi=pmix --with-pmix=/usr CC=icx CXX=icpx FC=ifx. I will add --enable-g=progress to the configure line to try! Is there anything else to change in the build to test?

colleeneb avatar Aug 30 '24 13:08 colleeneb

Ah, thanks, it does not! I just used ./configure --prefix=$PWD/../install_main --with-device=ch4:ofi --with-libfabric=/opt/cray/libfabric/1.15.2.0 --with-ze --with-pm=no --with-pmi=pmix --with-pmix=/usr CC=icx CXX=icpx FC=ifx. I will add --enable-g=progress to the configure line to try! Is there anything else to change in the build to test?

@colleeneb Only do that for a debug build (vs. a performance build). For a debug build, you may also want to use --enable-g=dbg,log,progress

hzhou avatar Aug 30 '24 13:08 hzhou

I ran the ANL-116 reproducer on Aurora with an instrumented libfabric. I verified that the remote key MPICH is passing into libfabric RMA operations is not a key generated by the libfabric CXI provider. There appears to be an MPICH bug in which MPICH provides an invalid remote key, which can result in hangs and failed RMA operations.

iziemba avatar Nov 06 '24 15:11 iziemba

I ran the ANL-116 reproducer on Aurora with an instrumented libfabric. I verified that the remote key MPICH is passing into libfabric RMA operations is not a key generated by the libfabric CXI provider. There appears to be an MPICH bug in which MPICH provides an invalid remote key, which can result in hangs and failed RMA operations.

@iziemba Invalid in what way? I see that user-provided keys are limited to 4 bytes in the fi_cxi manpage. It does look like MPICH could generate a key larger than that, the way the code is set up today. But looking at the error checks, I would expect MPICH to get -FI_EKEYREJECTED from fi_mr_reg and return it to the user rather than proceed and hang.

raffenet avatar Nov 06 '24 15:11 raffenet

I tried adding a dummy registration to my MPICH build just to see what happens. Unless I am missing something, it looks like libfabric will not complain about the invalid key, so we need to implement some verification inside MPICH itself. Here is the code for the registration I tried:

    int ret;
    uint64_t key = UINT64_MAX;    /* deliberately out-of-range requested key */
    void *buf = MPL_malloc(16, MPL_MEM_OTHER);
    struct fid_mr *mr;
    ret = fi_mr_reg(MPIDI_OFI_global.ctx[0].domain,
                    buf, 16,
                    FI_REMOTE_READ, 0,
                    key, 0, &mr, NULL);
    printf("ret = %d, %s\n", ret, fi_strerror(ret));

and here is the output 😕

ret = 0, Success

raffenet avatar Nov 06 '24 16:11 raffenet
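
As a hedged illustration of the "verification inside MPICH itself" suggested above, assuming the 4-byte user-key limit quoted from the fi_cxi manpage earlier in the thread; the wrapper name and placement are hypothetical, not MPICH's actual code:

#include <stdint.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_errno.h>

/* Reject requested keys wider than 4 bytes up front instead of relying on
 * the provider, which (per the experiment above) accepts them silently. */
static int mr_reg_checked(struct fid_domain *domain, void *buf, size_t len,
                          uint64_t requested_key, struct fid_mr **mr)
{
    if (requested_key > UINT32_MAX)
        return -FI_EKEYREJECTED;    /* what the provider would ideally return */
    return fi_mr_reg(domain, buf, len, FI_REMOTE_READ, 0,
                     requested_key, 0, mr, NULL);
}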

More info: after building my own libfabric and adding debug printfs, I see that the provider thinks the requested key is a provider key for some reason and therefore considers it valid. If I enable MPIR_CVAR_CH4_OFI_ENABLE_MR_PROV_KEY=1, the provider actually skips the validation step altogether, because the domain is set up to use provider keys exclusively, I believe.

raffenet avatar Nov 06 '24 17:11 raffenet

So when the user passes in an invalid requested key, it may accidentally cause cxi to think it is a provider key, take the cxip_key.is_prov branch, and somehow pass the validation there...

bool cxip_generic_is_valid_mr_key(uint64_t key)
{
        struct cxip_mr_key cxip_key = {
                .raw = key,
        };

        if (cxip_key.is_prov)
                return cxip_is_valid_prov_mr_key(key);

        return cxip_is_valid_mr_key(key);
}

raffenet avatar Nov 06 '24 17:11 raffenet

After looking closer, I no longer think MPICH will generate a key larger than 4 bytes in this scenario. It must be something else.

raffenet avatar Nov 06 '24 19:11 raffenet

For user keys, I can update the CXI provider to check the RKEY size and ensure it is <= 4 bytes.

The FI_MR_PROV_KEY key size is 8 bytes. When running with MPIR_CVAR_CH4_OFI_ENABLE_MR_PROV_KEY=1, I verified that the libfabric CXI provider was returning the expected generated RKEYs, and yet I then see MPICH use an invalid key. Example:

libfabric:130635:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c0s5b0n0: MR enabled: 0x4f40f70 (key: 0xA000DD743707002F)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x4c37c10 (key: 0xA000EF2705D8004D)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x5894a10 (key: 0xA000A22111480016)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x58947b0 (key: 0xA000A2211149001B)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x367ec40 (key: 0xA000EF2705D9003C)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x367e870 (key: 0xA000EF2705DA0028)
libfabric:125040:1730402622::cxi:core:cxip_rma_common():609<warn> x4414c7s5b0n0: TXC (0x35f41:0):: Invalid remote key: 0x455346464f5f4441
libfabric:130635:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c0s5b0n0: MR enabled: 0x4e62670 (key: 0xA000DD7437080029)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x5894c70 (key: 0xA000A221114A0024)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x5894ed0 (key: 0xA000A221114B0037)
libfabric:130635:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c0s5b0n0: MR enabled: 0x50c8b30 (key: 0xA000DD7437090027)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x4c8df10 (key: 0xA000EF2705DB0040)
libfabric:121447:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4313c5s5b0n0: MR enabled: 0x436bdf0 (key: 0xA000EF2705DC0036)
libfabric:115506:1730402622::cxi:mr:cxip_mr_enable():1160<warn> x4416c7s6b0n0: MR enabled: 0x5895390 (key: 0xA000A221114C0009)

iziemba avatar Nov 06 '24 20:11 iziemba

libfabric:125040:1730402622::cxi:core:cxip_rma_common():609<warn> x4414c7s5b0n0: TXC (0x35f41:0):: Invalid remote key: 0x455346464f5f4441

Hm, OK. The warning is from the RMA origin side. So we need to trace keys from creation at the target -> usage at the origin and see if there's corruption or something.

raffenet avatar Nov 06 '24 20:11 raffenet
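
A rough sketch of the tracing described above, comparing the key attached at the target with the key used at the origin; the helper names and the rank/target arguments are hypothetical, and real instrumentation would live inside MPICH's window-creation and RMA paths:

#include <inttypes.h>
#include <stdio.h>
#include <sys/types.h>
#include <rdma/fabric.h>
#include <rdma/fi_domain.h>
#include <rdma/fi_rma.h>

/* Target side: log the key actually attached to the registered region. */
static void trace_target_key(int rank, struct fid_mr *mr)
{
    fprintf(stderr, "[target %d] exposed rkey=0x%016" PRIx64 "\n",
            rank, fi_mr_key(mr));
}

/* Origin side: log the key just before it goes on the wire with fi_read. */
static ssize_t traced_read(int rank, int target, struct fid_ep *ep,
                           void *buf, size_t len, void *desc, fi_addr_t dest,
                           uint64_t raddr, uint64_t rkey, void *ctx)
{
    fprintf(stderr, "[origin %d -> target %d] using rkey=0x%016" PRIx64 "\n",
            rank, target, rkey);
    return fi_read(ep, buf, len, desc, dest, raddr, rkey, ctx);
}

Diffing the two logs per rank would show whether the key is corrupted between creation at the target and use at the origin.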

Potential workaround in #7202

hzhou avatar Nov 10 '24 15:11 hzhou

Update: #7202 didn't help, but turning HMEM on did, although it led to the device running out of memory.

colleeneb avatar Dec 18 '24 20:12 colleeneb

Need to investigate the out of memory issue and see if it can be resolved in MPICH or libfabric.

raffenet avatar Dec 18 '24 20:12 raffenet

@colleeneb are you able to re-test with 7202 (newly rebased on main) and with HMEM on? Now that some other GPU-related fixes are in main, we want to see the new status of this.

abrooks98 avatar Mar 25 '25 13:03 abrooks98

Yes, I will put a build of 7202 on the system soon. Thanks for following up!

colleeneb avatar Mar 25 '25 22:03 colleeneb

Thanks, @colleeneb and @abrooks98 ! I ran one hundred 144-node tests on Aurora using the 7202 build with HMEM on, and I got about 1 hang per 10-15 successful runs. That's progress, but the fix is incomplete. The error messages are of the following type:

libfabric:103644:1742955972::cxi:ep_data:cxip_txc_can_emit_op():515 x4519c2s3b0n0: TXC (0x42aa0:1): TXC attr size saturated
libfabric:103644:1742955972::cxi:ep_data:cxip_rma_emit_dma():359 x4519c2s3b0n0: TXC (0x42aa0:1): Failed to emit dma command: -11:Resource temporarily unavailable
libfabric:103644:1742955972::cxi:ep_data:cxip_rma_common():663 x4519c2s3b0n0: TXC (0x42aa0:1): DMA RMA read failed: buf=0xff000003c25a9060 len=26880 rkey=0x800001d0 roffset=0xf08160 nic=0x42ab0 pid=0 pid_idx=228

victor-anisimov avatar Mar 26 '25 02:03 victor-anisimov

libfabric:103644:1742955972::cxi:ep_data:cxip_txc_can_emit_op():515 x4519c2s3b0n0: TXC (0x42aa0:1): TXC attr size saturated
libfabric:103644:1742955972::cxi:ep_data:cxip_rma_emit_dma():359 x4519c2s3b0n0: TXC (0x42aa0:1): Failed to emit dma command: -11:Resource temporarily unavailable
libfabric:103644:1742955972::cxi:ep_data:cxip_rma_common():663 x4519c2s3b0n0: TXC (0x42aa0:1): DMA RMA read failed: buf=0xff000003c25a9060 len=26880 rkey=0x800001d0 roffset=0xf08160 nic=0x42ab0 pid=0 pid_idx=228

I'm surprised these aren't being caught by MPICH and raised as an error. If libfabric is truly in an unrecoverable state after these warnings are issued, we need to be able to capture that and abort the application rather than hang. We should build a development version of libfabric/cxi, try to force an abort in this path to capture the full backtrace, and see whether we are missing an error check or not performing the necessary recovery correctly.

raffenet avatar Mar 27 '25 14:03 raffenet

@victor-anisimov do you think you could try the 7202 build without HMEM on and see how that is?

colleeneb avatar Mar 27 '25 18:03 colleeneb

I tried the 7202 build without HMEM on, running 20 independent tests. None of those 144-node jobs succeeded; all of them crashed. It looks like one-sided communications do not work without HMEM being on.

victor-anisimov avatar Mar 27 '25 21:03 victor-anisimov

libfabric:103644:1742955972::cxi:ep_data:cxip_txc_can_emit_op():515 x4519c2s3b0n0: TXC (0x42aa0:1): TXC attr size saturated
libfabric:103644:1742955972::cxi:ep_data:cxip_rma_emit_dma():359 x4519c2s3b0n0: TXC (0x42aa0:1): Failed to emit dma command: -11:Resource temporarily unavailable
libfabric:103644:1742955972::cxi:ep_data:cxip_rma_common():663 x4519c2s3b0n0: TXC (0x42aa0:1): DMA RMA read failed: buf=0xff000003c25a9060 len=26880 rkey=0x800001d0 roffset=0xf08160 nic=0x42ab0 pid=0 pid_idx=228

This is happening because the TX size attribute is being exceeded. When this happens, the CXI provider should be returning -FI_EAGAIN. What does MPICH do when it sees -FI_EAGAIN?

iziemba avatar Apr 09 '25 19:04 iziemba
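
For context on the question above: libfabric's documented contract for -FI_EAGAIN is that the caller should drive progress (read the relevant completion queue) and retry the operation. Below is a generic sketch of that pattern around a plain fi_read; it is not MPICH's actual code path, which is exactly what the question is probing.

#include <sys/types.h>
#include <rdma/fabric.h>
#include <rdma/fi_errno.h>
#include <rdma/fi_eq.h>
#include <rdma/fi_rma.h>

/* Retry an RMA read that may return -FI_EAGAIN when the provider's TX
 * queue is saturated, as in the cxip_txc_can_emit_op() warning above. */
static ssize_t rma_read_retry(struct fid_ep *ep, struct fid_cq *cq,
                              void *buf, size_t len, void *desc,
                              fi_addr_t dest, uint64_t raddr, uint64_t rkey,
                              void *ctx)
{
    ssize_t ret;
    do {
        ret = fi_read(ep, buf, len, desc, dest, raddr, rkey, ctx);
        if (ret == -FI_EAGAIN) {
            /* drain completions so the provider can free TX credits */
            struct fi_cq_tagged_entry wc;
            (void) fi_cq_read(cq, &wc, 1);
        }
    } while (ret == -FI_EAGAIN);
    return ret;
}

If the caller instead drops the operation when -FI_EAGAIN comes back, the origin silently stops reading from that target, which could present as the kind of hang described in this issue.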