Victor Anisimov

Results: 10 comments by Victor Anisimov

I have conducted a performance test of one-sided communications between two GPU pointers on a single node using a single-file reproducer (12 MPI ranks per node). The use of default...

The performance test of host-to-host one-sided communications on a single node shows a 3x slowdown. A reproducer that can run on either the host or the device is attached: [reproducer9-host-device.tgz](https://github.com/user-attachments/files/18431580/reproducer9-host-device.tgz)

Thank you for the suggestion, @hzhou! I linked the reproducer code against your version of MPICH by running `module unload mpich`, `export PATH=${PATH}:/home/hzhou/pull_requests/mpich-debug/_inst/bin`, and `mpif90 -fc=ifx -O0 -g -i4 -r8...`

I can run the test and am able to reproduce the backtrace after running `module unload mpich` and setting `export PALS_PMI=pmix` and `export MPIR_CVAR_DEBUG_PROGRESS_TIMEOUT=120`. Any idea what might have caused the hang in...

I tried the latest commit b3480ddfec1d9e98b06783aec97c082eadeca1a7, which includes #7117, built by @colleeneb on Aurora. The behavior of the test is unchanged. I still get 8 hangs per 50 runs of...

Thanks, @colleeneb and @abrooks98! I ran one hundred 144-node tests using the 7202 build with HMEM enabled on Aurora, and I got about 1 hang per 10-15 successful runs....

I tried the 7202 build without HMEM enabled, testing 20 independent runs. None of those 144-node jobs succeeded; all of them crashed. It looks like one-sided communications do not work...

Interestingly, the problem in implicit mode is not with the total size of the buffer but with sizes following a certain rule: `bufferSize = 1024 * 63` ...

Here is a smaller reproducer using only 2 ranks [test-small.F90.txt](https://github.com/user-attachments/files/22752403/test-small.F90.txt) [run-small-test.sh](https://github.com/user-attachments/files/22752404/run-small-test.sh)

Setting `MPIR_CVAR_CH4_IPC_GPU_RMA_ENGINE_TYPE=yaksa` helps with the small reproducer; however, the full app still crashes on 342 nodes in implicit scaling mode (one rank per GPU, 6 ranks per node)...
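For readers reproducing the workaround, a launch-script fragment along these lines applies the CVAR before the run. The `MPIR_CVAR_CH4_IPC_GPU_RMA_ENGINE_TYPE=yaksa` setting is taken from the comment above; the launcher flags and executable name are illustrative and depend on the local PALS/mpiexec setup, not taken from the attached `run-small-test.sh`.

```shell
#!/bin/bash
# Workaround from the comment: force the yaksa engine for GPU RMA
# before launching the reproducer.
export MPIR_CVAR_CH4_IPC_GPU_RMA_ENGINE_TYPE=yaksa

# Illustrative launch line: one rank per GPU, 6 ranks per node
# (the exact mpiexec flags vary by system/launcher).
mpiexec -n 12 --ppn 6 ./test-small
```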