
osu_gather and osu_gather_persistent fail starting at 8 nodes, 96 ppn

Open · longfei-austin opened this issue · 3 comments

Example:

mpiexec --np 768 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 --gpu-bind verbose,list:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.1:0.1:0.1:0.1:0.1:0.1:0.1:0.1:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.1:1.1:1.1:1.1:1.1:1.1:1.1:1.1:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.1:2.1:2.1:2.1:2.1:2.1:2.1:2.1:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.1:3.1:3.1:3.1:3.1:3.1:3.1:3.1:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.1:4.1:4.1:4.1:4.1:4.1:4.1:4.1:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.1:5.1:5.1:5.1:5.1:5.1:5.1:5.1 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/8/gather-gather_persistent-gatherv-gatherv_persistent/stage/2025-09-13_12-50-22/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_gather -m 4096:4096 -i 1000 -x 100 -f -z -d sycl

x4101c2s0b0n0.hsn.cm.aurora.alcf.anl.gov: rank 256 died from signal 11
x4117c4s0b0n0.hsn.cm.aurora.alcf.anl.gov: rank 32 died from signal 15

module load mpich-config/collective-tuning/1024 ; MPIR_CVAR_CH4_PROGRESS_THROTTLE=1 ; mpiexec --np 768 --ppn 96 --cpu-bind verbose,list:1:2:3:4:5:6:7:8:9:10:11:12:13:14:15:16:18:19:20:21:22:23:24:25:26:27:28:29:30:31:32:33:35:36:37:38:39:40:41:42:43:44:45:46:47:48:49:50:53:54:55:56:57:58:59:60:61:62:63:64:65:66:67:68:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102 --gpu-bind verbose,list:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.0:0.1:0.1:0.1:0.1:0.1:0.1:0.1:0.1:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.0:1.1:1.1:1.1:1.1:1.1:1.1:1.1:1.1:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.0:2.1:2.1:2.1:2.1:2.1:2.1:2.1:2.1:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.0:3.1:3.1:3.1:3.1:3.1:3.1:3.1:3.1:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.0:4.1:4.1:4.1:4.1:4.1:4.1:4.1:4.1:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.0:5.1:5.1:5.1:5.1:5.1:5.1:5.1:5.1 /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/8/gather-gather_persistent-gatherv-gatherv_persistent/stage/2025-09-13_12-50-22/aurora/compute/PrgEnv-intel/BuildMPIcollective_93bceebc/binaries/osu_gather -m 4096:4096 -i 1000 -x 100 -f -z -d sycl ; unset MPIR_CVAR_CH4_PROGRESS_THROTTLE ; module unload mpich-config/collective-tuning/1024

x4101c2s0b0n0.hsn.cm.aurora.alcf.anl.gov: rank 256 died from signal 11
x4103c7s2b0n0.hsn.cm.aurora.alcf.anl.gov: rank 434 died from signal 15

longfei-austin commented on Sep 30 '25

On 8 nodes, 4 errors
On 32 nodes, 8 errors
On 128 nodes, 16 errors
On 512 nodes, 52 errors

So it might be related to the total number of ranks.

longfei-austin commented on Sep 30 '25

cd /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/8/gather-gather_persistent-gatherv-gatherv_persistent/stage/2025-09-13_12-50-22/aurora/compute/PrgEnv-intel/RunMPIcollective/

tail rfm_job.out

longfei-austin commented on Oct 09 '25

We found the bug:

  • For MPI_Gather with a sufficiently large message size and a sufficient number of processes, the call takes the gather_intra_binomial algorithm, which does message combination. The algorithm constructs a struct datatype that combines the original sendbuf and the tmpbuf relative to MPI_BOTTOM.
  • MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD=1000000 is set, so the combined message goes down the OFI rndv path.
  • Because the struct datatype is non-contiguous, the rndv path picks the pipeline algorithm.
  • The pipeline algorithm uses yaksa to pack the chunks.
  • Because it is a struct datatype built on MPI_BOTTOM, an earlier hack checks the GPU attributes of each individual segment so that yaksa can optimize internally. The attributes are packed into a yaksa info object.
  • A separate optimization sets that yaksa info to NULL when the data is not from the device.
  • Everything works until completion, where the code blindly calls yaksa to free the info.
  • It segfaults because the info is NULL (a minimal sketch of this pattern follows the list).
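To make the failure mode concrete, here is a minimal C sketch of the pattern; it is not the actual MPICH or yaksa code. pack_info_t, free_pack_info, rndv_setup, and the completion functions are hypothetical stand-ins. The point is that one code path legitimately leaves the info handle NULL while the completion path frees it unconditionally; the guarded variant shows the obvious fix.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the yaksa info handle attached to a request;
 * the real code carries per-segment GPU attributes in a yaksa info. */
typedef struct pack_info {
    int placeholder;
} pack_info_t;

typedef struct rndv_request {
    pack_info_t *pack_info;   /* left NULL by the "not from device" optimization */
} rndv_request_t;

/* Stand-in for the library-side free: it dereferences its argument,
 * so passing NULL is fatal, mirroring the reported segfault. */
static void free_pack_info(pack_info_t *info)
{
    info->placeholder = 0;
    free(info);
}

/* Setup (sketch): only build the info when some segment lives on the device. */
static void rndv_setup(rndv_request_t *req, int any_segment_on_device)
{
    req->pack_info = any_segment_on_device ? calloc(1, sizeof(pack_info_t)) : NULL;
}

/* Buggy completion path: blindly frees the info, as described above. */
static void rndv_complete_buggy(rndv_request_t *req)
{
    free_pack_info(req->pack_info);
}

/* Guarded completion path: skip the free when the info was never created. */
static void rndv_complete_fixed(rndv_request_t *req)
{
    if (req->pack_info != NULL)
        free_pack_info(req->pack_info);
}

int main(void)
{
    rndv_request_t req;

    rndv_setup(&req, 0);           /* host-only data: info stays NULL */
    rndv_complete_fixed(&req);     /* fine */
    /* rndv_complete_buggy(&req);     would dereference NULL, like the signal 11 above */

    rndv_setup(&req, 1);           /* device data: info is created and freed normally */
    rndv_complete_buggy(&req);     /* works only because the info is non-NULL here */

    printf("guarded completion path handled the NULL-info case\n");
    return 0;
}
```

Whether the real fix belongs in the MPICH completion path (guarding the free) or inside yaksa (tolerating a NULL info) is a separate question; the sketch only shows why one NULL-producing optimization plus one unconditional free is enough to produce the signal 11 seen above.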

We should be able to reproduce the bug on a single node as long as the message size and number of processes are sufficient... Let me see if we can catch it in our test suite ...
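For reference, a bare-bones single-node reproducer along those lines might look like the C program below, which mirrors the parameters of the failing osu_gather runs (4096-byte messages, 96 ranks per node). This is a sketch under assumptions: it uses plain host buffers, and reaching the failing path presumably still needs the same environment as above (for example MPIR_CVAR_CH4_OFI_EAGER_THRESHOLD=1000000 and a GPU-enabled build), so it may or may not trigger the crash as written.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_SIZE 4096   /* matches "-m 4096:4096" from the failing osu_gather runs */
#define ITERS    1000   /* matches "-i 1000" */

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *sendbuf = calloc(MSG_SIZE, 1);
    char *recvbuf = NULL;
    if (rank == 0)
        recvbuf = calloc((size_t) MSG_SIZE * size, 1);

    /* A large enough message and rank count should route MPI_Gather to the
     * binomial algorithm with message combination described above. */
    for (int i = 0; i < ITERS; i++)
        MPI_Gather(sendbuf, MSG_SIZE, MPI_CHAR,
                   recvbuf, MSG_SIZE, MPI_CHAR, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("completed %d gathers across %d ranks\n", ITERS, size);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```

A launch along the lines of mpiexec --np 96 --ppn 96 ./gather_repro on one node exercises the same rank count and message size; the internal thresholds that select the binomial/message-combination path are MPICH implementation details, so treat this as a starting point rather than a confirmed reproducer.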

hzhou commented on Oct 23 '25