osu_gather failure at 512 nodes
cd /lus/flare/projects/Aurora_testing/mpi/osu_rfm/run_collective/512/gather-gather_persistent-gatherv-gatherv_persistent/stage/2025-09-12_18-37-26/aurora/compute/PrgEnv-intel/RunMPIcollective
awk 'BEGIN{N=5} {if(prev~/Lat\(us\)/&&/Sat/){for(i=NR-N;i<NR;i++)if(i>0)print buffer[i%N];print $0;count=N} else if(count>0){print $0;count--} buffer[NR%N]=$0; prev=$0}' rfm_job.out
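This prints five lines of context on either side of every place where a Lat(us) header line is followed directly by a line matching Sat (presumably the next date stamp) instead of by results, i.e. the calls that did not return properly.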
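For anyone who wants to adapt the filter, here is the same logic spread over several lines with comments, run against a tiny made-up log. This is only a sketch: the file name sample_job.out and its contents are invented for illustration and are not taken from the real rfm_job.out, and the parentheses in the header pattern are escaped so that awk treats them literally.

```shell
# Synthetic stand-in for rfm_job.out (invented content, illustration only).
cat > sample_job.out <<'EOF'
Fri Sep 12 18:40:00 UTC 2025
# OSU MPI Gather Latency Test
# Size       Avg Lat(us)
1                  10.00
Sat Sep 13 00:10:00 UTC 2025
# OSU MPI Gatherv Latency Test
# Size       Avg Lat(us)
Sat Sep 13 00:20:00 UTC 2025
# next test
EOF

hits=$(awk -v N=5 '
{
    if (prev ~ /Lat\(us\)/ && /Sat/) {
        # Header followed directly by a "Sat" line: no results were printed.
        for (i = NR - N; i < NR; i++)      # replay up to N preceding lines
            if (i > 0) print buffer[i % N]
        print $0                           # the matching line itself
        count = N                          # then print up to N following lines
    } else if (count > 0) {
        print $0
        count--
    }
    buffer[NR % N] = $0                    # ring buffer of the last N lines
    prev = $0
}' sample_job.out)

printf '%s\n' "$hits"
```

In this sample only the Gatherv header is flagged, because the Gather header is followed by a result line. The Sat pattern is tied to the day of the run, so it would need changing for logs from another day.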
Error signature:
x4213c4s7b0n0.hsn.cm.aurora.alcf.anl.gov: rank 27421 died from signal 6
x4417c3s5b0n0.hsn.cm.aurora.alcf.anl.gov: rank 48109 died from signal 15
I suspect this is the same network jam issue.
I am also running into this same signal 6 failure in issue 7645, on the MPI_File_write_all in my 'alltoall transfer' ior run, if I increase the collective buffer size to 64MB. That run uses 256 nodes at 64ppn, so all 16k ranks are sending a 4k message to each aggregator.
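The 16k and 4k figures can be checked with a quick bit of shell arithmetic. This is a rough sketch that assumes 64MB means 64 MiB and that each rank contributes an equal share of one aggregator's collective buffer; the variable names are just for illustration.

```shell
nodes=256
ppn=64
ranks=$((nodes * ppn))               # total MPI ranks in the job
cb_bytes=$((64 * 1024 * 1024))       # collective buffer size, assuming 64 MiB
per_rank=$((cb_bytes / ranks))       # equal share of one buffer per rank (assumption)
echo "ranks=$ranks per_rank_bytes=$per_rank"
```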