perf: MPI_Bcast shows significant variance across algorithms (default is very slow)
Overview
I'm seeing significant performance differences in an application code on Aurora depending on the MPIR_CVAR_BCAST_INTRA_ALGORITHM setting used.
On 512 nodes:
MPIR_CVAR_BCAST_INTRA_ALGORITHM=auto: 30.4 s
MPIR_CVAR_BCAST_INTRA_ALGORITHM=binomial: 1.1 s
MPIR_CVAR_BCAST_INTRA_ALGORITHM=nb: 1.8 s
MPIR_CVAR_BCAST_INTRA_ALGORITHM=smp: 28.9 s
MPIR_CVAR_BCAST_INTRA_ALGORITHM=scatter_recursive_doubling_allgather: 0.9 s
MPIR_CVAR_BCAST_INTRA_ALGORITHM=scatter_ring_allgather: 31.9 s
MPIR_CVAR_BCAST_INTRA_ALGORITHM=pipelined_tree: 1.0 s
MPIR_CVAR_BCAST_INTRA_ALGORITHM=tree: 1.0 s
The timing difference comes from a section of the code that exchanges data between ranks via MPI_Bcast calls. A simpler reproducer is below.
Reproducer
A smaller reproducer that is similar to (but not exactly the same as) what the application code does:
> cat bcast.cpp
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);

    // Get my rank and the communicator size
    int my_rank;
    int size;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int nthread_per_group = 12;
    int ngroups = size / nthread_per_group;
    if (my_rank == 0)
        printf("groups: %d and thread per group: %d\n", ngroups, nthread_per_group);

    int m2t = 152670;
    double* output = (double*) malloc(sizeof(double) * m2t * ngroups);

    double final, start;
    MPI_Barrier(MPI_COMM_WORLD);
    start = omp_get_wtime();
    // One broadcast per group, each rooted at the group's first rank
    for (int i = 0; i < ngroups; i++) {
        MPI_Bcast(output + i * m2t, m2t * sizeof(double), MPI_BYTE,
                  i * nthread_per_group, MPI_COMM_WORLD);
    }
    MPI_Barrier(MPI_COMM_WORLD);
    final = omp_get_wtime();

    if (my_rank == 0) printf("time (s): %lf\n", final - start);

    free(output);
    MPI_Finalize();
    return 0;
}
Compile
mpicxx -fiopenmp -mcmodel=large -lm -ldl -g -O2 bcast.cpp -o bcast_cpp
Runscript for Aurora (on 512 nodes)
#!/bin/bash
#PBS -l select=512
#PBS -l walltime=0:45:00
#PBS -A performance
#PBS -q prod
#PBS -l filesystems=flare:home
cd ${PBS_O_WORKDIR}
set -x
NNODES=`wc -l < $PBS_NODEFILE`
export RANKS_PER_NODE_CPP=12 # Number of MPI ranks per node
NRANKS_CPP=$(( NNODES * RANKS_PER_NODE_CPP ))
envs=( auto binomial nb smp scatter_recursive_doubling_allgather scatter_ring_allgather pipelined_tree tree)
for e in "${envs[@]}"; do
    echo "time $e"
    MPIR_CVAR_BCAST_INTRA_ALGORITHM=$e mpiexec -n ${NRANKS_CPP} -ppn ${RANKS_PER_NODE_CPP} --cpu-bind depth -d 8 ./bcast_cpp
done
Output for Reproducer
time auto
time (s): 545.282936
time binomial
time (s): 2.528068
time nb
time (s): 5.352969
time smp
time (s): 544.941390
time scatter_recursive_doubling_allgather
time (s): 1.466150
time scatter_ring_allgather
time (s): 544.884321
time pipelined_tree
time (s): 3.060392
time tree
time (s): 1.625567
The reproducer does not show exactly the same times as the full application, but the general trend holds: auto, smp, and scatter_ring_allgather are the slowest. It seems the default Bcast choice is not the best for this test case.
I'm not suggesting the default be changed, since this is just one test case, but I wanted to make the team aware.
The scatter_ring_allgather algorithm splits the "large" message into chunks, then runs P-1 rounds of ring circulation. When P is this large (512 x 12 = 6144), we over-split the message and each ring round becomes latency-bound. The issue is that the default "auto" selection makes its choice based only on message size; we clearly need a limit on the communicator size as well. The ideal algorithm is the "k-ring" algorithm, which can be tuned between chunk size and the number of processes. I believe @mjwilkins18 already has this algorithm; let's consider getting it into production. Meanwhile, a quick patch adding a comm_size limit to the selection logic should fix this issue.