Swap redistribute.f90 with MPI_alltoall
The Barcelona team mentioned that the redistribution routines used for the mirror term might be better performed using the MPI_alltoall library functions, rather than the routines in redistribute.f90. I don't have a feel for how much of an improvement we could get, but I do recall them saying that a good chunk of the redistribute time was actually latency rather than transmission, so this change could be worth a shot.
Hi, I also investigated this problem on two different machines, one using Omni-Path and the other using Ethernet as the network. The latter is slower by a factor of roughly 3 to 5, even though benchmarks suggest that both networks perform equally well for moderately sized (0.25 MB) messages.
Is that for the redistribute.f90 routines or MPI_alltoall?
I did not try the standard MPI_alltoall, since the slowdown appears within the procedure "c_redist_35", which from my point of view does something a little different from MPI_alltoall.
OK. I think that, given the way the distribution function is chopped up, we would actually need to use MPI_alltoallv, since each process sends a different amount of data to each of the others. I don't have much experience setting up an MPI_alltoallv myself (a rough sketch of what the call might look like is below)... maybe I'll ping the Barcelona team to see if they have any more thoughts.
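For reference, here is a minimal sketch of how such a call could be set up. The subroutine name, buffers, and count arrays are placeholders rather than the actual variables in redistribute.f90, and it assumes default-kind complex data (MPI_COMPLEX); in practice the existing redistribute layout objects would have to supply the real counts and orderings.

```fortran
! Sketch only: sendbuf/recvbuf and the count arrays are hypothetical
! placeholders, not the actual data structures in redistribute.f90.
subroutine redistribute_alltoallv(sendbuf, recvbuf, sendcounts, recvcounts, comm)
  use mpi
  implicit none
  complex, dimension(:), intent(in)  :: sendbuf
  complex, dimension(:), intent(out) :: recvbuf
  integer, dimension(:), intent(in)  :: sendcounts, recvcounts
  integer, intent(in) :: comm

  integer, dimension(size(sendcounts)) :: sdispls, rdispls
  integer :: i, nproc, ierr

  call MPI_Comm_size(comm, nproc, ierr)

  ! Displacements into the packed buffers are running sums of the counts.
  sdispls(1) = 0
  rdispls(1) = 0
  do i = 2, nproc
     sdispls(i) = sdispls(i-1) + sendcounts(i-1)
     rdispls(i) = rdispls(i-1) + recvcounts(i-1)
  end do

  ! One collective replaces the pairwise sends/receives:
  ! this rank sends sendcounts(i) elements to rank i-1 and
  ! receives recvcounts(i) elements from rank i-1.
  call MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_COMPLEX, &
                     recvbuf, recvcounts, rdispls, MPI_COMPLEX, &
                     comm, ierr)
end subroutine redistribute_alltoallv
```

The counts and displacements could be computed once when the redistribute object is initialised, so each redistribution would then cost a single collective rather than many individual messages, which is where any latency saving would come from.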
I've had similar suggestions from the ARCHER2 CSE team for GS2 as well. It was also suggested that MPI_Neighbor_alltoall is the 'correct' approach for such patterns, but that this is often implemented in a similar way to the code in redistribute (a sketch of the neighbourhood-collective variant is below).
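For illustration, a rough sketch of that neighbourhood-collective approach, assuming each process knows the list of ranks it actually exchanges data with; all names here are hypothetical placeholders, not existing stella/GS2 routines:

```fortran
! Sketch only: the neighbour list, buffers and counts are placeholders.
! The idea is to restrict the collective to the processes that actually
! exchange data, rather than involving the full communicator.
subroutine redistribute_neighbor(sendbuf, recvbuf, neighbors, sendcounts, recvcounts, comm)
  use mpi
  implicit none
  complex, dimension(:), intent(in)  :: sendbuf
  complex, dimension(:), intent(out) :: recvbuf
  integer, dimension(:), intent(in)  :: neighbors            ! ranks this process talks to
  integer, dimension(:), intent(in)  :: sendcounts, recvcounts
  integer, intent(in) :: comm

  integer, dimension(size(neighbors)) :: sdispls, rdispls
  integer :: i, graph_comm, ierr

  ! Build a graph communicator whose edges are the actual communication pattern.
  call MPI_Dist_graph_create_adjacent(comm, &
       size(neighbors), neighbors, MPI_UNWEIGHTED, &
       size(neighbors), neighbors, MPI_UNWEIGHTED, &
       MPI_INFO_NULL, .false., graph_comm, ierr)

  sdispls(1) = 0
  rdispls(1) = 0
  do i = 2, size(neighbors)
     sdispls(i) = sdispls(i-1) + sendcounts(i-1)
     rdispls(i) = rdispls(i-1) + recvcounts(i-1)
  end do

  ! Data is exchanged only along the edges of graph_comm.
  call MPI_Neighbor_alltoallv(sendbuf, sendcounts, sdispls, MPI_COMPLEX, &
                              recvbuf, recvcounts, rdispls, MPI_COMPLEX, &
                              graph_comm, ierr)

  call MPI_Comm_free(graph_comm, ierr)
end subroutine redistribute_neighbor
```

Whether this actually beats the hand-written sends/receives presumably depends on how well the MPI implementation exploits the sparsity of the communication pattern.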
In GS2 the redistribute is often not a dense all-to-all (i.e. most processors only communicate with a small fraction of the others) when running on a sensible core count. In a pathological case where each processor talked to a large fraction of the others, I found the code grinding to a halt with the default network backend on ARCHER2. Switching the backend to UCX improved this massively without slowing down other communications. Just mentioning it here in case it is helpful.
UCX improved it a lot, but the main problem is the size of the messages: a bunch of small messages is sent, which is inefficient on most machines. I am also not sure whether every processor needs the whole grid; if not, that would cut down the traffic by a lot.
This issue probably isn't the place to discuss the domain decomposition of stella. That being said, what sort of decomposition did you have in mind?