
Swap redistribute.f90 with MPI_alltoall

Open · DenSto opened this issue 3 years ago • 7 comments

The Barcelona team mentioned that the redistribution routines used for the mirror term might be better performed using the MPI_alltoall library functions, rather than the hand-rolled routines in redistribute.f90. I don't actually have a feel for how much of an improvement we could get, but I do recall them saying that a good chunk of the redistribute time was actually latency rather than transmission, so this change could be worth a shot.
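
For reference, a minimal sketch of what a plain MPI_Alltoall call could look like in place of the point-to-point loop; the block size nblk, the buffers, and the payload here are all placeholders, not stella's actual layouts:

```fortran
! Minimal sketch (not stella code): a single collective in place of a
! hand-rolled loop of sends/receives. Assumes every rank exchanges the
! same number of elements (nblk) with every other rank.
program alltoall_sketch
  use mpi
  implicit none
  integer, parameter :: nblk = 1024          ! elements per rank pair (assumed)
  integer :: nproc, rank, ierr
  complex, allocatable :: sendbuf(:), recvbuf(:)

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)

  allocate (sendbuf(nblk*nproc), recvbuf(nblk*nproc))
  sendbuf = cmplx(rank, 0.0)                 ! placeholder payload

  ! One call replaces the loop of point-to-point messages, letting the
  ! MPI library schedule the exchange and amortise the latency.
  call MPI_Alltoall(sendbuf, nblk, MPI_COMPLEX, &
                    recvbuf, nblk, MPI_COMPLEX, &
                    MPI_COMM_WORLD, ierr)

  call MPI_Finalize(ierr)
end program alltoall_sketch
```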

DenSto avatar Jun 06 '22 01:06 DenSto

Hi, I also investigated this problem on two different machines, one with an Omni-Path network and the other with Ethernet. The latter is slower by a factor of roughly 3 to 5, even though benchmarks suggest that both networks perform equally for moderately sized (0.25 MB) messages.

SStroteich avatar Jun 06 '22 08:06 SStroteich

Is that for the redistribute.f90 routines or MPI_alltoall?

DenSto avatar Jun 06 '22 10:06 DenSto

I did not try the standard MPI_alltoall, since the slowdown appears within the procedure "c_redist_35", which from my point of view does something a little different than MPI_alltoall.

SStroteich avatar Jun 06 '22 10:06 SStroteich

OK. Given the way the distribution function is chopped up (the pieces held by each process are not all the same size), I think we would actually need MPI_alltoallv rather than plain MPI_alltoall. I don't have much experience setting up an MPI_alltoallv myself... maybe I'll ping the Barcelona team to see if they have any more thoughts.
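
For the record, the general pattern would be something along these lines. This is an illustrative sketch only; the subroutine, its arguments, and the count arrays are hypothetical and just show how per-rank counts and displacements feed MPI_Alltoallv:

```fortran
! Illustrative sketch (not stella's actual layouts): MPI_Alltoallv with
! per-rank counts and displacements, as an uneven chop of the
! distribution function would require.
subroutine redistribute_alltoallv(sendbuf, recvbuf, scounts, rcounts, comm)
  use mpi
  implicit none
  complex, intent(in)  :: sendbuf(:)
  complex, intent(out) :: recvbuf(:)
  integer, intent(in)  :: scounts(:), rcounts(:)  ! elements to/from each rank
  integer, intent(in)  :: comm
  integer :: nproc, i, ierr
  integer, allocatable :: sdispls(:), rdispls(:)

  call MPI_Comm_size(comm, nproc, ierr)
  allocate (sdispls(nproc), rdispls(nproc))

  ! Displacements are running sums of the counts (0-based offsets
  ! into the send and receive buffers).
  sdispls(1) = 0 ; rdispls(1) = 0
  do i = 2, nproc
     sdispls(i) = sdispls(i-1) + scounts(i-1)
     rdispls(i) = rdispls(i-1) + rcounts(i-1)
  end do

  call MPI_Alltoallv(sendbuf, scounts, sdispls, MPI_COMPLEX, &
                     recvbuf, rcounts, rdispls, MPI_COMPLEX, &
                     comm, ierr)
end subroutine redistribute_alltoallv
```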

DenSto avatar Jun 06 '22 12:06 DenSto

I've had similar suggestions from the ARCHER2 CSE team for GS2 as well. It was also suggested that MPI_Neighbor_alltoall is the 'correct' approach for such patterns, but that it is often implemented internally in much the same way as the code in redistribute.
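
To make that concrete, here is a hedged sketch of the neighbourhood-collective variant: build a distributed graph communicator from the (sparse) list of ranks a process actually exchanges with, then call MPI_Neighbor_alltoallv over it. The neighbour lists and counts are assumed inputs, not anything taken from GS2 or stella:

```fortran
! Sketch only: neighbourhood collective over a distributed graph
! communicator, so ranks that share no data never enter the pattern.
subroutine redistribute_neighbor(sendbuf, recvbuf, neighbors, scounts, rcounts, comm)
  use mpi
  implicit none
  complex, intent(in)  :: sendbuf(:)
  complex, intent(out) :: recvbuf(:)
  integer, intent(in)  :: neighbors(:)            ! ranks we exchange with (assumed symmetric)
  integer, intent(in)  :: scounts(:), rcounts(:)  ! counts per neighbour
  integer, intent(in)  :: comm
  integer :: nngb, i, graph_comm, ierr
  integer, allocatable :: sdispls(:), rdispls(:)

  nngb = size(neighbors)
  allocate (sdispls(nngb), rdispls(nngb))
  sdispls(1) = 0 ; rdispls(1) = 0
  do i = 2, nngb
     sdispls(i) = sdispls(i-1) + scounts(i-1)
     rdispls(i) = rdispls(i-1) + rcounts(i-1)
  end do

  ! Symmetric pattern: the same ranks appear as sources and destinations.
  call MPI_Dist_graph_create_adjacent(comm, nngb, neighbors, MPI_UNWEIGHTED, &
       nngb, neighbors, MPI_UNWEIGHTED, MPI_INFO_NULL, .false., graph_comm, ierr)

  ! Only the listed neighbours take part in the exchange.
  call MPI_Neighbor_alltoallv(sendbuf, scounts, sdispls, MPI_COMPLEX, &
                              recvbuf, rcounts, rdispls, MPI_COMPLEX, &
                              graph_comm, ierr)

  call MPI_Comm_free(graph_comm, ierr)
end subroutine redistribute_neighbor
```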

In GS2 the redistribute is often not a dense all-to-all (i.e. most processors only communicate with a small fraction of the others) when running on a sensible core count. In a pathological case where each processor talked to a large fraction of the others, I found the code grinding to a halt with the default network backend on ARCHER2. Switching the backend to UCX improved this massively without slowing down other communications. Just mentioning it here in case it is helpful.

d7919 avatar Jun 07 '22 10:06 d7919

UCX improved it a lot, but the main problem is the message size: many small messages are sent, which is inefficient on most machines. I am also not sure that every processor needs the whole grid; if not, the traffic could be cut down by a lot.

SStroteich avatar Aug 06 '22 10:08 SStroteich

> I am also not sure that every processor needs the whole grid; if not, the traffic could be cut down by a lot.

This issue probably isn't where we want to discuss the domain decomposition of stella. That being said, what sort of decomposition did you have in mind?

DenSto avatar Aug 07 '22 17:08 DenSto