rma: Improve scalability of MPI_Win_allocate_shared
Discussed in https://github.com/pmodels/mpich/discussions/5931
Originally posted by ashwinraghu on April 6, 2022:

A field report noted that the execution time of MPI_Win_allocate_shared() increases with the amount of memory being allocated. For example, on a single node with 4 ranks, it takes 3 secs for 8 GB, 6 secs for 16 GB, and 12 secs for 32 GB.

I did some instrumentation of the 3.4.x implementation using callgrind. The resulting report shows the implementation spending a good part of its time executing instructions in the msync() system call, reached via MPIDIU_get_shm_symheap() -> generate_random_addr() -> check_maprange_ok():
(Column 1 is the number of instructions.)

```
520,123,460 (96.79%) < ???:MPIDIU_get_shm_symheap (1x)
251,658,345 (46.83%) * ???:generate_random_addr
218,103,795 (40.59%) > ???:msync (16,777,215x)
 50,331,648 ( 9.37%) > ???:__errno_location (16,777,216x)
```
Based on my reading of check_maprange_ok(), it tries to detect a virtual address range of a given size that is free (unmapped). The base address of this range is then broadcast and used to map the shared memory at the same address in all processes.
The question: what is the motivation for ensuring that the base address is identical across all processes, given that any load/store access to the shared memory happens only after fetching the base address via MPI_Win_shared_query()? It is also apparent that the fallback method simply uses whatever address mmap() returns anyway.
Based on discussion #5931, I suggest making MPI_Win_allocate_shared() more scalable, possibly by removing the symmetric-heap checks across ranks.
Note that MPI_Win_allocate() will still need to ensure the symmetry of the allocation across ranks, so check_maprange_ok() may still need to be improved. For example, checking multiple pages per msync() call instead of one at a time might be worth looking at.
I am pretty sure it is not intentional to try symmetric allocation on the shared window since it only has a single allocation anyway. I believe this is either a bug in the original implementation or we have somehow introduced the bug at some point.
> on a single node with 4 ranks, it takes 3 secs for 8GB, 6 secs for 16GB and 12 secs for 32GB.
This is curious. I suspect the linear proportionality might be a coincidence. A larger size may require more attempts to find a suitable range from random generation, but I wouldn't expect an exact proportion. We should probably be smarter about the random part. For example, if we survey the memory mappings and only try random addresses within suitable gaps, we may succeed more easily.
@ashwinraghu Could you test https://github.com/pmodels/mpich/pull/5966?
First attempt at cherry-picking wasn't successful; there are some conflicts to deal with. I'm also busy with some high-priority issues at the moment.