perf: hcoll: mpi_get_accumulate test slows down in mpi_win_fence when hcoll is enabled
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
OMPI main
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
git clone
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
$ git submodule status
73aff171cb53659643c2a341a627f4ce1178df60 3rd-party/openpmix (v1.1.3-3585-g73aff171)
154b42b6c292f965987ed994e42f9ea4b4e8e072 3rd-party/prrte (psrvr-v2.0.0rc1-4391-g154b42b6c2)
Please describe the system on which you are running
- Operating system/version: RHEL8.4
- Computer hardware: x86_64, ppc64le
- Network type: none (shmem)
- MOFED version: MLNX_OFED_LINUX-5.0-2.1.8.0
Details of the problem
Test to recreate the issue:
https://github.com/AboorvaDevarajan/mpi-snippets/blob/main/get_accumulate/get_accu2.c
Steps to run:
$ time mpirun -np 2 --mca osc rdma ./test
CASE 3: count: 100000 PASS
CASE 3: count: 100000 PASS
CASE 3: count: 100000 PASS
real 0m17.296s
user 0m29.303s
sys 0m2.806s
$ time mpirun -np 2 --mca osc rdma --mca coll ^hcoll ./test
CASE 3: count: 100000 PASS
CASE 3: count: 100000 PASS
CASE 3: count: 100000 PASS
real 0m1.685s
user 0m0.850s
sys 0m1.264s
Summary:
With HCOLL enabled, the test runs about 10x slower than with HCOLL disabled:
| COLL | perf (time taken) |
|---|---|
| without HCOLL | 1.6s |
| with HCOLL | 17.2s |
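For reference, the ratio can be checked from the `real` times in the two runs above. This is only a small sketch; the `slowdown` helper is a hypothetical name introduced here, not part of the test or of Open MPI.

```shell
#!/usr/bin/env bash
# Hypothetical helper: ratio of two wall-clock ("real") times in seconds.
# $1 = time with HCOLL enabled, $2 = time with HCOLL disabled.
slowdown() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# "real" values copied from the two runs above.
slowdown 17.296 1.685   # prints 10.3
```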
Here are the perf logs showing where the overhead is:
- 47.52% 47.52% test hmca_bcol_basesmuma.so [.] hmca_bcol_basesmuma_barrier_toplevel_prog
47.52% 0
__libc_start_main
generic_start_main.isra.0
main
PMPI_Win_fence
ompi_osc_rdma_fence_atomic
mca_coll_hcoll_barrier
hmca_coll_ml_barrier_intra
hmca_bcol_basesmuma_barrier_toplevel_progress_POWER
The overhead seems to be in mca_coll_hcoll_barrier, which is called from MPI_Win_fence. Disabling HCOLL_ML_USE_SHMSEG_BARRIER improves the performance:
| COLL | perf (time taken) |
|---|---|
| without HCOLL | 1.6s |
| with HCOLL | 17.2s |
| with HCOLL (HCOLL_ML_USE_SHMSEG_BARRIER=0) | 4.7s |
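A rough sketch of how the three rows could be reproduced. The first two command lines are the ones shown above. The third is an assumption: the report does not say how HCOLL_ML_USE_SHMSEG_BARRIER=0 was set, so this sketch exports it to the ranks with mpirun's `-x` option. The `build_cmd` helper and the case names are hypothetical, and `./test` is assumed to be built from get_accu2.c.

```shell
#!/usr/bin/env bash
# Sketch only: prints the command line for each configuration in the table.
# Assumption (not from the report): the HCOLL variable is passed with "-x".

build_cmd() {
    case "$1" in
        no_hcoll)
            echo "mpirun -np 2 --mca osc rdma --mca coll ^hcoll ./test" ;;
        hcoll)
            echo "mpirun -np 2 --mca osc rdma ./test" ;;
        hcoll_no_shmseg)
            echo "mpirun -np 2 -x HCOLL_ML_USE_SHMSEG_BARRIER=0 --mca osc rdma ./test" ;;
        *)
            echo "unknown case: $1" >&2
            return 1 ;;
    esac
}

for c in no_hcoll hcoll hcoll_no_shmseg; do
    cmd=$(build_cmd "$c")
    echo "[$c] $cmd"
    # By default the commands are only printed; set RUN=1 to execute and time them.
    if [ "${RUN:-0}" = "1" ]; then
        time $cmd
    fi
done
```

Running the script with `RUN=1` executes and times each case in turn. Setting the variable in the environment before calling mpirun may work as well, depending on how the environment is forwarded to the ranks.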