perf: hcoll: mpi_get_accumulate test slows down in mpi_win_fence when hcoll is enabled

Open AboorvaDevarajan opened this issue 3 years ago • 0 comments

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

OMPI main

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

If you are building/installing from a git clone, please copy-n-paste the output from `git submodule status`.

$ git submodule status
 73aff171cb53659643c2a341a627f4ce1178df60 3rd-party/openpmix (v1.1.3-3585-g73aff171)
 154b42b6c292f965987ed994e42f9ea4b4e8e072 3rd-party/prrte (psrvr-v2.0.0rc1-4391-g154b42b6c2)

Please describe the system on which you are running

Operating system/version: RHEL8.4
Computer hardware: x86_64, ppc64le
Network type: none (shmem)
MOFED version: MLNX_OFED_LINUX-5.0-2.1.8.0

Details of the problem

Test to recreate the issue:

https://github.com/AboorvaDevarajan/mpi-snippets/blob/main/get_accumulate/get_accu2.c

Steps to run:

$ time  mpirun -np  2 --mca osc rdma ./test

CASE 3: count: 100000 PASS
CASE 3: count: 100000 PASS
CASE 3: count: 100000 PASS

real    0m17.296s
user    0m29.303s
sys     0m2.806s


$ time  mpirun -np  2 --mca osc rdma --mca coll ^hcoll ./test

CASE 3: count: 100000 PASS
CASE 3: count: 100000 PASS
CASE 3: count: 100000 PASS

real    0m1.685s
user    0m0.850s
sys     0m1.264s

Summary:

With HCOLL enabled, the test takes roughly 10x longer than with HCOLL disabled (real time 17.296s vs. 1.685s).

| COLL | perf (time taken) |
|---|---|
| without HCOLL | 1.6s |
| with HCOLL | 17.2s |

Here are the perf logs showing where the overhead is:

-   47.52%    47.52%  test     hmca_bcol_basesmuma.so          [.] hmca_bcol_basesmuma_barrier_toplevel_prog
     47.52% 0                                                                                               
        __libc_start_main                                                                                   
        generic_start_main.isra.0                                                                           
        main                                                                                                
        PMPI_Win_fence                                                                                      
        ompi_osc_rdma_fence_atomic                                                                          
        mca_coll_hcoll_barrier                                                                              
        hmca_coll_ml_barrier_intra                                                                          
        hmca_bcol_basesmuma_barrier_toplevel_progress_POWER

The overhead seems to be in `mca_coll_hcoll_barrier`, which is called from `MPI_Win_fence`. Disabling `HCOLL_ML_USE_SHMSEG_BARRIER` improves the performance, although the run is still slower than with HCOLL disabled.

| COLL | perf (time taken) |
|---|---|
| without HCOLL | 1.6s |
| with HCOLL | 17.2s |
| with HCOLL (HCOLL_ML_USE_SHMSEG_BARRIER=0) | 4.7s |
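
The command used for the `HCOLL_ML_USE_SHMSEG_BARRIER=0` measurement is not shown above. A minimal sketch of one way to reproduce it, assuming the variable is forwarded to the ranks with mpirun's `-x` option (this is an assumption, not necessarily how it was set for the reported run):

```shell
# Assumption: HCOLL_ML_USE_SHMSEG_BARRIER=0 is passed to every rank via mpirun -x.
# ./test is the binary built from get_accu2.c, as in the steps to run above.
time mpirun -np 2 -x HCOLL_ML_USE_SHMSEG_BARRIER=0 --mca osc rdma ./test
```

On a single node, exporting the variable in the shell before calling mpirun should have the same effect.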

Aug 17 '22 06:08 AboorvaDevarajan
