perf: hcoll: mpi_get_accumulate test slows down in mpi_win_fence when hcoll is enabled

Open AboorvaDevarajan opened this issue 3 years ago • 0 comments

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

OMPI main

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

git clone

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

$ git submodule status
 73aff171cb53659643c2a341a627f4ce1178df60 3rd-party/openpmix (v1.1.3-3585-g73aff171)
 154b42b6c292f965987ed994e42f9ea4b4e8e072 3rd-party/prrte (psrvr-v2.0.0rc1-4391-g154b42b6c2)

Please describe the system on which you are running

  • Operating system/version: RHEL8.4
  • Computer hardware: x86_64, ppc64le
  • Network type: none (shmem)
  • MOFED version: MLNX_OFED_LINUX-5.0-2.1.8.0

Details of the problem

Test to recreate the issue:

https://github.com/AboorvaDevarajan/mpi-snippets/blob/main/get_accumulate/get_accu2.c

Steps to run:

$ time  mpirun -np  2 --mca osc rdma ./test

CASE 3: count: 100000 PASS
CASE 3: count: 100000 PASS
CASE 3: count: 100000 PASS

real    0m17.296s
user    0m29.303s
sys     0m2.806s


[smpici@c685f8n02 get_accumulate]$ time  mpirun -np  2 --mca osc rdma --mca coll ^hcoll ./test

CASE 3: count: 100000 PASS
CASE 3: count: 100000 PASS
CASE 3: count: 100000 PASS

real    0m1.685s
user    0m0.850s
sys     0m1.264s

Summary:

With HCOLL enabled, the test takes roughly 10x longer (about 17.2s versus 1.6s wall-clock time) than with HCOLL disabled.

COLL             perf (time taken)
without HCOLL    1.6s
with HCOLL       17.2s
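
For reference, the slowdown factor can be recomputed from the `real` times reported by the two runs above. This is only a small arithmetic sketch; the variable names are illustrative and not part of the test.

```shell
# Wall-clock ("real") times in seconds, copied from the two runs above.
with_hcoll=17.296
without_hcoll=1.685

# Ratio of the two wall-clock times, rounded to one decimal place.
slowdown=$(awk -v a="$with_hcoll" -v b="$without_hcoll" 'BEGIN { printf "%.1f", a / b }')
echo "slowdown with HCOLL: ${slowdown}x"
```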

Here are the perf logs showing where the overhead is:

-   47.52%    47.52%  test     hmca_bcol_basesmuma.so          [.] hmca_bcol_basesmuma_barrier_toplevel_prog
     47.52% 0                                                                                               
        __libc_start_main                                                                                   
        generic_start_main.isra.0                                                                           
        main                                                                                                
        PMPI_Win_fence                                                                                      
        ompi_osc_rdma_fence_atomic                                                                          
        mca_coll_hcoll_barrier                                                                              
        hmca_coll_ml_barrier_intra                                                                          
        hmca_bcol_basesmuma_barrier_toplevel_progress_POWER  

The overhead seems to be in mca_coll_hcoll_barrier, which is called from MPI_Win_fence. Disabling HCOLL_ML_USE_SHMSEG_BARRIER improves the performance.

COLL                                          perf (time taken)
without HCOLL                                 1.6s
with HCOLL                                    17.2s
with HCOLL (HCOLL_ML_USE_SHMSEG_BARRIER=0)    4.7s
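
The exact command used for the third measurement is not shown above. A minimal sketch of one way to reproduce it, assuming the same `./test` binary and options as the earlier runs, is to set the variable in the environment and forward it to the ranks with mpirun's `-x` option. The guard around the call is an addition for illustration only, so that the snippet does nothing harmful on a machine without Open MPI or the test binary.

```shell
# Assumption: HCOLL reads HCOLL_ML_USE_SHMSEG_BARRIER from the process environment.
export HCOLL_ML_USE_SHMSEG_BARRIER=0

# Assumption: ./test is the binary built from get_accu2.c, as in the runs above.
if command -v mpirun >/dev/null 2>&1 && [ -x ./test ]; then
    # -x forwards the named environment variable to the launched MPI processes.
    time mpirun -np 2 --mca osc rdma -x HCOLL_ML_USE_SHMSEG_BARRIER ./test
else
    echo "mpirun or ./test not found; skipping run"
fi
```

Comparing this run against the two earlier ones isolates the cost of the shared-memory-segment barrier from the rest of the HCOLL barrier path.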

AboorvaDevarajan • Aug 17 '22 06:08