Performance regression in atmospheric model when using OpenMPI – traced to slow memset
When running our atmospheric model in parallel we observe a significant performance drop with OpenMPI compared to Intel MPI + the Intel compiler. Profiling shows that almost all of the extra time is spent inside memset.

Observations
– With Intel MPI + the Intel compiler the code runs at the expected speed.
– With OpenMPI (regardless of which compiler is used to build OpenMPI and the model), the same source rebuilt against OpenMPI runs ≥ 2× slower, and the profiler attributes the loss almost entirely to memset.
– The problem persists across several recent OpenMPI releases (tested 4.1.x and 5.0.x).

Steps to reproduce
1. Build the atmospheric model with any compiler (Intel or LLVM) against Intel MPI → run time ≈ T₀ (baseline).
2. Re-build the identical source against OpenMPI (any recent version) → run time ≈ 2–4× T₀.
3. Profiling (perf, VTune, or gprof) shows > 80 % of the extra time is consumed by memset (a minimal standalone timing sketch is included below).

Environment
– OS: RHEL 7
– Compilers tested: Intel 2021.11, LLVM 17
– MPIs tested: Intel MPI 2021.11; OpenMPI 4.1.6 and 5.0.1 (both built from source and installed from distro packages)

Expected behavior
memset cost should remain small regardless of the MPI implementation, so OpenMPI performance should match Intel MPI.

Actual behavior
memset becomes the bottleneck under OpenMPI.

Additional notes
No special memset tuning flags are used in either case.
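The following is not the model itself, only a minimal sketch of the kind of standalone reproducer we could supply, assuming the slowdown is visible on a plain buffer fill. The file name, buffer size, and iteration count are placeholders; it simply times memset on an ordinary malloc'd buffer and on a buffer obtained through MPI_Alloc_mem, so the same binary can be rebuilt against Intel MPI and OpenMPI and the throughputs compared.

```c
/* memset_bench.c — hypothetical micro-benchmark, not taken from the model.
 * Times memset on a plain heap buffer and on memory from MPI_Alloc_mem,
 * to check whether the slowdown depends on how the buffer was allocated.
 * Build: mpicc -O2 memset_bench.c -o memset_bench
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BUF_BYTES  (64UL * 1024 * 1024)  /* 64 MiB per buffer (arbitrary) */
#define ITERATIONS 100                   /* repetitions per buffer (arbitrary) */

/* Fill the buffer ITERATIONS times and return elapsed wall-clock seconds. */
static double time_memset(void *buf, size_t bytes, int iters)
{
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; ++i)
        memset(buf, i & 0xFF, bytes);    /* vary the fill byte between passes */
    return MPI_Wtime() - t0;
}

int main(int argc, char **argv)
{
    int rank;
    void *heap_buf = NULL, *mpi_buf = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    heap_buf = malloc(BUF_BYTES);
    if (heap_buf == NULL)
        MPI_Abort(MPI_COMM_WORLD, 1);
    MPI_Alloc_mem((MPI_Aint)BUF_BYTES, MPI_INFO_NULL, &mpi_buf);

    double t_heap = time_memset(heap_buf, BUF_BYTES, ITERATIONS);
    double t_mpi  = time_memset(mpi_buf,  BUF_BYTES, ITERATIONS);

    if (rank == 0) {
        double gib = (double)BUF_BYTES * ITERATIONS / (1024.0 * 1024.0 * 1024.0);
        printf("memset on malloc buffer    : %.3f s (%.2f GiB/s)\n", t_heap, gib / t_heap);
        printf("memset on MPI_Alloc_mem    : %.3f s (%.2f GiB/s)\n", t_mpi,  gib / t_mpi);
    }

    MPI_Free_mem(mpi_buf);
    free(heap_buf);
    MPI_Finalize();
    return 0;
}
```

Running it once per MPI stack (e.g. `mpirun -np 1 ./memset_bench`, first built against Intel MPI, then against OpenMPI) would show whether plain memset throughput already differs between the two builds, or whether the regression only appears inside the full model.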
Please let me know if you need any additional information (build options, reproducer, or profiling data).