
Open MPI dependency broken (temporary resolution available)

Open yonghoonlee opened this issue 1 year ago • 4 comments

Description

Open MPI creates segmentation faults on Linux machines.

Steps to reproduce issue

  1. Follow the WEIS develop branch installation instructions; the mpi=1.0=openmpi build is then installed along with mpi4py=3.1.6.

  2. Run any MPI job; it will fail. More (generalized) information can be found on this page

  3. A temporary resolution is to install the mpich variant of MPI instead of openmpi. When installing WEIS, install the specific MPI build:

conda install -y petsc4py mpi4py mpi=1.0=mpich pyoptsparse     # (Mac / Linux only)

instead of running

conda install -y petsc4py mpi4py pyoptsparse     # (Mac / Linux only)

so that the mpich variant of MPI is forced.
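To double-check which MPI variant conda actually resolved, you can inspect the version banner printed by `mpiexec --version` (or returned by `mpi4py`'s `MPI.Get_library_version()`). A minimal sketch of such a check; the banner strings below are illustrative assumptions, not captured output:

```python
def detect_mpi_flavor(banner: str) -> str:
    """Classify a version banner as 'openmpi', 'mpich', or 'unknown'.

    The banner can come from `mpiexec --version` or from
    mpi4py's MPI.Get_library_version().
    """
    text = banner.lower()
    # Open MPI banners typically mention "Open MPI" or "OpenRTE".
    if "open mpi" in text or "openrte" in text:
        return "openmpi"
    # MPICH banners typically mention "MPICH" or its HYDRA launcher.
    if "mpich" in text or "hydra" in text:
        return "mpich"
    return "unknown"

if __name__ == "__main__":
    # Hypothetical banners for demonstration:
    print(detect_mpi_flavor("mpiexec (OpenRTE) 4.1.1"))  # openmpi
    print(detect_mpi_flavor("MPICH Version: 4.0"))       # mpich
```

If this reports `openmpi` and you hit the segfault above, reinstalling with the `mpi=1.0=mpich` build spec is the workaround described in this thread.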

Current behavior

If installed without specifying mpich, then:

[log02:599094:0:599094] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:599097:0:599097] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:599095:0:599095] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)
[log02:599096:0:599096] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x3ff00000001)

Expected behavior

Code versions

yonghoonlee avatar Jul 24 '24 09:07 yonghoonlee

For now, the issue can be temporarily circumvented by specifying mpich as described above. I hope the dependency issue can soon be resolved at the conda-forge feedstock level.

yonghoonlee avatar Jul 24 '24 09:07 yonghoonlee

@yonghoonlee, is this happening on Kestrel or another linux machine?

dzalkind avatar Jul 31 '24 20:07 dzalkind

@dzalkind Yes, it happened on all my Linux machines as well as the HPC systems I am currently using, including Kestrel and the UofM HPC. When you specify mpich, it works fine. Otherwise, conda automatically selects the MPI dependency, which could be either openmpi or mpich. If openmpi is automatically selected, the problem persists.

There are two workarounds I found (and tested) based on discussions I had with the mpi4py and openmpi communities:

  1. Use mpich instead of openmpi: install mpich along with mpi4py; openmpi will then not be installed.
  2. Install ucx along with openmpi: it seems that certain versions of ucx shipped on many Linux distributions (both Debian- and RedHat-based distros) cause issues with certain versions of openmpi. Install ucx along with mpi4py, and the openmpi installed with mpi4py will then work fine with the most up-to-date version of ucx.
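Expressed as conda commands, the two workarounds might look like this. Workaround 1 matches the command given earlier in the thread; the ucx spelling in workaround 2 is an assumption (conda-forge ships a `ucx` package) and may need adjusting to your channel setup:

```shell
# Workaround 1: pin the mpich build of the MPI metapackage,
# so openmpi is never pulled in
conda install -y petsc4py mpi4py mpi=1.0=mpich pyoptsparse

# Workaround 2 (assumed spelling): keep openmpi but request an
# up-to-date ucx alongside it
conda install -y petsc4py mpi4py ucx pyoptsparse
```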

yonghoonlee avatar Jul 31 '24 20:07 yonghoonlee

Solution from @yonghoonlee

conda install -y petsc4py mpi4py mpich pyoptsparse # (Mac / Linux only)

(also install mpich)

dzalkind avatar Aug 06 '24 15:08 dzalkind