Problem using the toolchain's OpenMPI v4 to compile abacus-develop-LTSv3.10.0
Describe the bug
pcc025:rank16.abacus: Failed to modify UD QP to INIT on mlx5_0: Operation not permitted
pcc025:rank22.abacus: Failed to modify UD QP to INIT on mlx5_0: Operation not permitted
Open MPI failed an OFI Libfabric library call (fi_endpoint). This is highly unusual; your job may behave unpredictably (and/or abort) after this.
Local host: pcc025
Location: mtl_ofi_component.c:515
Error: Invalid argument (22)
Could anyone help me solve these issues?
Thanks. Best regards, Zhao
Expected behavior
No response
To Reproduce
No response
Environment
No response
Additional Context
No response
Task list for Issue attackers (only for developers)
- [ ] Verify the issue is not a duplicate.
- [ ] Describe the bug.
- [ ] Steps to reproduce.
- [ ] Expected behavior.
- [x] Error message.
- [ ] Environment details.
- [ ] Additional context.
- [ ] Assign a priority level (low, medium, high, urgent).
- [ ] Assign the issue to a team member.
- [ ] Label the issue with relevant tags.
- [ ] Identify possible related issues.
- [ ] Create a unit test or automated test to reproduce the bug (if applicable).
- [ ] Fix the bug.
- [ ] Test the fix.
- [ ] Update documentation (if necessary).
- [ ] Close the issue and inform the reporter (if applicable).
Hi @summitmoon, you can try toolchain 202502, which has been merged into the develop branch. To use it, just replace the toolchain directory in the LTS repo with the one from develop. To help us debug, please tell us about your server environment, for example:
- gcc version
- OS version
- toolchain usage

Thanks!
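One way to gather the details requested above is a small script like the following. It is only a sketch: the helper names `report` and `collect_env` are made up for illustration, and it assumes `gcc`, `mpirun`, and `cmake` are the tools of interest and that `/etc/os-release` exists on the server (it falls back to `uname` otherwise).

```shell
#!/bin/sh
# Hypothetical helper: print the first line of a command's output,
# or a note if the command is not available in PATH.
report() {
    label=$1
    shift
    if command -v "$1" >/dev/null 2>&1; then
        printf '%s: %s\n' "$label" "$("$@" 2>&1 | head -n 1)"
    else
        printf '%s: not found in PATH\n' "$label"
    fi
}

# Hypothetical helper: collect compiler, MPI, CMake, and OS information.
collect_env() {
    report "gcc" gcc --version
    report "mpirun" mpirun --version
    report "cmake" cmake --version
    if [ -r /etc/os-release ]; then
        # PRETTY_NAME holds the human-readable distribution name and version.
        printf 'OS: %s\n' "$(grep '^PRETTY_NAME=' /etc/os-release | cut -d= -f2- | tr -d '"')"
    else
        printf 'OS: %s\n' "$(uname -sr)"
    fi
}

collect_env
```

Run it in the same shell session (with the same modules loaded) that you use for the toolchain installation, and paste the output together with the exact toolchain command you used.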
The gcc version is gcc-12.3.0;
The toolchain is toolchain_gnu.sh with:
./install_abacus_toolchain.sh \
--with-gcc=system \
--with-intel=no \
--with-openblas=install \
--with-openmpi=system \
--with-cmake=install \
--with-scalapack=install \
--with-libxc=install \
--with-fftw=install \
--with-elpa=install \
--with-cereal=install \
--with-rapidjson=install \
--with-libtorch=no \
--with-libnpy=no \
--with-libri=no \
--with-libcomm=no \
--with-4th-openmpi=no
The OpenMPI version is 4.1.7, and the OS is Rocky Linux 8.10 (Green Obsidian).
@summitmoon There may be a problem with your system OpenMPI. Please try using --with-openmpi=install instead.
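As a rough sketch, the invocation would then look like the one below: the same options reported earlier in this thread, with only `--with-openmpi` switched from `system` to `install`. The variable name `TOOLCHAIN_ARGS` is just for illustration, and the script only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Options copied from the invocation quoted in this thread;
# the only change is --with-openmpi=install (was: system).
TOOLCHAIN_ARGS="--with-gcc=system \
--with-intel=no \
--with-openblas=install \
--with-openmpi=install \
--with-cmake=install \
--with-scalapack=install \
--with-libxc=install \
--with-fftw=install \
--with-elpa=install \
--with-cereal=install \
--with-rapidjson=install \
--with-libtorch=no \
--with-libnpy=no \
--with-libri=no \
--with-libcomm=no \
--with-4th-openmpi=no"

# Print the full command for review; nothing is installed by this script.
echo "./install_abacus_toolchain.sh $TOOLCHAIN_ARGS"

# To run it for real (assumed to be from inside the toolchain directory):
# ./install_abacus_toolchain.sh $TOOLCHAIN_ARGS
```

If a system OpenMPI module is still loaded in the shell, it may be worth unloading it first so the freshly built OpenMPI is the one picked up by the later packages such as ELPA.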
I get the same problem during the ELPA installation when I use --with-openmpi=install.
Can you take a screenshot of the error message from ELPA? What environments did you load during installation? @summitmoon