Open MPI make fails with UCX, undefined reference to `ucp_tag_recv_nbx'
## Background information
**What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)**
- Open MPI: v5.0.0rc9
- UCX: v1.13.0
**Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)**
UCX was built successfully with:

```shell
git clone --recursive https://github.com/openucx/ucx.git
cd ucx
git checkout v1.13.0
git submodule update
./autogen.sh 2>&1 | tee autogen.out
./configure --prefix=$BUILD_DIR \
    --with-cuda=$CUDA_HOME \
    --disable-assertions \
    --disable-debug \
    --disable-logging \
    --disable-params-check \
    --enable-compiler-opt=3 \
    --enable-devel-headers \
    --enable-mt \
    --enable-optimizations 2>&1 | tee config-release.out
make -j32 all 2>&1 | tee make-release.out
make -j32 install 2>&1 | tee install-release.out
```
Open MPI was built with:

```shell
git clone --recursive https://github.com/open-mpi/ompi.git
cd ompi
git checkout v5.0.0rc9
git submodule update
perl autogen.pl --no-oshmem 2>&1 | tee autogen.out
./configure --prefix=$BUILD_DIR \
    --disable-io-romio \
    --disable-io-ompio \
    --disable-mpi-fortran \
    --disable-oshmem \
    --enable-mca-no-build=btl-portals4,coll-hcoll \
    --with-cuda=$CUDA_HOME \
    --with-devel-headers \
    --with-hwloc=internal \
    --with-libevent=internal \
    --with-pmix=internal \
    --with-prrte=internal \
    --enable-mca-dso=coll-cuda \
    --enable-mca-static=coll-cuda \
    --with-ucx=$BUILD_DIR 2>&1 | tee config-release.out
make -j32 all 2>&1 | tee make-release.out
make -j32 install 2>&1 | tee install-release.out
```
**If you are building/installing from a git clone, please copy-n-paste the output from `git submodule status`.**

```shell
$ git submodule status
 b22eca2c462b61533572634a0abbf212f283578d 3rd-party/openpmix (v4.2.2rc2-1-gb22eca2c)
 ab03675e5a9014418418555ceb188d2573713870 3rd-party/prrte (v3.0.0rc3-1-gab03675e5a)
```
## Please describe the system on which you are running
- Cluster: Mist of Compute Canada
- Operating system/version: Linux 4.18.0-305.72.1.el8_4.ppc64le
- Computer hardware: Each node has 32 IBM Power9 cores, 256GB RAM, and 4 NVIDIA V100-SMX2-32GB GPUs
- Network type: InfiniBand EDR
## Details of the problem
The Open MPI build fails at `make` with the following error message, complaining about unresolved symbols:
```shell
Making all in tools/ompi_info
make[2]: Entering directory '/gpfs/fs0/project/q/queenspp/sojoodi/OpenMPI-Release/ompi/ompi/tools/ompi_info'
  CC       ompi_info.o
  CC       param.o
  CCLD     ompi_info
/project/q/queenspp/sojoodi/OpenMPI-Release/ompi/opal/.libs/libopen-pal.so: undefined reference to `ucm_test_external_events'
../../../ompi/.libs/libmpi.so: undefined reference to `ucp_tag_recv_nbx'
../../../ompi/.libs/libmpi.so: undefined reference to `ucp_tag_send_nbx'
collect2: error: ld returned 1 exit status
make[2]: *** [Makefile:1356: ompi_info] Error 1
make[2]: Leaving directory '/gpfs/fs0/project/q/queenspp/sojoodi/OpenMPI-Release/ompi/ompi/tools/ompi_info'
make[1]: *** [Makefile:2682: all-recursive] Error 1
make[1]: Leaving directory '/gpfs/fs0/project/q/queenspp/sojoodi/OpenMPI-Release/ompi/ompi'
make: *** [Makefile:1409: all-recursive] Error 1
```
Also, after applying the solution discussed here (adding `LIBS="-lucm -lucs"` to the Open MPI configure command), the issue persists.
Interestingly, setting `LDFLAGS` before running `configure` resolved the problem. Shouldn't it automatically look in this directory for libs? 🤔
```shell
export LDFLAGS="-L$BUILD_DIR/lib"
./configure --prefix=$BUILD_DIR \
    --disable-io-romio \
    --disable-io-ompio \
    --disable-mpi-fortran \
    --disable-oshmem \
    --enable-mca-no-build=btl-portals4,coll-hcoll \
    --with-cuda=$CUDA_HOME \
    --with-devel-headers \
    --with-hwloc=internal \
    --with-libevent=internal \
    --with-pmix=internal \
    --with-prrte=internal \
    --enable-mca-dso=coll-cuda \
    --enable-mca-static=coll-cuda \
    --with-ucx=$BUILD_DIR 2>&1 | tee config-release.out
```
> Shouldn't it automatically look in this directory for libs? 🤔

Yes.
@open-mpi/ucx please have a look.
@amirsojoodi I've tried the above commands and they worked fine for me (on CentOS 7.9). Can you please post the output of:

```shell
cd ompi
grep pml_ucx config.status
```
@yosefe: Thanks for the follow-up.
```shell
$ grep pml_ucx config.status
S["MCA_oshmem_spml_STATIC_LTLIBS"]="mca/spml/ucx/libmca_spml_ucx.la "
S["MCA_BUILD_oshmem_spml_ucx_DSO_FALSE"]=""
S["MCA_BUILD_oshmem_spml_ucx_DSO_TRUE"]="#"
S["spml_ucx_LIBS"]="-lucp -luct -lucs -lucm "
S["spml_ucx_LDFLAGS"]=""
S["spml_ucx_CPPFLAGS"]=""
S["MCA_ompi_pml_STATIC_LTLIBS"]="mca/pml/v/libmca_pml_v.la mca/pml/ucx/libmca_pml_ucx.la mca/pml/ob1/libmca_pml_ob1.la mca/pml/cm/libmca_pml_cm.la "
S["MCA_BUILD_ompi_pml_ucx_DSO_FALSE"]=""
S["MCA_BUILD_ompi_pml_ucx_DSO_TRUE"]="#"
S["pml_ucx_LIBS"]="-lucp -luct -lucs -lucm "
S["pml_ucx_LDFLAGS"]=""
S["pml_ucx_CPPFLAGS"]=""
```
I am on a PowerPC machine with Red Hat 8.4.

By the way, even now that I can build Open MPI, I get weird segfaults. I'll post them in the next comment.
```shell
$ mpirun --mca pml ucx --mca btl ^smcuda,vader,openib,uct \
    -x UCX_TLS=rc,sm,cuda_copy,gdr_copy,cuda_ipc \
    -np 2 $BUILD_DIR/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw \
    --window-size 1 -m 2097152:67108864 H H
[mist-login01:4176932:0:4176932] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x11)
==== backtrace (tid:4176932) ====
=================================
--------------------------------------------------------------------------
prterun noticed that process rank 1 with PID 0 on node mist-login01 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
```
The error doesn't change with the CUDA benchmarks. The problem seems to be in `MPI_Init`; I tried `MPI_THREAD_SINGLE` too, but no luck. I also rebuilt UCX without `--enable-mt`, again with no luck. However, changing the pml from ucx to ob1 somehow works:
```shell
mpirun --mca pml ob1 --mca btl '^vader,tcp,openib,uct' -np 2 \
    /project/q/queenspp/sojoodi/OpenMPI-Release/build/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw \
    --window-size 1 -m 2097152:67108864 H H
```
@amirsojoodi I've tried on CentOS 8.4 and it works fine for me. Maybe there is an older version of UCX installed on your system? Can you please upload (or email me) the config.log and config-release.out files from OMPI?
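One way to check for a stale UCX is to ask the dynamic loader directly which `libucp` copies it knows about and which one a built component binds to. A sketch (the `$BUILD_DIR` path and the component location are assumptions; adjust to your install prefix):

```shell
# Sketch: look for multiple libucp copies and see which one a binary binds to.
# 1) All libucp copies registered with the dynamic loader:
ldconfig -p | grep ucp || true
# 2) Which copy the freshly built UCX PML would resolve (path is an assumption):
#   ldd $BUILD_DIR/lib/openmpi/mca_pml_ucx.so | grep ucp
# Runnable stand-in: ldd shows the libraries any dynamic binary resolves, e.g.:
ldd /bin/sh | grep libc
```

If step 1 lists a `libucp` outside `$BUILD_DIR` and step 2 binds to it, an older system UCX is shadowing the fresh build.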
@yosefe Sorry for the late reply, Yossi.

I finally got it to work with UCX. I had to explicitly disable a bunch of modules. I don't know exactly which one fixed the issue, but I'll provide the commands here just in case:
- UCX:

```shell
git clone --recursive https://github.com/openucx/ucx.git
cd ucx
git checkout v1.13.0
git submodule update --init --recursive
./autogen.sh 2>&1 | tee autogen.out
./configure --prefix=$BUILD_DIR \
    --with-cuda=$CUDA_HOME \
    --disable-assertions \
    --disable-debug \
    --disable-params-check \
    --without-knem \
    --without-xpmem \
    --without-ofi \
    --with-mlx5-dv \
    --enable-logging \
    --enable-compiler-opt=3 2>&1 | tee config-release.out
make -j32 all 2>&1 | tee make-release.out
make -j32 install 2>&1 | tee install-release.out
```
- Open MPI:

```shell
git clone --recursive https://github.com/open-mpi/ompi.git
cd ompi
git checkout v5.0.0rc9
git submodule update --init --recursive
perl autogen.pl -j 32 2>&1 | tee autogen.out
./configure --prefix=$BUILD_DIR \
    --disable-io-romio \
    --disable-io-ompio \
    --disable-mpi-fortran \
    --disable-oshmem \
    --enable-mca-no-build=btl-uct,btl-portals4,btl-ofi \
    --without-ofi \
    --without-portals4 \
    --without-ugni \
    --without-knem \
    --with-cuda=$CUDA_HOME \
    --with-cuda-libdir=$CUDA_COMPAT_PATH \
    --with-devel-headers \
    --with-hwloc=internal \
    --with-libevent=internal \
    --with-pmix=internal \
    --with-prrte=internal \
    --with-ucx=$BUILD_DIR \
    --with-ucx-libdir=$BUILD_DIR/lib 2>&1 | tee config-release.out
make -j32 all 2>&1 | tee make-release.out
make -j32 install 2>&1 | tee install-release.out
```
I am really tired of this right now; if I get a chance to figure out exactly which change fixed the issue, I'll post a follow-up comment/issue. Thanks for the help @yosefe and @jsquyres.
@yosefe: As an update, I used UCX v1.12.1 and the previous configs worked just fine. I had to disable hcoll at runtime, but other than that everything was fine, for both CUDA and host buffers, pt2pt and collectives.

Updating UCX from 1.12.1 to 1.13.1 or newer caused this weird error (similar to the previous one):
```shell
[mist-login01:3653042:0:3653042] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x11)
==== backtrace (tid:3653042) ====
=================================
[mist-login01:3653043:0:3653043] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x11)
==== backtrace (tid:3653043) ====
=================================
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 0 on node mist-login01 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
```
Maybe a GCC (10.3.0) or CUDA (11.2.2) version mismatch... no idea. Anyway, I don't know whether I should close this issue, so I'll leave it open.
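To narrow down a toolchain mismatch between the working (UCX 1.12.1) and failing (1.13.1+) builds, it can help to snapshot the versions used for each build and diff the two files. A minimal sketch; the `nvcc` and `ucx_info` lines are commented out because they assume those tools are on `PATH`:

```shell
# Sketch: capture the toolchain versions behind a build so that two build
# environments can be compared with a plain diff.
{
  uname -srm
  cc --version | head -n1
  # nvcc --version | tail -n1     # uncomment where CUDA is installed
  # ucx_info -v | head -n1        # prints the UCX version and build config
} > build-env.txt
cat build-env.txt
```

Running this once per environment and diffing the resulting `build-env.txt` files would confirm or rule out a compiler/CUDA mismatch.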
@amirsojoodi when you updated UCX did you also rebuild OpenMPI on top of it?
> @amirsojoodi when you updated UCX did you also rebuild OpenMPI on top of it?

Yes, I did.