
Open MPI make fails with UCX, undefined reference to `ucp_tag_recv_nbx'

Open amirsojoodi opened this issue 2 years ago • 11 comments

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

  • Open MPI: v5.0.0rc9
  • UCX: v1.13.0

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

UCX is built successfully with:

git clone --recursive https://github.com/openucx/ucx.git
cd ucx
git checkout v1.13.0
git submodule update

./autogen.sh 2>&1 | tee autogen.out

./configure --prefix=$BUILD_DIR \
  --with-cuda=$CUDA_HOME \
  --disable-assertions \
  --disable-debug \
  --disable-logging \
  --disable-params-check \
  --enable-compiler-opt=3 \
  --enable-devel-headers \
  --enable-mt \
  --enable-optimizations 2>&1 | tee config-release.out

make -j32 all 2>&1 | tee make-release.out
make -j32 install 2>&1 | tee install-release.out

Open MPI:

git clone --recursive https://github.com/open-mpi/ompi.git
cd ompi
git checkout v5.0.0rc9
git submodule update

perl autogen.pl --no-oshmem 2>&1 | tee autogen.out

./configure --prefix=$BUILD_DIR \
  --disable-io-romio \
  --disable-io-ompio \
  --disable-mpi-fortran \
  --disable-oshmem \
  --enable-mca-no-build=btl-portals4,coll-hcoll \
  --with-cuda=$CUDA_HOME \
  --with-devel-headers \
  --with-hwloc=internal \
  --with-libevent=internal \
  --with-pmix=internal \
  --with-prrte=internal \
  --enable-mca-dso=coll-cuda \
  --enable-mca-static=coll-cuda \
  --with-ucx=$BUILD_DIR 2>&1 | tee config-release.out

make -j32 all 2>&1 | tee make-release.out
make -j32 install 2>&1 | tee install-release.out

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

$ git submodule status
 b22eca2c462b61533572634a0abbf212f283578d 3rd-party/openpmix (v4.2.2rc2-1-gb22eca2c)
 ab03675e5a9014418418555ceb188d2573713870 3rd-party/prrte (v3.0.0rc3-1-gab03675e5a)

Please describe the system on which you are running

  • Cluster: Mist of Compute Canada
  • Operating system/version: Linux 4.18.0-305.72.1.el8_4.ppc64le
  • Computer hardware: Each node has 32 IBM Power9 cores, 256GB RAM, and 4 NVIDIA V100-SMX2-32GB GPUs
  • Network type: InfiniBand EDR

Details of the problem

The Open MPI build fails at make with the following error message, complaining about unresolved symbols:

Making all in tools/ompi_info
make[2]: Entering directory '/gpfs/fs0/project/q/queenspp/sojoodi/OpenMPI-Release/ompi/ompi/tools/ompi_info'
  CC       ompi_info.o
  CC       param.o
  CCLD     ompi_info
/project/q/queenspp/sojoodi/OpenMPI-Release/ompi/opal/.libs/libopen-pal.so: undefined reference to `ucm_test_external_events'
../../../ompi/.libs/libmpi.so: undefined reference to `ucp_tag_recv_nbx'
../../../ompi/.libs/libmpi.so: undefined reference to `ucp_tag_send_nbx'
collect2: error: ld returned 1 exit status
make[2]: *** [Makefile:1356: ompi_info] Error 1
make[2]: Leaving directory '/gpfs/fs0/project/q/queenspp/sojoodi/OpenMPI-Release/ompi/ompi/tools/ompi_info'
make[1]: *** [Makefile:2682: all-recursive] Error 1
make[1]: Leaving directory '/gpfs/fs0/project/q/queenspp/sojoodi/OpenMPI-Release/ompi/ompi'
make: *** [Makefile:1409: all-recursive] Error 1
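When chasing this kind of link failure, it can help to confirm whether the UCX libraries that configure picked up actually export the symbols the linker reports as missing. A minimal sketch (the `has_symbol` helper is made up for illustration, and the `$BUILD_DIR` paths are assumptions):

```shell
# Hypothetical helper: check whether a shared library exports a given symbol.
# --defined-only skips undefined (imported) symbols; the sed strips any
# glibc-style version suffix such as printf@@GLIBC_2.2.5.
has_symbol() {
  nm -D --defined-only "$1" 2>/dev/null | awk '{print $NF}' | sed 's/@.*//' | grep -qx "$2"
}

# Against the UCX install this would be (paths are assumptions):
#   has_symbol "$BUILD_DIR/lib/libucp.so" ucp_tag_recv_nbx       && echo "exported"
#   has_symbol "$BUILD_DIR/lib/libucm.so" ucm_test_external_events && echo "exported"

# Self-check against libc so the helper itself can be exercised anywhere:
libc=$(ldd /bin/sh | awk '/libc/{print $3; exit}')
has_symbol "$libc" printf && echo "helper works"
```

If the symbols are exported but the link still fails, the linker is most likely resolving a different (older) copy of the libraries, which is worth ruling out next.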

amirsojoodi avatar Jan 31 '23 21:01 amirsojoodi

Also, the issue persists after applying the solution discussed here:

Adding LIBS="-lucm -lucs" to the Open MPI configure command.

amirsojoodi avatar Jan 31 '23 23:01 amirsojoodi

Interestingly, setting LDFLAGS before running configure resolved the problem.

Shouldn't it automatically look in this directory for libs? 🤔

export LDFLAGS="-L$BUILD_DIR/lib"

./configure --prefix=$BUILD_DIR \
  --disable-io-romio \
  --disable-io-ompio \
  --disable-mpi-fortran \
  --disable-oshmem \
  --enable-mca-no-build=btl-portals4,coll-hcoll \
  --with-cuda=$CUDA_HOME \
  --with-devel-headers \
  --with-hwloc=internal \
  --with-libevent=internal \
  --with-pmix=internal \
  --with-prrte=internal \
  --enable-mca-dso=coll-cuda \
  --enable-mca-static=coll-cuda \
  --with-ucx=$BUILD_DIR 2>&1 | tee config-release.out

amirsojoodi avatar Feb 01 '23 01:02 amirsojoodi

Shouldn't it automatically look in this directory for libs? 🤔

Yes.

@open-mpi/ucx please have a look.

jsquyres avatar Feb 01 '23 12:02 jsquyres

@amirsojoodi I've tried the above commands and it worked OK for me (on CentOS 7.9). Can you please post the output of

cd ompi
grep pml_ucx config.status

yosefe avatar Feb 02 '23 16:02 yosefe

@yosefe: Thanks for the follow up.

$ grep pml_ucx config.status
S["MCA_oshmem_spml_STATIC_LTLIBS"]="mca/spml/ucx/libmca_spml_ucx.la "
S["MCA_BUILD_oshmem_spml_ucx_DSO_FALSE"]=""
S["MCA_BUILD_oshmem_spml_ucx_DSO_TRUE"]="#"
S["spml_ucx_LIBS"]="-lucp -luct -lucs -lucm "
S["spml_ucx_LDFLAGS"]=""
S["spml_ucx_CPPFLAGS"]=""
S["MCA_ompi_pml_STATIC_LTLIBS"]="mca/pml/v/libmca_pml_v.la mca/pml/ucx/libmca_pml_ucx.la mca/pml/ob1/libmca_pml_ob1.la mca/pml/cm/libmca_pml_cm.la "
S["MCA_BUILD_ompi_pml_ucx_DSO_FALSE"]=""
S["MCA_BUILD_ompi_pml_ucx_DSO_TRUE"]="#"
S["pml_ucx_LIBS"]="-lucp -luct -lucs -lucm "
S["pml_ucx_LDFLAGS"]=""
S["pml_ucx_CPPFLAGS"]=""

I am on a PowerPC machine running Red Hat 8.4.

BTW, even now that I can build Open MPI, I am hitting weird segfaults. I'll post them in the next comment.

amirsojoodi avatar Feb 02 '23 21:02 amirsojoodi

$ mpirun --mca pml ucx --mca btl ^smcuda,vader,openib,uct \
    -x UCX_TLS=rc,sm,cuda_copy,gdr_copy,cuda_ipc \
    -np 2 $BUILD_DIR/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw \
    --window-size 1 -m 2097152:67108864 H H
    
[mist-login01:4176932:0:4176932] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x11)
==== backtrace (tid:4176932) ====
=================================
--------------------------------------------------------------------------
prterun noticed that process rank 1 with PID 0 on node mist-login01 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

The error doesn't change with the CUDA benchmarks. It seems the problem occurs in MPI_Init. I tried MPI_THREAD_SINGLE, too, but no luck. I also rebuilt UCX without --enable-mt; no luck again.

However, switching the PML from ucx to ob1 somehow works:

mpirun --mca pml ob1 --mca btl '^vader,tcp,openib,uct' -np 2 \
  /project/q/queenspp/sojoodi/OpenMPI-Release/build/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_bw \
  --window-size 1 -m 2097152:67108864 H H

amirsojoodi avatar Feb 02 '23 21:02 amirsojoodi

@amirsojoodi I've tried on CentOS 8.4 and it works fine for me. Maybe there is an older version of UCX installed on your system? Can you please upload (or email me) the config.log and config-release.out files from OMPI?
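A quick way to check for a stale UCX install shadowing the fresh one is to ask the dynamic linker and the PATH directly. A sketch (ldconfig may live in /sbin; ucx_info ships with UCX, and its -v flag prints version information):

```shell
# Sketch: look for UCX installs that could shadow the one configure found.
echo "--- UCX libraries in the dynamic linker cache ---"
(ldconfig -p 2>/dev/null || /sbin/ldconfig -p 2>/dev/null) \
  | grep -Ei 'libucp|libucs|libucm|libuct' || echo "(none found)"

echo "--- ucx_info on PATH ---"
command -v ucx_info >/dev/null 2>&1 && ucx_info -v || echo "(ucx_info not on PATH)"
```

Any hit that does not point into the intended `$BUILD_DIR` would be a candidate for the mismatch.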

yosefe avatar Feb 03 '23 15:02 yosefe

@yosefe Sorry for the late reply Yossi.

I finally got it to work with UCX. I had to explicitly disable a bunch of modules. I don't know exactly which change fixed the issue, but I'll provide the commands here just in case:

  • UCX:
git clone --recursive https://github.com/openucx/ucx.git
cd ucx
git checkout v1.13.0
git submodule update --init --recursive

./autogen.sh 2>&1 | tee autogen.out

./configure --prefix=$BUILD_DIR \
  --with-cuda=$CUDA_HOME \
  --disable-assertions \
  --disable-debug \
  --disable-params-check \
  --without-knem \
  --without-xpmem \
  --without-ofi \
  --with-mlx5-dv \
  --enable-logging \
  --enable-compiler-opt=3 2>&1 | tee config-release.out

make -j32 all 2>&1 | tee make-release.out
make -j32 install 2>&1 | tee install-release.out

  • Open MPI:
git clone --recursive https://github.com/open-mpi/ompi.git
cd ompi
git checkout v5.0.0rc9
git submodule update --init --recursive

perl autogen.pl -j 32 2>&1 | tee autogen.out

./configure --prefix=$BUILD_DIR \
  --disable-io-romio \
  --disable-io-ompio \
  --disable-mpi-fortran \
  --disable-oshmem \
  --enable-mca-no-build=btl-uct,btl-portals4,btl-ofi \
  --without-ofi \
  --without-portals4 \
  --without-ugni \
  --without-knem \
  --with-cuda=$CUDA_HOME \
  --with-cuda-libdir=$CUDA_COMPAT_PATH \
  --with-devel-headers \
  --with-hwloc=internal \
  --with-libevent=internal \
  --with-pmix=internal \
  --with-prrte=internal \
  --with-ucx=$BUILD_DIR \
  --with-ucx-libdir=$BUILD_DIR/lib 2>&1 | tee config-release.out

make -j32 all 2>&1 | tee make-release.out
make -j32 install 2>&1 | tee install-release.out

I am really tired of this right now; if I get a chance to figure out exactly which change fixed the issue, I'll post a follow-up comment/issue. Thanks for the help @yosefe and @jsquyres

amirsojoodi avatar Feb 07 '23 01:02 amirsojoodi

@yosefe: As an update, I went back to UCX v1.12.1 and the previous configs work just fine. I had to disable hcoll at runtime, but other than that everything was fine, for both CUDA and host pt2pt/collectives.

Updating UCX from 1.12.1 to 1.13.1 or newer just caused this weird error (similar to the previous one):

[mist-login01:3653042:0:3653042] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x11)
==== backtrace (tid:3653042) ====
=================================
[mist-login01:3653043:0:3653043] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x11)
==== backtrace (tid:3653043) ====
=================================
--------------------------------------------------------------------------
prterun noticed that process rank 0 with PID 0 on node mist-login01 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Maybe a GCC (10.3.0) or CUDA (11.2.2) version mismatch... no idea. Anyway, I am not sure whether I should close this issue, so I'll leave it open.
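For anyone hitting this later: one more thing worth checking when an upgraded UCX starts segfaulting is which UCX libraries libmpi.so actually resolves at run time, since a stale copy on the loader path could cause exactly this kind of crash. A minimal sketch (`deps_matching` is a made-up helper; the `$BUILD_DIR` path is an assumption):

```shell
# Hypothetical helper: print the resolved shared-library dependencies
# of a binary or library whose names match a pattern.
deps_matching() { ldd "$1" 2>/dev/null | grep -i "$2"; }

# Against the Open MPI install this would be (path is an assumption);
# every libucp/libucs/libucm/libuct line should point into $BUILD_DIR/lib:
#   deps_matching "$BUILD_DIR/lib/libmpi.so" 'libuc[pstm]'

# Self-check on /bin/sh so the helper can be exercised anywhere:
deps_matching /bin/sh libc >/dev/null && echo "resolved"
```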

amirsojoodi avatar Feb 08 '23 03:02 amirsojoodi

@amirsojoodi when you updated UCX did you also rebuild OpenMPI on top of it?

yosefe avatar Feb 08 '23 10:02 yosefe

@amirsojoodi when you updated UCX did you also rebuild OpenMPI on top of it?

Yes I did.

amirsojoodi avatar Feb 08 '23 15:02 amirsojoodi