
mvapich2-tce application build fails when slurm is not installed

Open garlick opened this issue 3 years ago • 7 comments

On fluke, where only Flux is installed (no Slurm), trying to build a simple MPI hello world program fails with:


[garlick@fluke108:mpi-test]$ module list

Currently Loaded Modules:
  1) intel-classic-tce/2021.6.0   2) mvapich2-tce/2.3.6   3) StdEnv (S)

  Where:
   S:  Module is Sticky, requires --force to unload or purge

[garlick@fluke108:mpi-test]$ make
mpicc     hello.c   -o hello
ld: warning: libpmi2.so.0, needed by /usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so, not found (try using -rpath or -rpath-link)
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Finalize'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_KVS_Fence'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Abort'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Nameserv_unpublish'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Initialized'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Info_GetNodeAttr'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Job_Spawn'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Init'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_KVS_Put'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Nameserv_lookup'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Nameserv_publish'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Info_GetJobAttrIntArray'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Info_PutNodeAttr'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Info_GetJobAttr'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Job_GetId'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Info_GetNodeAttrIntArray'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_KVS_Get'
make: *** [<builtin>: hello] Error 1
[garlick@fluke108:mpi-test]$ 

Edit: see also https://lc.llnl.gov/jira/browse/TCE-29 (not public)

garlick avatar Aug 01 '22 19:08 garlick

Well this works:

[garlick@fluke108:mpi-test]$ make
mpicc    -c -o hello.o hello.c
mpicc -o hello hello.o -L/usr/lib64/flux -lpmi2

and then when flux runs the executable, it sets LD_LIBRARY_PATH appropriately.

garlick avatar Aug 01 '22 20:08 garlick
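For reference, a minimal Makefile that produces the link line above might look like the following. This is a sketch: it assumes Flux's libpmi2 lives in /usr/lib64/flux, as in the successful link shown above.

```make
CC      = mpicc
LDFLAGS = -L/usr/lib64/flux   # where Flux's libpmi2 lives (site-specific)
LDLIBS  = -lpmi2              # resolves mvapich2's undefined PMI2_* symbols

hello: hello.o
	$(CC) -o $@ $^ $(LDFLAGS) $(LDLIBS)
```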

How does slurm handle libpmi2? Is it just always installed in /usr/lib64? I'm not sure how we usually do this, but my first thought would be to treat it as an alternative, in the update-alternatives sense, and symlink one or the other into place depending on which is set up.

trws avatar Aug 04 '22 17:08 trws

> How does slurm handle libpmi2? Is it just always installed in /usr/lib64?

Yes.

> my first thought would be to treat it as an alternative, in the update-alternatives sense, and symlink one or the other into place depending on which is set up.

Not a bad thought! I was thinking we could package a symlink in an RPM that is optionally installed on flux-only clusters. Sysadmins could also maintain the symlink with ansible, which is closer to the alternatives approach.

grondo avatar Aug 04 '22 17:08 grondo
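A sketch of that symlink idea (hypothetical: it assumes Flux installs its PMI-2 library as /usr/lib64/flux/libpmi2.so.0, per the link workaround above; DESTDIR is only for staging the link, e.g. into an RPM):

```shell
# Hypothetical sketch: make Flux's PMI-2 library visible where mvapich2's
# libmpi.so expects to find libpmi2.so.0.  Staged under DESTDIR for
# packaging; on a live system DESTDIR=/ would place the link at
# /usr/lib64/libpmi2.so.0.
DESTDIR=${DESTDIR:-$(mktemp -d)}
mkdir -p "$DESTDIR/usr/lib64"
# relative target resolves to <dir-of-link>/flux/libpmi2.so.0
ln -sf flux/libpmi2.so.0 "$DESTDIR/usr/lib64/libpmi2.so.0"
```

On a system where Slurm is later installed, the real /usr/lib64/libpmi2.so.0 would conflict with this link, which is why an alternatives-style switch (or ansible-managed symlink) is attractive.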

I think that, like mpich, mvapich2 does not need to link directly with this library. In fact it should have the PMI-1 wire protocol built in, so it should not even need to dlopen any PMI DSO.

IOW, an mvapich2 config issue.

garlick avatar Aug 04 '22 18:08 garlick
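For context, the PMI-1 wire protocol mentioned above is a simple newline-delimited text exchange of `cmd=NAME key=value ...` records over the file descriptor named in the PMI_FD environment variable, so a client can speak it without any PMI library at all. A minimal sketch (not Flux's or mvapich2's actual code) of encoding and decoding those records:

```python
# Sketch of the PMI-1 wire format: each request/response is one line of
# space-separated key=value fields, the first being cmd=<name>.

def encode_pmi1(cmd, **kwargs):
    """Encode one PMI-1 request line, e.g. 'cmd=init pmi_version=1 ...'."""
    fields = [f"cmd={cmd}"] + [f"{k}={v}" for k, v in kwargs.items()]
    return " ".join(fields) + "\n"

def decode_pmi1(line):
    """Parse a PMI-1 line back into a dict of its key=value fields."""
    return dict(field.split("=", 1) for field in line.strip().split(" "))

# Example: the first exchange an MPI process makes with its launcher.
req = encode_pmi1("init", pmi_version=1, pmi_subversion=1)
resp = decode_pmi1("cmd=response_to_init rc=0 pmi_version=1 pmi_subversion=1\n")
```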

As noted in the jira ticket mentioned above (not public), the following config options result in an mvapich2 build that works both on a Flux-only system and on a system with both Slurm and Flux installed:

module --force purge
./configure \
  --enable-shared \
  --enable-romio \
  --disable-silent-rules \
  --disable-new-dtags \
  --enable-threads=multiple \
  --with-ch3-rank-bits=32 \
  --enable-wrapper-rpath=yes \
  --disable-alloc \
  --enable-fast=all \
  --disable-cuda \
  --enable-registration-cache \
  --with-device=ch3:mrail \
  --with-rdma=gen2 \
  --disable-mcast \
  --with-file-system=lustre+nfs+ufs \
  --enable-llnl-site-specific-options \
  --enable-debuginfo \
  --with-pm=hydra \
  --prefix=/g/g0/garlick/opt/mvapich2-2.3.7-1-hydra
#  --enable-fortran=all \
#  --with-pmi=pmi2 --with-pm=slurm --with-slurm=/usr

garlick avatar Aug 10 '22 21:08 garlick

I'm guessing the fortran flag is there because of the Fortran error. If so, this gets around it without disabling Fortran: FFLAGS='-fallow-argument-mismatch'.

trws avatar Aug 10 '22 21:08 trws

I actually just omitted fortran to save time on the build since I wasn't going to test it. So I don't know whether I would have hit that or not. I'll go ahead and try.

garlick avatar Aug 10 '22 21:08 garlick

Adding --with-pm=hydra and not adding --with-pm=slurm resolved this problem.

garlick avatar Aug 26 '22 13:08 garlick