flux-core
mvapich2-tce application build fails when slurm is not installed
On fluke, where only Flux is installed, trying to build a simple MPI hello world program fails with:
[garlick@fluke108:mpi-test]$ module list
Currently Loaded Modules:
  1) intel-classic-tce/2021.6.0   2) mvapich2-tce/2.3.6   3) StdEnv (S)
  Where:
   S:  Module is Sticky, requires --force to unload or purge
[garlick@fluke108:mpi-test]$ make
mpicc     hello.c   -o hello
ld: warning: libpmi2.so.0, needed by /usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so, not found (try using -rpath or -rpath-link)
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Finalize'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_KVS_Fence'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Abort'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Nameserv_unpublish'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Initialized'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Info_GetNodeAttr'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Job_Spawn'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Init'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_KVS_Put'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Nameserv_lookup'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Nameserv_publish'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Info_GetJobAttrIntArray'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Info_PutNodeAttr'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Info_GetJobAttr'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Job_GetId'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_Info_GetNodeAttrIntArray'
/usr/tce/packages/mvapich2/mvapich2-2.3.6-intel-2021.6.0/lib/libmpi.so: undefined reference to `PMI2_KVS_Get'
make: *** [<builtin>: hello] Error 1
[garlick@fluke108:mpi-test]$ 
Edit: see also https://lc.llnl.gov/jira/browse/TCE-29 (not public)
Well this works:
[garlick@fluke108:mpi-test]$ make
mpicc    -c -o hello.o hello.c
mpicc -o hello hello.o -L/usr/lib64/flux -lpmi2
and then when flux runs the executable, it sets LD_LIBRARY_PATH appropriately.
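For anyone wanting to script that workaround, here is a minimal sketch of picking the link flags from whichever directory actually holds libpmi2.so. The function name pmi2_ldflags and the idea of passing a candidate list are illustrative assumptions; only /usr/lib64/flux and /usr/lib64 come from this thread.

```shell
# Print linker flags for the first directory that contains libpmi2.so.
# Hypothetical helper: the caller supplies the candidate directories.
pmi2_ldflags() {
    for dir in "$@"; do
        if [ -e "$dir/libpmi2.so" ]; then
            printf '%s\n' "-L$dir -lpmi2"
            return 0
        fi
    done
    # No candidate directory had the library.
    return 1
}

# Example: prefer a Slurm-provided copy, fall back to the Flux one.
# mpicc -o hello hello.o $(pmi2_ldflags /usr/lib64 /usr/lib64/flux)
```

This only papers over the link step; it does not change what libmpi.so itself depends on.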
How does slurm handle libpmi2, is it just always installed in /usr/lib64? I'm not sure how we usually do this, but my first thought would be to treat it as an alternative, in the update-alternatives sense, and symlink one or the other into place depending on which is set up.
> How does slurm handle libpmi2, is it just always installed in /usr/lib64?
Yes.
> my first thought would be to treat it as an alternative, in the update-alternatives sense, and symlink one or the other into place depending on which is set up.
Not a bad thought! I was thinking we could package a symlink in an RPM that is optionally installed on flux-only clusters. Sysadmins could also maintain the symlink with ansible, which is closer to the alternatives approach.
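A rough sketch of what the symlink approach could look like, whether it ends up in an RPM scriptlet or an ansible task. The helper name link_pmi2 is hypothetical, and the file name /usr/lib64/flux/libpmi2.so.0 is an assumption (the thread only confirms that -L/usr/lib64/flux -lpmi2 links, and that the linker looks for libpmi2.so.0).

```shell
# Point LIBDIR/libpmi2.so.0 at a provider's copy of the library.
# Usage: link_pmi2 PROVIDER_LIB LIBDIR
link_pmi2() {
    provider=$1
    libdir=$2
    if [ ! -e "$provider" ]; then
        echo "link_pmi2: $provider not found" >&2
        return 1
    fi
    # -f replaces an existing link so the call is safe to repeat.
    ln -sf "$provider" "$libdir/libpmi2.so.0"
}

# On a flux-only node, as root, this might be (paths assumed):
# link_pmi2 /usr/lib64/flux/libpmi2.so.0 /usr/lib64
#
# The alternatives flavor would be along the lines of
# (the name "libpmi2" and priority 10 are made up for illustration):
# update-alternatives --install /usr/lib64/libpmi2.so.0 libpmi2 \
#     /usr/lib64/flux/libpmi2.so.0 10
```

Either way, the link must not be installed where Slurm already owns /usr/lib64/libpmi2.so.0.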
I think, like mpich, mvapich does not need to link directly with this library. In fact, it should have the PMI-1 wire protocol built in, so it should not even need to dlopen any PMI DSO.
IOW, an mvapich2 config issue.
As noted in the jira ticket mentioned above (not public), the following config options result in an mvapich that works both on a flux-only system and on a system with both slurm and flux installed:
module --force purge
./configure \
  --enable-shared \
  --enable-romio \
  --disable-silent-rules \
  --disable-new-dtags \
  --enable-threads=multiple \
  --with-ch3-rank-bits=32 \
  --enable-wrapper-rpath=yes \
  --disable-alloc \
  --enable-fast=all \
  --disable-cuda \
  --enable-registration-cache \
  --with-device=ch3:mrail \
  --with-rdma=gen2 \
  --disable-mcast \
  --with-file-system=lustre+nfs+ufs \
  --enable-llnl-site-specific-options \
  --enable-debuginfo \
  --with-pm=hydra \
  --prefix=/g/g0/garlick/opt/mvapich2-2.3.7-1-hydra
#  --enable-fortran=all \
#  --with-pmi=pmi2 --with-pm=slurm --with-slurm=/usr
I’m guessing the fortran flag is there because of the Fortran error.
If so, this gets around it without disabling fortran:
FFLAGS='-fallow-argument-mismatch'
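A small sketch of applying that flag conditionally, since -fallow-argument-mismatch only exists in gfortran 10 and later and older compilers reject it. The helper name fflags_for_gfortran is hypothetical, and this assumes the Fortran compiler in use is gfortran.

```shell
# Print extra FFLAGS for a given gfortran major version.
# -fallow-argument-mismatch was introduced in GCC 10.
fflags_for_gfortran() {
    if [ "$1" -ge 10 ]; then
        printf '%s\n' '-fallow-argument-mismatch'
    fi
}

# Example (followed by the rest of the configure options):
# major=$(gfortran -dumpversion | cut -d. -f1)
# FFLAGS=$(fflags_for_gfortran "$major") ./configure --enable-fortran=all
```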
On 10 Aug 2022, at 14:39, Jim Garlick wrote:
> As noted in the jira ticket mentioned above (not public), the following config options result in an mvapich that works on a flux only system and on a system with both slurm and flux installed
I actually just omitted fortran to save time on the build since I wasn't going to test it. So I don't know whether I would have hit that or not. I'll go ahead and try.
Adding --with-pm=hydra and not adding --with-pm=slurm resolved this problem.
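To double check a rebuilt mvapich2, one option is to look at whether libmpi.so still lists libpmi2 as a dependency. A minimal sketch, assuming the library lands in lib/ under the configured prefix (the helper name needs_pmi2 is made up here):

```shell
# Succeeds if the ldd-style listing on stdin mentions libpmi2.
needs_pmi2() {
    grep -q 'libpmi2\.so'
}

# Example, using the --prefix from the configure line above
# (the lib/libmpi.so location is an assumption):
# if ldd /g/g0/garlick/opt/mvapich2-2.3.7-1-hydra/lib/libmpi.so | needs_pmi2; then
#     echo "libmpi.so still depends on libpmi2"
# fi
```

If the check comes back clean, the hello world build should no longer need -L/usr/lib64/flux -lpmi2.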