ompi
Cannot build common/ofi as DSO
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
5.0.0rc1
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Trying to compile from source tarball
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status
.
Please describe the system on which you are running
- Operating system/version: Linux
- Computer hardware: N/A
- Network type: N/A
Details of the problem
Hi,
I am trying to cross-compile Open MPI and am getting used to the new "default static" MCA builds. I am having trouble building all components that depend on libfabric.so (which we cannot assume to be functional on all systems where our Open MPI distribution should run) as DSOs, so I'm doing:
./configure --enable-mca-dso=common-ofi,mtl-ofi --with-ofi=$HOME/.gradle/caches/cda/tpls/libfabric-1.13.1-linux-x86_64-Release-libc2_17-gnu9_2_0/
but end up with the following error:
[...]
make[2]: Leaving directory `/u/ydfb4q/tmp/ompi5-common-ofi-test/openmpi-5.0.0rc1/opal'
Making all in mca/common/ofi
make[2]: Entering directory `/u/ydfb4q/tmp/ompi5-common-ofi-test/openmpi-5.0.0rc1/opal/mca/common/ofi'
CC common_ofi.lo
LN_S libopen-palmca_common_ofi.la
CCLD libopen-palmca_common_ofi.la
make[2]: Leaving directory `/u/ydfb4q/tmp/ompi5-common-ofi-test/openmpi-5.0.0rc1/opal/mca/common/ofi'
Making all in tools/wrappers
make[2]: Entering directory `/u/ydfb4q/tmp/ompi5-common-ofi-test/openmpi-5.0.0rc1/opal/tools/wrappers'
CC opal_wrapper.o
GENERATE opal_wrapper.1
CCLD opal_wrapper
../../../opal/.libs/libopen-pal.so: undefined reference to `opal_common_ofi_is_in_list'
../../../opal/.libs/libopen-pal.so: undefined reference to `opal_mca_common_ofi_select_provider'
../../../opal/.libs/libopen-pal.so: undefined reference to `opal_common_ofi'
../../../opal/.libs/libopen-pal.so: undefined reference to `opal_common_ofi_mca_register'
../../../opal/.libs/libopen-pal.so: undefined reference to `opal_common_ofi_register_mca_variables'
../../../opal/.libs/libopen-pal.so: undefined reference to `opal_common_ofi_fini'
../../../opal/.libs/libopen-pal.so: undefined reference to `opal_common_ofi_init'
../../../opal/.libs/libopen-pal.so: undefined reference to `opal_common_ofi_mca_deregister'
collect2: error: ld returned 1 exit status
make[2]: *** [opal_wrapper] Error 1
make[2]: Leaving directory `/u/ydfb4q/tmp/ompi5-common-ofi-test/openmpi-5.0.0rc1/opal/tools/wrappers'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/u/ydfb4q/tmp/ompi5-common-ofi-test/openmpi-5.0.0rc1/opal'
make: *** [all-recursive] Error 1
Is there anything I'm doing wrong?
Thanks, Moritz
FYI @open-mpi/ofi.
Can you try the same build, but change --enable-mca-dso=common-ofi,mtl-ofi to --enable-mca-dso=common-ofi,mtl-ofi,btl-ofi?
We're probably not leaking the OFI BTL's dependency on libcommon_ofi.so properly; I'll try to take a look today. But I'm also guessing that you were explicitly trying to avoid libmpi.so having a link time dependency on libfabric.so, so you'll want to add the OFI BTL to the dso list anyway.
Specifying --enable-mca-dso=common-ofi,mtl-ofi,btl-ofi works, thanks @bwbarrett!
I wasn't aware of the (new?) BTL/ofi component while I was composing the explicit DSO list based on the MCA parameters we had in Open MPI 4.
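For anyone composing the DSO list by hand, the workaround can be sketched as a small script. This is only an illustration: the component names are the three mentioned in this thread, `join_commas` is a hypothetical helper, and `OFI_PREFIX` is a placeholder for wherever libfabric is installed. Other libfabric-dependent components, if your build has any, would need to be added to the list as well.

```shell
#!/bin/sh
# Sketch: assemble the --enable-mca-dso list for the components that link
# against libfabric, so that none of them is folded into the core libraries.
# Component names are taken from this thread; extend the list if needed.
OFI_COMPONENTS="common-ofi mtl-ofi btl-ofi"

# Hypothetical helper: join its arguments with commas.
join_commas() {
    printf '%s' "$*" | tr ' ' ','
}

# Intentionally unquoted so each component becomes a separate argument.
dso_list=$(join_commas $OFI_COMPONENTS)

# Print the command instead of running it, so it can be reviewed first.
# OFI_PREFIX is a placeholder and is left unexpanded on purpose.
echo "./configure --enable-mca-dso=${dso_list} --with-ofi=\${OFI_PREFIX}"
```

After the build, running `ldd` on the installed libmpi.so and looking for libfabric is one way to check that the core library no longer has a link-time dependency on it.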
The OFI BTL was introduced in 4.1.0 as part of work to better support the MPI one-sided interface.
I went down the rabbit hole trying to do something sane here. I think the right thing is to set the library dependencies properly so that btl-ofi (or similar components) depends on the common library explicitly when built into libmpi.so, but that means figuring out a build order of components that doesn't have a circular dependency between libmpi.so and the common library.
@bwbarrett Is this still an issue? We could document the workaround for v5.0.0 and fix post v5.0.0 if needed.
I think that the issue remains, yes.
I'll take a look at this.
I'm not sure how critical this is to fix for the 5.0.0 release since there is a workaround. I'd prefer a path where the ofi-mtl/ofi-btl configure steps fail if common-ofi is to be built as a DSO and they are not.
The workaround is pretty non-intuitive and the error message is terrible. The core issue has been here since we added the first common library, but wasn't as impactful when the default was to build dsos, since the default usually worked out.
I had tried to figure out how to fix the issue, so that you could do what the user asked (build ofi-common as a DSO but ofi-btl as part of libmpi). I think there's an unsolvable circular dependency between libmpi and libcommon-ofi because the btl needs symbols in common-ofi and common-ofi uses symbols in libopen-pal.
Maybe instead of focusing on supporting this corner case, we should focus on raising an error message. The common components are always configured before the other frameworks, so the ofi btl (and mtl and...) could check whether it is building as a DSO and, if it is not, check that BUILD_opal_common_ofi_DSO is 0, pretty-printing an error message otherwise?
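As a rough illustration of that check (not the actual configure code), the logic could look like the shell function below. The function name and its arguments are hypothetical; in a real configure script the first argument would come from BUILD_opal_common_ofi_DSO and the third from however the build system reports the component's compile mode.

```shell
#!/bin/sh
# Sketch of the proposed consistency check. All names here are hypothetical.
#   $1: 1 if common-ofi is being built as a DSO, 0 otherwise
#   $2: name of the dependent component, e.g. btl-ofi
#   $3: 1 if that component is being built as a DSO, 0 otherwise
check_common_ofi_dso() {
    common_is_dso=$1
    component=$2
    component_is_dso=$3

    # The only unsupported combination: common library as a DSO while a
    # component that needs its symbols is linked into the core library.
    if [ "$common_is_dso" = "1" ] && [ "$component_is_dso" != "1" ]; then
        echo "configure: error: common-ofi is built as a DSO but $component is not." >&2
        echo "Add $component to --enable-mca-dso, or do not build common-ofi as a DSO." >&2
        return 1
    fi
    return 0
}
```

Failing at configure time with a message like this would replace the undefined-reference errors at link time with something the user can act on.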
That’s what I’m implementing. There’s an mca macro to get the compile mode for a given component/type that is handy.
v5.0.x: https://github.com/open-mpi/ompi/pull/10784