ompi
ompi copied to clipboard
openmpi-4.1.6: ldflags in pkg-config incompatible with CMake's pkg_check_modules(... IMPORTED_TARGET ...)
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v4.1.6
❯ nix eval github:NixOS/nixpkgs/91a00709aebb3602f172a0bf47ba1ef013e34835#openmpi.version
"4.1.6"
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
nix build github:NixOS/nixpkgs/91a00709aebb3602f172a0bf47ba1ef013e34835#openmpi
Details of the problem
I'm not sure if this has been addressed on master yet (I don't think it has), but in v4.1.6 the generated .pc files' Libs contain a number of occurrences of the pattern -Wl,-rpath -Wl,... where the argument to -rpath is passed on as a separate token. CMake only keeps the first -Wl,-rpath. This issue can be alleviated by passing the the paths as -Wl,-rpath,${libfabricPath}
❯ nix build github:NixOS/nixpkgs/91a00709aebb3602f172a0bf47ba1ef013e34835#openmpi
❯ cat result/lib/pkgconfig/ompi-c.pc
...
Libs: ... -Wl,-rpath -Wl,${libdir} -Wl,-rpath -Wl,/nix/store/iyzz6m4dr6grn10qi67a1g1lp3wrnk7j-libfabric-1.20.0-dev/lib ...
Cf. also https://discourse.cmake.org/t/issues-with-opmi-cxx-pc-s-ldflags-and-pkg-check-modules-imported-target/9651
Thanks
Are you asking about v4.1.2 or v4.1.6? Your text mentions both.
If you're asking about v4.1.2, can you test with v4.1.6? Many things have been fixed since v4.1.2.
Sorry, that slipped through. The logs I had attached were for v4.1.6 and I just updated the description accordingly. To be clear, this isn't exactly an openmpi's issue (more like CMake not being able to handle all the cases of the functionality it claims to provide) per se, but in practice it's desirable for openmpi to adjust
Is it universally supported that
-Wl,-rpath -Wl,/some/dir
is equivalent to
-Wl,-rpath,/some/dir
?
I ask because I don't think we've ever used the form -Wl,-rpath,/some/dir. I don't know if there's a reason for that, or if we've used the other form simply because we've always used the other form.
Alternatively, is there a reason Cmake only uses the first token? Why doesn't it use all of Libs?
I ask because I don't think we've ever used the form -Wl,-rpath,/some/dir
I don't know about that but I see some matches: https://github.com/open-mpi/ompi/blob/bd33b994e1d09ab71e9bb0b66c9661623fe742a3/.ci/community-jenkins/pr-builder.sh#L192
Alternatively, is there a reason Cmake only uses the first token? Why doesn't it use all of Libs?
I'm not really familiar with CMake internals, but I can refer you to this reply: https://discourse.cmake.org/t/issues-with-opmi-cxx-pc-s-ldflags-and-pkg-check-modules-imported-target/9651/2?u=someone
Is it universally supported that
I think so but it'll take some research to confirm. Looking at man 1 ld for the GNU ld, there is a warning about prefixing args with -Wl when invoking the linker "indirectly", and about whitespaces. Should also be equivalent to -Wl,-rpath=...?
This is important, because otherwise the compiler driver program may silently drop the linker options, resulting in a bad link. Confusion may also arise when passing options that require values through a driver, as the use of a space between option and argument acts as a separator, and causes the driver to pass only the option to the linker and the argument to the compiler. In this case, it is simplest to use the joined forms of both single- and multiple-letter options, such as:
gcc foo.o bar.o -Wl,-eENTRY -Wl,-Map=a.map
pr-builder.sh
Ah, that's just a CI script -- we don't ship that. That's only used in a specific set of circumstances (vs. configure, which has to work everywhere).
Cmake only using the first token
From their reply, it looks like cmake isn't using just the first token, it does a de-duplication of all the tokens in Libs, and therefore all but one instance of -Wl,-rpath get removed. This then leads to Badness.
ld(1)
Ok. Will need to do some research here; the -Wl,-rpath -Wl,/some/dir form has worked for many years, and therefore supports both very old and modern systems. What we'd need to check is if -Wl,-rpath=/some/dir and/or -Wl,-rpath,/some/dir forms work on the oldest systems that we still care about.
I suppose a reasonable proxy might be to try on a CentOS 7 box (which is just about to go EOL in a few months), and possibly also an Ubuntu 18 box (which is already EOL, but if 2 alternate forms work there, then they probably work on all platforms that we still care about).
From their reply, it looks like cmake isn't using just the first token, it does a de-duplication of all the tokens in Libs, and therefore all but one instance of -Wl,-rpath get removed
Yes, I got that much as well
Ah, that's just a CI script -- we don't ship that. ... Will need to do some research here;
Maybe these would speed up the process, e.g. I see that the Darwin branches in contrib files come with some extra comments
❯ ag --hidden -- '-Wl,-rpath[,=]'
contrib/platform/lanl/darwin/mic-common
33:MIC_LDFLAGS="-L$XCOMPOSER/compiler/lib/mic -Wl,-rpath=$XCOMPOSER/compiler/lib/mic"
contrib/dist/linux/buildrpm.sh
260: configure_options="$configure_options --with-libfabric=$libfabric_path \"LDFLAGS=-Wl,--build-id -Wl,-rpath,$libfabric_path/lib64 -Wl,--enable-new-dtags\""
265: configure_options="$configure_options --with-libfabric=$libfabric_path \"LDFLAGS=-Wl,--build-id -Wl,-rpath,$libfabric_path/lib -Wl,--enable-new-dtags\""
3rd-party/romio341/confdb/aclocal_shl.m4
117: # As of 10.5, -Wl,-rpath,dirname should work . The dirname
119: # -Wl,-rpath,path for each of the paths in the list). However, os x
148: C_LINKPATH_SHL="-Wl,-rpath,"
257: C_LINKPATH_SHL="-Wl,-rpath,"
3rd-party/romio341/mpl/confdb/aclocal_shl.m4
117: # As of 10.5, -Wl,-rpath,dirname should work . The dirname
119: # -Wl,-rpath,path for each of the paths in the list). However, os x
148: C_LINKPATH_SHL="-Wl,-rpath,"
257: C_LINKPATH_SHL="-Wl,-rpath,"
.ci/community-jenkins/pr-builder.sh
192: CONFIGURE_ARGS="$CONFIGURE_ARGS LDFLAGS=-Wl,-rpath,/usr/local/lib/gcc5 --with-wrapper-ldflags=-Wl,-rpath,/usr/local/lib/gcc5"
also ag --hidden -- '-Wl,[a-zA-Z0-9-]+[,=]', and the github code search (matches seem to include pytorch, tf, and the kernel, which is promising; matches for -Wl,-rpath= include cpython, golang and git)
Most of what you found via ag in the OMPI tree are secondary scripts that are only run in specific cases. That being said, I found those romio scripts as well; I'd have to check the context there, but it's possible that that implies it is doing -Wl,-rpath,/path in the general case -- i.e., it fits the "it runs everywhere" criteria.
The data points for -Wl,-rpath=/blah is also quite promising; thanks!
Ugh. I'm remembering the issue here. Open MPI is not supplying those -Wl,-rpath -Wl,/path arguments -- we're directly extracting them from the Libtool that was generated for this platform.
Honestly, this is a CMake bug. Using -Wl,-rpath -Wl,/path in a .pc file's Libs entry is perfectly valid. If CMake wants to deduplicate Libs, it should do it carefully / correctly -- or not at all. Asking systems that are 2 layers down from CMake to fix the issue is not really the right solution.
Honestly, this is a CMake bug.
Absolutely, that's the message I started from 😅. I'll open a ticket in https://gitlab.kitware.com/cmake/cmake/-/issues as well as soon as I can catch a break
That said, I think playing well with CMake is fairly important even when CMake is wrong. Depending on how hard it is to rewrite libtool's flags it might be worth consideration...
This is getting absurd: there are two main MPI implementations (MPICH and Open MPI) and the other ones are likely derivatives, so if CMake is serious about MPI, they should support both without asking us to change anything.
Bonus point, fixing CMake will work with existing Open MPI implementations, not only the future ones.
Hi! This isn't about how CMake supports MPI (e.g. this is unrelated to their FindMPI.cmake, tracked separately), this is about CMake's generic pkg-config support. It's just another consumer for the flags in .pc files same as gcc or clang, and this one currently happens to not understand the space-separated linker flags (we have not found consumers who do not support the comma- or =-joined arguments yet). As I said, I'm going to file an issue upstream, only I expected that it'll take longer to process and solve (it's likely easier to write things in a simpler format, than to extend the generic parser to support one more special case)
If ompi isn't interested in supporting the (specific invocation of) pkg_check_modules that's a valid call. Shall I close the issue?
Fair enough. Can you cite where you (re?)raised the issue with CMake? I can chime in over there.
Let's leave this issue open for now; I'm disinclined to put a fix here, but I am not saying "no, absolutely not" yet. 😄
Tracking in https://gitlab.kitware.com/cmake/cmake/-/issues/25587 now!
Also the FindMPI issue I referred to earlier: https://gitlab.kitware.com/cmake/cmake/-/issues/25529
Interpreting flags is akin to tea leaf reading without being the linker/frontend itself and is why CMake has avoided it where possible. Alas, we get into these problems. I have a proposed strategy on the issue.
@mathstuf I don't disagree. 😄 We have gotten tagged with similar things over the years here in Open MPI -- we had to selectively understand a few flags so that our configure script and wrapper compilers don't do the Wrong Things. I think we had a similar journey to yours:
- we're neutral; we shouldn't understand any of the compiler / linker flags that the user is passing through us to the lower layer
- ...user reports a bug where we're just doing our own manipulations of the flags without really understanding them, and that ends up in Open MPI doing a Wrong Thing...
- ah shoot; I guess we'll have to add a few cases to grok a few specific flags so that we don't do the Wrong Things with them
- ok, it looks like those few cases seem to handle the actually-used real-world cases; we don't need to implement a full grok-all-the-flags kind of scheme (which would effectively be impossible, anyway)