ompi icon indicating copy to clipboard operation
ompi copied to clipboard

test/datatype: do not use installed components

Open ggouaillardet opened this issue 1 year ago • 10 comments

By default, the tests will use the components in the build directory but also in the installation directory if any. That can cause some bizarre issues when they are not compatible.

Thanks Kook Jin Noh for the initial bug report.

Refs. open-mpi/ompi#12282

ggouaillardet avatar Jan 28 '24 04:01 ggouaillardet

It took me a while to understand what this change is doing on the tests. We should document this change better.

Btw, how are the tests using the components in the build directory ? We load them using the install_dir, we don't even know the build path for each component. If we include everything into libmpi.so then this should not be a problem, but if we build shared library components we default to first opening the install path location components.

bosilca avatar Jan 29 '24 19:01 bosilca

Well, shame on me!

the tests do not use the components from the build directory, and with this patch, it does not open any component at all, which happens to be ok for test/datatype. It still tries to open the openmpi-mca-params.conf from the install directory. export OMPI_MCA_mca_base_param_files= causes mpirun to crash (likely a bug) but using a bogus value can do the trick.

I will double check that tomorrow and fix the commit message.

ggouaillardet avatar Jan 30 '24 15:01 ggouaillardet

the tests do not use the components from the build directory,

It's quite possible that we don't have any make check tests that use MCA components. I think most (all?) of our make check stuff was meant to be for the "core" library functionality -- at least back in the beginning of the project. That attitude may or may not be a good one to have all these years later, but given that we haven't invested much in our make check infrastructure, it's quite possible (likely) that it still reflects our attitudes from way back early in the project.

jsquyres avatar Jan 30 '24 19:01 jsquyres

Thanks @rhc54 for fixing the param file issue in openpmix/openpmix#3273, I will backport the fix to Open MPI.

ggouaillardet avatar Jan 31 '24 13:01 ggouaillardet

I am not completely convinced by this. Preventing the use of components from an old installation (especially one that is not binary compatible internally), is a desired features, but this way of doing it is a little too strict.

On a default build (all components included in the libmpi.so) this approach works by preventing OMPI from loading unnecessary/older components. But OMPI will be able to use all the static components that are part of the MPI shared library.

But in the case where we build with full support for shared libraries, aka. all components are in separate .so, setting an empty MCA component path, we will prevent OMPI from loading any components. Even the ones that are required (if, accelerator, to name just a few). As a result, many tests (or at least all datatype tests) will segfault, as they require at least the NULL accelerator component.

The issue #12282 talks about OMPI 5.x loading components from an OMPI 4.1 installation. I don't understand how that can happen because I was under the impression that we prevent the opening of components with an incorrect version.

bosilca avatar Feb 01 '24 00:02 bosilca

I just found that the v4 and v5 series use the same MCA version (2.1.0) ! I do not know whether this is intended or something we forgot.

The issue is triggered by the mca_pmix_ext3x.so component. It registers the library_version parameter, but mca_pmix_ext3x.so is dlclose()'d at the end before destructing the MCA parameters, and this is when the crash occurs.

My theory is that since PMIx is a first class citizen from the v5 series, no PMIx component is selected (nor expected), and hence the library_version parameter is not removed before dlclose() is invoked (also, I guess that should occur in the initialization phase instead of the finalization). A possible fix to that is to flag the pmix framework as "component-less" and do not register components (or do not even open them in the first place). FWIW, I made an ugly proof-of-concept and it worked.

You are definitely correct about static vs dynamic components.

In the example you described, I guess make check would always fail unless make install was previously invoked, so I would always build the NULL accelerator component statically. As a side effect, it would not invalidate the approach of this PR.

ggouaillardet avatar Feb 01 '24 07:02 ggouaillardet

The issue is triggered by the mca_pmix_ext3x.so component. It registers the library_version parameter, but mca_pmix_ext3x.so is dlclose()'d at the end before destructing the MCA parameters, and this is when the crash occurs.

Hmmm...I wonder if it would make sense to add a check in the ext3x component so it deregisters its params if dlopen of the PMIx library fails? In other words, check to see if it finds the PMIx lib and then deselect itself if not?

Won't help the OMPI v4 stuff that is out in the wild, but maybe help down the road?

FWIW: the MCA version didn't change because there was no modification to the MCA interface itself, so the plugins are in fact compatible.

rhc54 avatar Feb 01 '24 14:02 rhc54

I do not think the pmix/ext3x component (tries to) dlopen() the PMIx library.

From the point of view of Open MPI v5, the pmix/ext3x component is successfully loaded, but it never gets selected (because the pmix framework does not need components any more). And because to pmix component evers gets selected, it is never deselected until the very end.

ggouaillardet avatar Feb 01 '24 14:02 ggouaillardet

Given that fixing OMPI v4 won't help with all the outstanding installations, maybe the right (but perhaps ugly) solution is to add some in the OMPI v5 mca base that just says "if you find a PMIx component, ignore it". I believe that is the solution you prototyped, yes? Makes the best sense to me!

rhc54 avatar Feb 01 '24 14:02 rhc54

One of the root causes of this is that we dlclose a component where we successfully called _init, so we basically never give the component a chance to cleanup behind them. Calling _close will allow the the deregistration of MCA variables, and avoid the segfault.

However, other issues persist:

  1. Why did OMPI 5.l0 loaded a component for a framework that does not exists in this version ? As @ggouaillardet indicated above PMIX is now a base component, there is no reason to try to slurp in other PMIX like components.
  2. The current protection for loading MCA modules is too weak (checking only for component ABI does not guarantee that these components will find everything they need from the library). At least the check should be symmetric, i.e make sure the component was built for the right MPI library. Then the question become "how much we limit the validity of our components ?"

@ggouaillardet this approach will not solve the problem as non-yet-installed components. In fact the only solution to it would be to always build the accelerator_null component statically. We'll chat about this on our slack channel and come back with a solution.

bosilca avatar Feb 01 '24 15:02 bosilca