sst-elements

External Element Fails to Load with 'mpirun'

Open mrutt92 opened this issue 1 year ago • 8 comments

I have an external element that uses memHierarchy and in fact subclasses a couple of memHierarchy components. It works fine in single-node simulation, but it fails when I launch it with mpirun.

So for example, if I do this:

mpirun -n 2 sst test-for-my-element.py

I get an error like this:

FATAL: [1:0] SST Core: can't find requested component 'MyElement.MyComponent'
Error: unable to find "MyElement" element library
SST-DL: Loading failed for /AGILE/MyElement/element/libMyElement.so, error: /AGILE/MyElement/element/libMyElement.so: undefined symbol: _ZTIN3SST12MemHierarchy12SimpleMemoryE

I am running this on CentOS 7 with Boost 1.82.0 and MPICH v3.3.3.1.

sst-core: 756fee8a sst-elements: branched from 470cc64ac

Has anyone seen anything like this, and do you have advice on how to resolve it?
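For readers following along: the undefined symbol in the log is an Itanium C++ ABI mangled name, and decoding it shows which library is missing at link time. Running `c++filt` on it is the usual route; the snippet below is only a minimal sketch that handles this one shape of symbol (a special-name prefix followed by a nested name of length-prefixed components). The function name `demangle_nested` and the small prefix table are illustrative assumptions, not part of any SST or compiler tooling.

```python
import re

# A few Itanium C++ ABI "special name" prefixes (deliberately not exhaustive).
SPECIAL = {
    "_ZTI": "typeinfo for ",
    "_ZTV": "vtable for ",
    "_ZTS": "typeinfo name for ",
}


def demangle_nested(symbol: str) -> str:
    """Decode symbols shaped like <prefix> N <len><name>... E (hypothetical helper)."""
    for prefix, label in SPECIAL.items():
        if symbol.startswith(prefix + "N") and symbol.endswith("E"):
            body = symbol[len(prefix) + 1:-1]  # strip prefix, 'N' and trailing 'E'
            break
    else:
        raise ValueError(f"unsupported symbol form: {symbol}")

    parts = []
    pos = 0
    while pos < len(body):
        m = re.match(r"\d+", body[pos:])
        if m is None:
            raise ValueError(f"expected a length at offset {pos} in {body!r}")
        length = int(m.group())
        start = pos + m.end()
        if start + length > len(body):
            raise ValueError("component length runs past the end of the symbol")
        parts.append(body[start:start + length])
        pos = start + length
    return label + "::".join(parts)


print(demangle_nested("_ZTIN3SST12MemHierarchy12SimpleMemoryE"))
```

The decoded name is the typeinfo for `SST::MemHierarchy::SimpleMemory`, which points at the memHierarchy library rather than at MPI itself.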

mrutt92 avatar Jul 28 '23 17:07 mrutt92

OK. I realized I wasn't linking my element .so against memHierarchy (-lmemHierarchy, with rpath and -L set properly). But now sst has trouble finding merlin components (which I only use in the script; I do not extend any merlin components).

I see this when I run with 'mpirun -n 4 sst test-for-my-element.py':

FATAL: [3:0] SST Core: can't find requested component 'merlin.Bridge'

For some reason, if I run with mpirun -n 2, I don't see the issue.

mrutt92 avatar Jul 28 '23 18:07 mrutt92

So... I am using the memNetBridge component, which uses merlin. I'm looking at Makefile.am and I don't see any flags explicitly linking memHierarchy with merlin.

mrutt92 avatar Jul 28 '23 19:07 mrutt92

Update: I tried OpenMPI v4.1.5 instead of MPICH and I'm seeing the same issue.

mrutt92 avatar Jul 28 '23 20:07 mrutt92

The Bridge class in Merlin is a header-only component, so there is no need for memHierarchy to link in Merlin. We’ll take a closer look at the bridge components and see what we can find.

One issue may be that the Bridge component isn’t ending up in the Merlin element library: because it is header-only, it only gets compiled into the memHierarchy library (due to the inherited class in memH). In the 4-rank run, does rank 3 have any memHierarchy components in it? You can get the partitioning by running with --output-partition=file_name. It may also fail if the Bridge component is referenced in that rank before a memHierarchy component, because the memH library wouldn’t be loaded yet. I’ll take a closer look at what ends up in libmerlin, but I suspect this is the most likely issue.

feldergast avatar Aug 02 '23 19:08 feldergast

I have confirmed that the Bridge object is not in libmerlin.so and that it is in libmemHierarchy.so. This is possible because ELI (element library info) is structured such that the element says what library it's in since there's no way to know by default what library it belongs to. So, as long as the memHierarchy library loads before the merlin.Bridge component is created things will work. In practice this means that if on any given rank a memH component is constructed before the merlin.Bridge object, it will work, otherwise it will fail. I'll need to give some thought to what the best way to fix this is.
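To make the load-order dependence concrete, here is a toy Python model of the behavior described above. It is a sketch under stated assumptions, not SST's loader: the `Loader` class, `rank_succeeds`, and the `LIBRARY_CONTENTS` table are hypothetical, and the element names other than `merlin.Bridge` are only placeholders. The one property it mirrors is that an element registers under the library name it declares, regardless of which shared object actually contains it.

```python
# Toy model only; all names here are hypothetical and this is NOT SST's code.
# Each shared object lists the (library, element) names its entries declare.
LIBRARY_CONTENTS = {
    "libmerlin.so": [("merlin", "hr_router")],
    # Bridge declares itself as "merlin" but is compiled into memHierarchy.
    "libmemHierarchy.so": [("memHierarchy", "MemController"), ("merlin", "Bridge")],
}


class Loader:
    def __init__(self):
        self.loaded = set()    # shared objects loaded so far on this rank
        self.registry = set()  # (library, element) pairs registered so far

    def _load(self, lib_file):
        if lib_file in self.loaded:
            return
        self.loaded.add(lib_file)
        self.registry.update(LIBRARY_CONTENTS.get(lib_file, []))

    def create(self, full_name):
        lib, element = full_name.split(".", 1)
        if (lib, element) not in self.registry:
            # Only the library named in the request is loaded on demand.
            self._load(f"lib{lib}.so")
        if (lib, element) not in self.registry:
            raise LookupError(f"can't find requested component '{full_name}'")
        return full_name


def rank_succeeds(components):
    """Return True if one rank can construct its components in the given order."""
    loader = Loader()
    try:
        for name in components:
            loader.create(name)
    except LookupError:
        return False
    return True


# memH component first: libmemHierarchy.so is loaded, so Bridge is registered.
print(rank_succeeds(["memHierarchy.MemController", "merlin.Bridge"]))  # True
# Bridge first: only libmerlin.so is loaded, and it does not contain Bridge.
print(rank_succeeds(["merlin.Bridge", "memHierarchy.MemController"]))  # False
```

In this model a rank fails exactly when it asks for `merlin.Bridge` before anything has pulled in the memHierarchy library, which is consistent with the failure showing up only on some ranks and only for some partitionings.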

feldergast avatar Aug 02 '23 21:08 feldergast

The right solution would seem to be moving the implementation of merlin.Bridge into a .cc file in Merlin so that the element exists in the merlin library. This should fix your problem. We'll put in a fix for that.

feldergast avatar Aug 03 '23 20:08 feldergast

@feldergast @mrutt92 Is this issue fixed/can we close it?

gvoskuilen avatar Oct 16 '23 21:10 gvoskuilen

This should have been fixed with PR #2197. @mrutt92, can you confirm whether this fixed your issue?

feldergast avatar Oct 16 '23 22:10 feldergast