__ldms_zap_get hardcodes plugin names, breaks fabric transport
Plugins other than rdma, sock, and ugni are disallowed in __ldms_zap_get. This prevents the fabric transport plugin (needed for omnipath) and the test plugin from being used, it would seem.
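For illustration, a minimal hypothetical sketch of the kind of hardcoded whitelist being described; the names and surrounding code are assumptions, not the actual __ldms_zap_get source:

```c
#include <string.h>

/* Hypothetical sketch only: a fixed whitelist like this rejects any zap
 * transport plugin that is not rdma, sock, or ugni, even if the plugin
 * (e.g. zap_fabric for omnipath, or a test transport) is installed. */
static const char *known_xprts[] = { "sock", "rdma", "ugni", NULL };

static int xprt_name_allowed(const char *name)
{
	for (int i = 0; known_xprts[i]; i++)
		if (strcmp(name, known_xprts[i]) == 0)
			return 1;
	return 0; /* "fabric" and "test" fall through to here */
}

/* The fix discussed later in this thread is to drop the whitelist and simply
 * attempt to load libzap_<name>, letting the loader fail for unknown names. */
```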
@tom95858 I would like to be testing the fabric xprt plugin this week on stria and omnipath. Does OGC have a fix for this issue, or should I make one?
I believe that @narategithub has a branch that has this fix in it.
The change is part of https://github.com/ovis-hpc/ovis/commit/e74153b165917bd8abfc3bb6d34deefafd6d10c4 in v5 (master). Let me port it over so that the fabric transport has a way to specify provider and domain.
@baallan I'd just pushed v4-ldms-fabric to my repo on gitlab (https://gitlab.opengridcomputing.com/narate/ovis/-/commits/v4-ldms-fabric). Could you please give it a try? Thx
@narategithub I mailed you dumps from a crash in 1st connection attempt.
Thanks, I'll take a look.
@baallan I'd just added a fix for the buffer overrun problem you found && rebased on top of OVIS-4 && (force) pushed to my repo on gitlab on the same branch (v4-ldms-fabric). Could you please give it another try?
will do
sent email with more recent dumps. still trying to get gdb at it.
sent email with gdb dump
@narategithub anything new for me to test yet in omnipath?
@baallan
Unfortunately, no updates yet. However, I had just thought of something. I'm guessing that your test uses the '-x' option and not the `listen` command. So, could you please give a `listen` config on a specific IP address as follows:
listen xprt=fabric port=411 host=<omni-path-IP-addr>
load name=meminfo
...
And use `ldms_ls -x fabric -p 411 -h <omni-path-IP-addr>`.
I tried the IP-addr specific listen/connect with libfabric on iWarp (verbs provider) and eth (socket provider) and they worked for me.
The next step would be to set up a Zoom session for a live debug?
Hi @baallan, @narategithub, @oceandlr: What is the status of this? Is it possible for us to host a Zoom session to get this resolved?
We had a session today and narate has a new theory to work on.
@tom95858 You should leave me out of a session if that's the way to go. I have no technical insights on this one.
@tom95858
Ben tested my branch narate/v4-ldms-fabric. In short, ldms over fabric works when listening on a specific address (using `listen xprt=fabric port=BLA host=OMNIPATH_IP_ADDR`).
During the session, we noticed a couple of nits, and Ben and I agreed to add the following:
- Modify `-x` to receive "XPRT:PORT:ADDR", where the ":ADDR" part is optional, so the listen address can be specified on the CLI option while staying backward compatible (see the sketch after this list).
- Fix zap_fabric nits to make it report (at INFO level) the fabric provider it is using. In doing so, I found out that the current zap_fabric in my v4 branch only supports one fabric. I'll bring the multiple-fabric part from v5 into it and test tonight.
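A rough sketch of the optional-ADDR parsing for the first item above; the helper name and the 0.0.0.0 default are assumptions for illustration, not the actual ldmsd option code:

```c
#include <stdio.h>
#include <string.h>

/* Parse "XPRT:PORT[:ADDR]" into its parts; returns 0 on success.
 * Hypothetical helper, shown only to illustrate the proposed syntax. */
int parse_xprt_spec(const char *spec, char *xprt, size_t xlen,
		    char *port, size_t plen, char *addr, size_t alen)
{
	char buf[256];
	snprintf(buf, sizeof(buf), "%s", spec);

	char *save = NULL;
	char *x = strtok_r(buf, ":", &save);
	char *p = strtok_r(NULL, ":", &save);
	char *a = strtok_r(NULL, ":", &save); /* may be NULL: ADDR is optional */

	if (!x || !p)
		return -1;
	snprintf(xprt, xlen, "%s", x);
	snprintf(port, plen, "%s", p);
	snprintf(addr, alen, "%s", a ? a : "0.0.0.0"); /* default: all interfaces */
	return 0;
}
```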
I'll post the following pull requests (by topic) tonight:
i) making ldms xprt not fixed to a known list
ii) ldmsd listen with a specific address
iii) zap_fabric changes to support multiple fabrics (remove the global g.fabric in zap_fabric).
For transport:port:addr tuples, it should also accept transport:port:host and do the local IP address lookup. I wonder if there's something we should be making consistent with how sosdb parses transport connection tuples from user input.
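A minimal sketch of the local lookup suggested above, assuming a plain getaddrinfo() resolution of the host element; the helper is hypothetical, not existing ldmsd or sosdb code:

```c
#include <sys/socket.h>
#include <netdb.h>

/* Resolve a host name (e.g. the node's omnipath-facing hostname) to a
 * numeric IP string that the listen/bind path could then use. */
int host_to_ipstr(const char *host, char *ip, socklen_t iplen)
{
	struct addrinfo hints = { .ai_family = AF_INET, .ai_socktype = SOCK_STREAM };
	struct addrinfo *res = NULL;

	if (getaddrinfo(host, NULL, &hints, &res) || !res)
		return -1;
	int rc = getnameinfo(res->ai_addr, res->ai_addrlen,
			     ip, iplen, NULL, 0, NI_NUMERICHOST);
	freeaddrinfo(res);
	return rc ? -1 : 0;
}
```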
Hi @baallan, @narategithub:
WRT adding the optional transport:port:host: it should work one way and not require logic to try to determine whether the 2nd element is a port or a host. My view is to keep it simple.
Also, I'm not convinced we've really worked this out completely. If the user specifies fabric, then we don't want the socket transport, which I think is what is happening here, i.e. libfabric is picking the IP interface instead of the omnipath one. The reason why your recommended approach works is that specifying the host address forces libfabric's logic to pick the endpoint that corresponds to that IP address, which turns out to be omnipath. I don't think a user who selects 'fabric' expects or wants it to ultimately run over an IP interface.
I recommend that we dig into this a bit deeper and understand how interfaces are selected with libfabric. It is certainly the case that 0.0.0.0 will cause problems.
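For context on how libfabric picks endpoints: the selection is driven by fi_getinfo() hints. A standalone sketch (not the zap_fabric code) that asks only for endpoints bindable to a given source address, which is effectively what the host= workaround forces:

```c
#include <stdio.h>
#include <rdma/fabric.h>

int main(int argc, char **argv)
{
	struct fi_info *hints = fi_allocinfo();
	struct fi_info *info = NULL, *p;

	hints->ep_attr->type = FI_EP_MSG; /* connection-oriented, like zap */
	hints->caps = FI_MSG | FI_RMA;

	/* Passing the omnipath-side address as 'node' with FI_SOURCE restricts
	 * the results to endpoints that can bind to that address, instead of
	 * whatever interface the providers enumerate first. */
	int rc = fi_getinfo(FI_VERSION(1, 5), argc > 1 ? argv[1] : NULL, NULL,
			    FI_SOURCE, hints, &info);
	if (rc) {
		fprintf(stderr, "fi_getinfo failed: %d\n", rc);
		return 1;
	}
	for (p = info; p; p = p->next)
		printf("provider=%s domain=%s\n",
		       p->fabric_attr->prov_name, p->domain_attr->name);

	fi_freeinfo(info);
	fi_freeinfo(hints);
	return 0;
}
```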
Re the xprt:port:host: we thought to keep it simple (not consuming another of the scarce one-letter options on the command line). This may not be optimal if we want to support cases (such as login nodes) where one wants to support listening:
- sock on some ip addresses/hostnames and not others
- rdma on omnipath but not mlx or the other way around.
- rdma on a particular port (which generally will have a specific host name) and not another port. Maybe we just have to have another option.
In the particular case of omnipath, a given hca/hfi only ever has a single port (no dual-port cards exist), but it's conceivable a node might have two separate cards. I don't know how any HSN library/utility can make any automatic choice other than "first port seen of a given type", which is unlikely to be a universal solution. For our omnipath systems in particular, we have nodes that are connected to both omnipath and mellanox, so we need to be able to specify a daemon correctly. (I don't currently anticipate wanting a daemon that listens on both omnipath and mlx simultaneously, but I may need to choose.)
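If a daemon on such a dual-connected node ever has to be pinned to one fabric, libfabric also accepts a provider-name filter in the fi_getinfo() hints. A hedged sketch; the provider names ("psm2" for Omni-Path, "verbs" for Mellanox) are assumptions here, not something taken from ldmsd:

```c
#include <string.h>
#include <rdma/fabric.h>

/* Sketch: restrict fi_getinfo() results to a single provider so a node with
 * both Omni-Path and Mellanox HCAs only offers the intended fabric. */
struct fi_info *get_provider_info(const char *prov)
{
	struct fi_info *hints = fi_allocinfo();
	struct fi_info *info = NULL;

	hints->fabric_attr->prov_name = strdup(prov); /* e.g. "psm2" or "verbs" */
	hints->ep_attr->type = FI_EP_MSG;
	if (fi_getinfo(FI_VERSION(1, 5), NULL, NULL, 0, hints, &info))
		info = NULL;
	fi_freeinfo(hints); /* also frees the duplicated prov_name */
	return info;        /* caller must fi_freeinfo(info) when done */
}
```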
The recently merged addition of a host field on the -x option enables data transport all the way to the store and ldms_ls to work. I'm reviewing a long list of valgrind messages, but these may be due to rdma buffers being seen as uninitialized.
@narategithub it looks like destroy for fabric connections in both ldms_ls and the aggregator either isn't called or somehow fails to call fi_freeinfo, resulting in a large apparent per-connection leak. I'll send the valgrind logs to you.
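For reference, the cleanup pairing being described looks roughly like this; a general-pattern sketch, not the actual zap_fabric teardown path:

```c
#include <rdma/fabric.h>
#include <rdma/fi_endpoint.h>

/* Every fi_getinfo() result should be released with fi_freeinfo() when the
 * endpoint built from it goes away; skipping that on each connection would
 * show up as the per-connection growth valgrind reports. */
static void ep_teardown(struct fid_ep *ep, struct fi_info *fi)
{
	if (ep)
		fi_close(&ep->fid);
	if (fi)
		fi_freeinfo(fi);
}
```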
@narategithub @tom95858 is this issue resolved? The original bug (hardcoded plugin names) has been fixed, and my impression is that libfabric is off the table for now as we try to get omnipath/rdma working right.