Failures connecting to ldmsd w OVIS-4.2.1-rc1
I'm seeing intermittent failures while connecting to LDMS daemons in LDMS 4.2.1-rc1.
- The aggregators start showing messages like
  Error 5 in lookup callback for set 'nid07421/jobinfo'
  We've had about 1000000 of these messages in the past 24 hrs.
- There is also a regular stream of
  Producer agg1107.nid07737 rejected the connection (ugni nid07737:411)
  messages. Restarting the aggregator clears these.
- Running updtr_status on the aggregator nodes shows roughly half of the nodes on each aggregator in a CONNECTED state and half in a DISCONNECTED state. The number of connected and disconnected nodes is not static, but increases and decreases over time.
- Connections to the aggregator nodes via ldmsctl and ldms_ls fail randomly a few times per minute.
- I'm experiencing trouble connecting via ldms_ls from some nodes. This problem is consistent, i.e. it happens every time I try to connect.
boot-cori:~ # ssh mom2 ZAP_UGNI_COOKIE=0x876543 ldms_ls -h mom4 -x ugni -p 412 -a munge | head -2
nid13054/vmstat
nid13054/procstat
boot-cori:~ # ssh mom1 ZAP_UGNI_COOKIE=0x876543 ldms_ls -h mom4 -x ugni -p 412 -a munge | head -2
Warning: Unable to initialize DLA, GNI_RC_ERROR_RESOURCE at line 506 in file cdm.c
zap_ugni: ERROR: GNI_CdmAttach() failed: GNI_RC_ERROR_RESOURCE
ldms: Cannot get zap plugin: ugni
Error creating transport.
This started after I restarted one of the daemons while I was polling it via the command line clients. The node running the clients is no longer able to connect to ldms.
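To see which sets account for the lookup callback errors, the messages could be tallied per set. This is a minimal sketch, assuming the aggregator log contains lines in exactly the form quoted above; the function name tally_lookup_errors is made up for illustration and is not part of LDMS.

```shell
# Hypothetical helper (not an LDMS tool): count "lookup callback" errors
# per set name. Reads the files given as arguments, or stdin if none.
tally_lookup_errors() {
    # Keep only the matching part of each line, e.g.
    #   Error 5 in lookup callback for set 'nid07421/jobinfo'
    grep -h -o "Error [0-9][0-9]* in lookup callback for set '[^']*'" "$@" |
        # Reduce each match to the quoted set name.
        sed "s/.*for set '\(.*\)'/\1/" |
        # Count occurrences per set, most frequent first.
        sort | uniq -c | sort -rn
}
```

It can be pointed at wherever the aggregator writes its log, for example `tally_lookup_errors "$LOGFILE" | head`, where LOGFILE is a placeholder for the site's actual log path. A tally dominated by a few nodes would suggest a per-node problem rather than an aggregator-wide one.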
Here's some information that might be helpful:
- A uGNI "connection" as seen by the peers is over a socket. There is no uGNI resource that is allocated at connect time. A failure to connect is typically either no one is listening, an authentication failure or the process ran out of file descriptors. You might check /proc/
/fd and see how many files are there vs. your ulimit -n.
- That said, "rejected" implies an authentication failure. You might check the auth configuration on that node.
- The GNI_CdmAttach occurs when the uGNI plugin is loaded. This only happens once for a process. When it fails, it's usually because you don't have permission, or the cookie is wrong. Also, try adding ZAP_UGNI_PTAG=0 along with the cookie. If it is inadvertently set to a nonzero value in the environment, it can confuse the transport into thinking it's Gemini instead of Aries.
- The lookup callback error is occurring because the RDMA_READ of the metric set metadata is failing. It would be interesting to know what the uGNI error actually was. I will look at the code.
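The file descriptor and environment checks suggested above could be scripted roughly as follows. This is only a sketch, assuming a Linux node with /proc mounted; the function names fd_usage and show_zap_ugni_env are made up for illustration, and reading another user's /proc entries may require root.

```shell
# Hypothetical helper: compare a process's open descriptor count with
# its soft "Max open files" limit. Linux-only (relies on /proc).
fd_usage() {
    pid=$1
    # Each entry in /proc/PID/fd is one open descriptor.
    open=$(ls "/proc/$pid/fd" | wc -l)
    # The limits file has lines of the form:
    #   Max open files   <soft>   <hard>   files
    limit=$(awk '/^Max open files/ {print $4}' "/proc/$pid/limits")
    echo "pid=$pid open_fds=$open soft_limit=$limit"
}

# Hypothetical helper: list any ZAP_UGNI_* variables in the current
# environment, to spot an inadvertently set ZAP_UGNI_PTAG.
show_zap_ugni_env() {
    env | grep '^ZAP_UGNI_' || echo "no ZAP_UGNI_* variables set"
}
```

For a running daemon this might be called as `fd_usage "$(pgrep -o ldmsd)"`, where `pgrep -o` picks the oldest matching process. An open_fds value close to soft_limit would point at descriptor exhaustion rather than authentication.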
@eric-roman @tom95858 Is this still relevant in 4.3.3?
It's not clear yet. We're going to deploy 4.3.3 at scale in a few weeks.
It is a limitation of the uGNI transport when used from a Slurm plugin. The uGNI transport setup performed by the Slurm plugin collides with the application's own uGNI setup, causing both to fail. This should be documented as a limitation of the spank plugin.
What's the relationship to the slurm plugin? This issue occurs in 4.2.1, which lacks the newer slurm plugin, doesn't use the ugni transport in the jobinfo spank plugin, and the jobinfo plugin wasn't used at this site at the time the issue was opened.