Failures connecting to ldmsd w OVIS-4.2.1-rc1

Open eric-roman opened this issue 5 years ago • 5 comments

I'm seeing intermittent failures while connecting to LDMS daemons in LDMS 4.2.1-rc1.

  1. The aggregators start showing messages like "Error 5 in lookup callback for set 'nid07421/jobinfo'". We've had about 1,000,000 of these messages in the past 24 hours.

  2. There is also a regular stream of "Producer agg1107.nid07737 rejected the connection (ugni nid07737:411)" messages. Restarting the aggregator clears these.

  3. Running updtr_status on the aggregator nodes shows roughly half of the nodes on each aggregator in a CONNECTED state and half in a DISCONNECTED state. The number of connected and disconnected nodes is not static, but increases and decreases over time.

  4. Connections to the aggregator nodes via ldmsctl and ldms_ls fail randomly a few times per minute.

  5. I'm experiencing trouble connecting via ldms_ls from some nodes. This problem is consistent, i.e. it happens every time I try to connect.

boot-cori:~ # ssh mom2 ZAP_UGNI_COOKIE=0x876543 ldms_ls -h mom4 -x ugni -p 412 -a munge | head -2
nid13054/vmstat
nid13054/procstat

boot-cori:~ # ssh mom1 ZAP_UGNI_COOKIE=0x876543 ldms_ls -h mom4 -x ugni -p 412 -a munge | head -2
Warning: Unable to initialize DLA, GNI_RC_ERROR_RESOURCE at line 506 in file cdm.c
zap_ugni: ERROR: GNI_CdmAttach() failed: GNI_RC_ERROR_RESOURCE
ldms: Cannot get zap plugin: ugni
Error creating transport.

This started after I restarted one of the daemons while I was polling it via the command line clients. The node running the clients is no longer able to connect to ldms.
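To put numbers on items 1 and 2, the two message shapes quoted above can be tallied from an aggregator log. The following is only a minimal sketch: the regular expressions are derived from the two example messages in this report (the real log lines may carry prefixes or differ in detail), and `tally` is a hypothetical helper name, not part of LDMS.

```python
import re
from collections import Counter

# Patterns modeled on the two messages quoted in this report.
# Assumption: actual log lines contain these substrings unchanged.
LOOKUP_RE = re.compile(r"Error (\d+) in lookup callback for set '([^']+)'")
REJECT_RE = re.compile(
    r"Producer (\S+) rejected the connection \((\S+) ([^:\s]+):(\d+)\)"
)


def tally(lines):
    """Count lookup errors per set name and rejections per producer name."""
    lookup_errors = Counter()
    rejections = Counter()
    for line in lines:
        match = LOOKUP_RE.search(line)
        if match:
            lookup_errors[match.group(2)] += 1
            continue
        match = REJECT_RE.search(line)
        if match:
            rejections[match.group(1)] += 1
    return lookup_errors, rejections
```

Feeding it the lines of a log file (for example `tally(open(path))`, with `path` pointing at wherever the aggregator logs) and printing `most_common(10)` on each counter would show whether the errors cluster on a few sets or producers or are spread evenly.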

eric-roman avatar Mar 15 '19 23:03 eric-roman

Here's some information that might be helpful:

  • A uGNI "connection" as seen by the peers is over a socket. There is no uGNI resource that is allocated at connect time. A failure to connect typically means that no one is listening, that authentication failed, or that the process ran out of file descriptors. You might check /proc//fd and see how many files are there vs. your ulimit -n.
  • That said, "rejected" implies an authentication failure. You might check the auth configuration on that node.
  • The GNI_CdmAttach call occurs when the uGNI plugin is loaded. This only happens once per process. When it fails, it's usually because you don't have permission or the cookie is wrong. Also, try adding ZAP_UGNI_PTAG=0 along with the cookie. If it is inadvertently set to something nonzero in the environment, it can confuse the transport into thinking it's Gemini instead of Aries.
  • The lookup callback error is occurring because the RDMA_READ of the metric set metadata is failing. It would be interesting to know what the uGNI error actually was. I will look at the code.
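Two of the checks above can be scripted. This is a rough sketch under stated assumptions, not an LDMS tool: it assumes a Linux host, assumes the path meant above is the per-process directory `/proc/<pid>/fd`, and reads the soft limit from `/proc/<pid>/limits` rather than from `ulimit -n` so that it reflects the daemon and not the calling shell. The function names are made up for illustration.

```python
import os


def parse_max_open_files(limits_text):
    """Return the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns None if the limit is 'unlimited' or the row is missing.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            soft = line[len("Max open files"):].split()[0]
            return None if soft == "unlimited" else int(soft)
    return None


def fd_usage(pid):
    """Return (open descriptor count, soft limit) for a process (Linux only)."""
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as f:
        soft_limit = parse_max_open_files(f.read())
    return open_fds, soft_limit


def ptag_is_nonzero(env):
    """True if ZAP_UGNI_PTAG is set to anything other than zero.

    An unparsable value is also reported as nonzero, since it is not a
    clean 0 either.
    """
    value = env.get("ZAP_UGNI_PTAG")
    if value is None:
        return False
    try:
        return int(value, 0) != 0  # base 0 accepts forms such as 0x0
    except ValueError:
        return True
```

Calling `fd_usage` with the pid of the ldmsd process (reading another user's `/proc` entries generally needs matching privileges) shows how close the daemon is to its limit, and `ptag_is_nonzero(os.environ)` flags a stray PTAG setting in the environment the client is launched from.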

tom95858 avatar Mar 16 '19 15:03 tom95858

@eric-roman @tom95858 Is this still relevant in 4.3.3?

oceandlr avatar Nov 17 '19 15:11 oceandlr

It's not clear yet. We're going to deploy 4.3.3 at scale in a few weeks.

eric-roman avatar Nov 18 '19 15:11 eric-roman

It is a limitation of the uGNI transport when used from a slurm plugin. The setup of the uGNI transport from the slurm plugin collides with the application's setup of the same transport, causing both to fail. This should be documented as a limitation of the spank plugin.

tom95858 avatar Nov 18 '19 18:11 tom95858

What's the relationship to the slurm plugin? This issue occurs in 4.2.1, which lacks the newer slurm plugin and doesn't use the ugni transport in the jobinfo spank plugin. The jobinfo plugin also wasn't in use at this site when the issue was opened.

eric-roman avatar Nov 19 '19 20:11 eric-roman