File descriptors leaking
I'm seeing our ldms aggregators leak file descriptors. We have an aggregator fanout of 2500:1 over the ugni transport.
We're running from an RC (commit 9f4774ca), but I was able to reproduce the problem with OVIS-4.3.3 as well.
From strace at the time the leaks occur we see the following. (And we don't see this trace when file descriptors aren't leaking, AFAICT.)
[pid 58960] 14:10:35 socket(AF_INET, SOCK_STREAM, IPPROTO_IP <unfinished ...>
[pid 58960] 14:10:35 <... socket resumed> ) = 75012
[pid 58960] 14:10:35 fcntl(75012, F_GETFL <unfinished ...>
[pid 58960] 14:10:35 <... fcntl resumed> ) = 0x2 (flags O_RDWR)
[pid 58960] 14:10:35 fcntl(75012, F_SETFL, O_RDWR|O_NONBLOCK <unfinished ...>
[pid 58960] 14:10:35 <... fcntl resumed> ) = 0
[pid 58960] 14:10:35 connect(75012, {sa_family=AF_INET, sin_port=htons(411), sin_addr=inet_addr("10.128.39.42")}, 16 <unfinished ...>
[pid 58960] 14:10:35 <... connect resumed> ) = -1 EINPROGRESS (Operation now in progress)
Tracing down, I expect this occurs in z_ugni_connect, since there are calls to fcntl there followed by a call to connect. (The socket transport does this too, but we don't use it.)
(Perhaps a new connection is being established while a previous one is still in progress and never completes?)
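For context, here is a minimal C sketch of the pattern the strace suggests. This is illustrative only, not the actual z_ugni_connect code; the function and variable names are made up. The point is that the descriptor created by socket() has to be closed on every failure or abandonment path, or each retry leaks one fd.

```c
/* Minimal sketch of the suspected leak pattern (illustrative, not the
 * real z_ugni_connect): a non-blocking socket is created and connect()
 * is started, but if the attempt fails or is abandoned the descriptor
 * is never closed. */
#include <fcntl.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>
#include <errno.h>

static int start_connect(const struct sockaddr_in *sa)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	int fl = fcntl(fd, F_GETFL);
	fcntl(fd, F_SETFL, fl | O_NONBLOCK);

	int rc = connect(fd, (const struct sockaddr *)sa, sizeof(*sa));
	if (rc && errno != EINPROGRESS) {
		/* Synchronous failure: the fd must be closed here,
		 * otherwise it leaks on every retry. */
		close(fd);
		return -1;
	}
	/* EINPROGRESS: completion is reported later by the event loop.
	 * If this attempt is abandoned (e.g. the peer is down and a new
	 * attempt is started), close(fd) must still happen somewhere,
	 * or each retry leaks one descriptor. */
	return fd;
}
```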
@eric-roman this looks like an error-path bug. I expect that we're leaking these fds when we attempt to reconnect to a node that is down. So basically 1 fd every 20 s for each down node over ugni.
How can we make progress on this? Our aggregators leaked about 7400 descriptors per hour today. I've been restarting the aggregators every 6 hrs to work around the problem.
Hi @eric-roman. Could you please pull master and see if the fix works for you? Synchronous connect errors on both the sock and ugni transports were leaking fds. If that's the path that was leaking on your system, the change should fix the issue. I believe that you only need to update the aggregator to test the change.
The fix for OVIS-4 is now on the OVIS-4 branch instead of master.
Ok, I see from your log of the error that it's actually not the synchronous failure path. Don't bother testing; it won't fix it. Stay tuned.
Could you ls -l /proc/
The prdcr state is supposed to prevent a reconnect while a connection attempt is outstanding; if for some reason a reconnect were issued anyway, the transport would cause a synchronous error (the fd leak on that path is fixed in the top of tree right now).
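For illustration, here is a minimal sketch of that guard. The names (prdcr_state, prdcr_reconnect_cb) are made up for this sketch and are not the real ldmsd code; the idea is simply that a new connect attempt is only issued from the disconnected state.

```c
/* Sketch of the intended producer-state guard (illustrative names,
 * not the exact ldmsd implementation). */
enum prdcr_state {
	PRDCR_DISCONNECTED,
	PRDCR_CONNECTING,
	PRDCR_CONNECTED,
};

struct prdcr {
	enum prdcr_state state;
	/* transport handle, reconnect timer, ... */
};

static void prdcr_reconnect_cb(struct prdcr *p)
{
	if (p->state != PRDCR_DISCONNECTED)
		return;	/* an attempt is already outstanding */
	p->state = PRDCR_CONNECTING;
	/* Issue the transport connect here. On a synchronous error the
	 * transport should clean up its own fd and the state should
	 * return to DISCONNECTED so the next timer tick can retry. */
}
```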
I've enclosed a patch that will handle the reconnect case (reconnect.patch.txt), but that case really shouldn't happen unless there is some other breakage. If the patch does fix it, though, it will tell us we've got something more sinister going on.
What is the state of the host(s) that are causing this?
Hi @eric-roman, have you had a chance to try the patch I sent? Also, when might we be able to set up a Zoom call to debug this live?
Hi @tom95858, I have. It's showing the same behavior.
Hi @eric-roman, has this been resolved? If so, can you close this issue?
I'll need to test a recent branch. We disabled the stream subscription that introduced the original problem, and haven't re-enabled it. Is OVIS-4 ready for testing?
@eric-roman I've extended the dstat sampler (daemon, watch yourself) to optionally monitor file descriptor usage. This pull request did not make it into 4.3.4-ga, but the sampler patch is easily applied if you want to use it routinely.
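For anyone who wants a quick check in the meantime, here is a hedged sketch of one way a process can count its own open descriptors via /proc. The actual dstat patch may gather this differently; this is only an illustration.

```c
/* Count the entries in /proc/self/fd as a rough measure of open
 * descriptors (illustrative; not the dstat sampler code). */
#include <dirent.h>
#include <stdio.h>

static int count_open_fds(void)
{
	DIR *d = opendir("/proc/self/fd");
	if (!d)
		return -1;
	int n = 0;
	struct dirent *de;
	while ((de = readdir(d)) != NULL) {
		if (de->d_name[0] == '.')
			continue;
		n++;
	}
	closedir(d);
	return n - 1; /* exclude the fd held by opendir() itself */
}

int main(void)
{
	printf("open fds: %d\n", count_open_fds());
	return 0;
}
```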