File descriptors leaking
I'm seeing our ldms aggregators leak file descriptors. We have an aggregator fanout of 2500:1 over the ugni transport.
We're running from an RC (commit 9f4774ca), but I was able to reproduce the problem with OVIS-4.3.3 as well.
From strace at the time the leaks occur we see the following. (And we don't see this trace when file descriptors aren't leaking, AFAICT.)
[pid 58960] 14:10:35 socket(AF_INET, SOCK_STREAM, IPPROTO_IP <unfinished ...>
[pid 58960] 14:10:35 <... socket resumed> ) = 75012
[pid 58960] 14:10:35 fcntl(75012, F_GETFL <unfinished ...>
[pid 58960] 14:10:35 <... fcntl resumed> ) = 0x2 (flags O_RDWR)
[pid 58960] 14:10:35 fcntl(75012, F_SETFL, O_RDWR|O_NONBLOCK <unfinished ...>
[pid 58960] 14:10:35 <... fcntl resumed> ) = 0
[pid 58960] 14:10:35 connect(75012, {sa_family=AF_INET, sin_port=htons(411), sin_addr=inet_addr("10.128.39.42")}, 16 <unfinished ...>
[pid 58960] 14:10:35 <... connect resumed> ) = -1 EINPROGRESS (Operation now in progress)
Tracing down, I expect this occurs in z_ugni_connect, since there are calls to fcntl there followed by a call to connect. (The socket transport does this too, but we don't use it.)
(Perhaps a new connection is being established while a previous one is still in progress and never completes?)
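For context, here is a minimal C sketch of the pattern the strace suggests. This is illustrative only, not the actual z_ugni_connect code; the function and variable names are made up. The point is that the descriptor created by socket() has to be closed on every failure or abandonment path, or each retry leaks one fd.

```c
/* Minimal sketch of the suspected leak pattern (illustrative, not the
 * real z_ugni_connect): a non-blocking socket is created and connect()
 * is started, but if the attempt fails or is abandoned the descriptor
 * is never closed. */
#include <fcntl.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>
#include <errno.h>

static int start_connect(const struct sockaddr_in *sa)
{
	int fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	int fl = fcntl(fd, F_GETFL);
	fcntl(fd, F_SETFL, fl | O_NONBLOCK);

	int rc = connect(fd, (const struct sockaddr *)sa, sizeof(*sa));
	if (rc && errno != EINPROGRESS) {
		/* Synchronous failure: the fd must be closed here,
		 * otherwise it leaks on every retry. */
		close(fd);
		return -1;
	}
	/* EINPROGRESS: completion is reported later by the event loop.
	 * If this attempt is abandoned (e.g. the peer is down and a new
	 * attempt is started), close(fd) must still happen somewhere,
	 * or each retry leaks one descriptor. */
	return fd;
}
```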
@eric-roman this looks like an error-path bug. I expect that we're leaking these fds when we attempt to reconnect to a node that is down. So basically 1 fd every 20 s for each down node over ugni.
How can we make progress on this? Our aggregators leaked about 7400 descriptors per hour today. I've been restarting the aggregators every 6 hrs to work around the problem.
Hi @eric-roman. Could you please pull master and see if the fix works for you? Synchronous connect errors on both the sock and ugni transports were leaking fds. If that's the path that was leaking on your system, the change should fix the issue. I believe that you only need to update the aggregator to test the change.
The fix for OVIS-4 is now on the OVIS-4 branch instead of master.
Ok, I see from your log of the error that it's actually not the synchronous failure path. Don't bother testing; it won't fix it. Stay tuned.
Could you ls -l /proc/
The prdcr state is supposed to prevent a reconnect while a connection attempt is outstanding; if for some reason a reconnect were issued anyway, the transport would cause a synchronous error (the fd leak on that path is fixed in the top of tree right now).
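For illustration, here is a minimal sketch of that guard. The names (prdcr_state, prdcr_reconnect_cb) are made up for this sketch and are not the real ldmsd code; the idea is simply that a new connect attempt is only issued from the disconnected state.

```c
/* Sketch of the intended producer-state guard (illustrative names,
 * not the exact ldmsd implementation). */
enum prdcr_state {
	PRDCR_DISCONNECTED,
	PRDCR_CONNECTING,
	PRDCR_CONNECTED,
};

struct prdcr {
	enum prdcr_state state;
	/* transport handle, reconnect timer, ... */
};

static void prdcr_reconnect_cb(struct prdcr *p)
{
	if (p->state != PRDCR_DISCONNECTED)
		return;	/* an attempt is already outstanding */
	p->state = PRDCR_CONNECTING;
	/* Issue the transport connect here. On a synchronous error the
	 * transport should clean up its own fd and the state should
	 * return to DISCONNECTED so the next timer tick can retry. */
}
```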
I've enclosed a patch that will handle the reconnect case (reconnect.patch.txt), but that case really shouldn't happen unless there is some other breakage. If the patch does fix it, though, it will tell us we've got something more sinister going on.
What is the state of the host(s) that are causing this?
Hi @eric-roman, have you had a chance to try the patch I sent? Also, when might we be able to set up a Zoom call to debug this live?
Hi @tom95858, I have. It's showing the same behavior.
Hi @eric-roman, has this been resolved? If so, can you close this issue?
I'll need to test a recent branch. We disabled the stream subscription that introduced the original problem, and haven't re-enabled it. Is OVIS-4 ready for testing?
@eric-roman I've extended the dstat sampler (daemon, watch yourself) to optionally monitor file descriptor usage. This pull request did not make it into 4.3.4-ga, but the sampler patch is easily applied if you want to use it routinely.
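For anyone who wants a quick check in the meantime, here is a hedged sketch of one way a process can count its own open descriptors via /proc. The actual dstat patch may gather this differently; this is only an illustration.

```c
/* Count the entries in /proc/self/fd as a rough measure of open
 * descriptors (illustrative; not the dstat sampler code). */
#include <dirent.h>
#include <stdio.h>

static int count_open_fds(void)
{
	DIR *d = opendir("/proc/self/fd");
	if (!d)
		return -1;
	int n = 0;
	struct dirent *de;
	while ((de = readdir(d)) != NULL) {
		if (de->d_name[0] == '.')
			continue;
		n++;
	}
	closedir(d);
	return n - 1; /* exclude the fd held by opendir() itself */
}

int main(void)
{
	printf("open fds: %d\n", count_open_fds());
	return 0;
}
```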