Health check leaks file descriptors when DNS servers are unavailable
The problem
The health check functionality in Styx 0.7 (and possibly 1.n too) leaks file descriptors when no domain name servers are available.
This is easy to reproduce in Vagrant or in Docker.
- Configure origin(s) with health checks.
- Start Styx.
- Modify the name server IP address in /etc/resolv.conf to something non-existent.
- Watch the file descriptor count increase.
Detailed description
When no name servers are available, InetAddress.getByName() appears to block for 20 seconds, blocking the Netty epoll event loop. In the meantime, the health check monitor keeps creating more connections on subsequent polls. These get queued up in the Netty executor queue behind the slow name resolution, as the following thread dump shows:
"Health-Check-Monitor-app-Client-Worker-0-Thread@5018" prio=5 tid=0x13 nid=NA runnable
java.lang.Thread.State: RUNNABLE
at java.net.Inet4AddressImpl.lookupAllHostAddr(Inet4AddressImpl.java:-1)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
at java.net.InetAddress.getAllByName(InetAddress.java:1193)
at java.net.InetAddress.getAllByName(InetAddress.java:1127)
at java.net.InetAddress.getByName(InetAddress.java:1077)
at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
at java.security.AccessController.doPrivileged(AccessController.java:-1)
at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:915)
at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:512)
at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:423)
at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:482)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:309)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at java.lang.Thread.run(Thread.java:748)
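The queueing effect described above can be illustrated without Netty or DNS. The following is a minimal sketch, not Styx or Netty code: a single-threaded JDK executor stands in for one event loop thread, a latch stands in for the blocking InetAddress.getByName() call, and empty tasks stand in for health check connection attempts. The class and method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BlockedEventLoopSketch {

    /**
     * Blocks the only worker thread with a simulated slow lookup, submits
     * `polls` further tasks, and returns how many of them are left waiting.
     */
    public static int queuedBehindSlowLookup(int polls) {
        // One worker thread: a stand-in for a single event loop thread.
        ThreadPoolExecutor eventLoop = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch lookupStarted = new CountDownLatch(1);
        CountDownLatch lookupReleased = new CountDownLatch(1);
        try {
            // Stand-in for the blocking name resolution: holds the worker
            // thread until the latch is released.
            eventLoop.execute(() -> {
                lookupStarted.countDown();
                awaitQuietly(lookupReleased);
            });
            awaitQuietly(lookupStarted);

            // Each task stands in for one health check connection attempt.
            for (int i = 0; i < polls; i++) {
                eventLoop.execute(() -> { });
            }
            // Nothing has run yet, so every poll is still in the queue.
            return eventLoop.getQueue().size();
        } finally {
            lookupReleased.countDown();
            eventLoop.shutdown();
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Prints "queued tasks: 10"
        System.out.println("queued tasks: " + queuedBehindSlowLookup(10));
    }
}
```

In the real system each queued connection attempt presumably already holds a socket, which would explain why the file descriptor count grows while the event loop is blocked. The sketch shows only the queueing, not the descriptor allocation.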
Acceptance criteria
File descriptor count should remain flat.
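One way to check this criterion from inside a JVM is sketched below. It assumes a Unix-like platform where the JDK exposes com.sun.management.UnixOperatingSystemMXBean. The class name, the isFlat helper, and its tolerance parameter are hypothetical and not part of Styx. The probe reports on the JVM it runs in, so it would have to run inside the Styx process (or be replaced by counting entries under /proc/<pid>/fd from outside).

```java
import com.sun.management.UnixOperatingSystemMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class FdCountProbe {

    /** Open file descriptors of this JVM, or -1 if the platform does not expose the count. */
    public static long openFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            return ((UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();
        }
        return -1L;
    }

    /** True if the last sample exceeds the first by at most `tolerance`. */
    public static boolean isFlat(long[] samples, long tolerance) {
        if (samples.length < 2) {
            return true;
        }
        return samples[samples.length - 1] - samples[0] <= tolerance;
    }

    public static void main(String[] args) throws InterruptedException {
        // Take one sample per second for five seconds.
        long[] samples = new long[5];
        for (int i = 0; i < samples.length; i++) {
            samples[i] = openFileDescriptors();
            Thread.sleep(1000);
        }
        System.out.println("flat: " + isFlat(samples, 5));
    }
}
```

The tolerance allows for normal fluctuation. A suitable value and sampling window depend on the number of origins and the health check interval.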
Thank you @mikkokar for creating this issue.
A couple of other observations:
- It appeared to leak file handles only when there is more than one origin for the application.
- Shortening the DNS timeout and the number of retry attempts helped alleviate the issue.
- I had no luck reproducing it on macOS, so I am not sure whether this is OS-specific as well.
Thanks, Xiuwen
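The observation that a shorter DNS timeout helps suggests bounding how long a caller waits for name resolution. The following is a minimal sketch of one such approach, which is an assumption and not what Styx or Netty does: the lookup runs on a dedicated executor and the caller gives up after a deadline. The class and method names are hypothetical.

```java
import java.net.InetAddress;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class BoundedLookup {

    /**
     * Runs `lookup` on `pool` and waits at most `timeoutMillis` for the result.
     * Returns an empty Optional on timeout, failure, or interruption.
     */
    public static <T> Optional<T> callWithDeadline(
            ExecutorService pool, Callable<T> lookup, long timeoutMillis) {
        Future<T> future = pool.submit(lookup);
        try {
            return Optional.ofNullable(future.get(timeoutMillis, TimeUnit.MILLISECONDS));
        } catch (TimeoutException e) {
            // Frees the caller. A thread stuck in a native lookup may not
            // react to the interrupt and can stay busy until the OS gives up.
            future.cancel(true);
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        ExecutorService dnsPool = Executors.newSingleThreadExecutor();
        try {
            Optional<InetAddress> address = callWithDeadline(
                    dnsPool, () -> InetAddress.getByName("localhost"), 1000);
            System.out.println(address.map(InetAddress::getHostAddress).orElse("unresolved"));
        } finally {
            dnsPool.shutdownNow();
        }
    }
}
```

This only limits how long the caller waits. Lookups can still pile up on the dedicated executor, so the OS resolver timeout and retry settings remain relevant.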
Another thing to add: We saw this issue when only a subset of DNS servers was unavailable.
Revisit this issue after Netty upgrade: PR #484.
Closing all issues over 3 years old. New issues can be created if problems are still occurring.