
Health check leaks file descriptors when DNS servers are unavailable

Open mikkokar opened this issue 6 years ago • 2 comments

The problem

The health check functionality in Styx 0.7 (and possibly 1.n too) leaks file descriptors when no domain name servers are available.

This is easy to reproduce in Vagrant or in Docker.

  1. Configure origin(s) with health checks.
  2. Start Styx.
  3. Change the name server IP address in /etc/resolv.conf to something non-existent.
  4. Watch the file descriptor count increase.
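
For watching the descriptor count, here is a minimal sketch that samples it from inside a JVM. The class name `FdCount` and the sampling interval are illustrative assumptions and not part of Styx. It relies on `com.sun.management.UnixOperatingSystemMXBean`, which is only available on Unix-like JVMs, and it reports the count for the JVM it runs in.

```java
import com.sun.management.UnixOperatingSystemMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper, not part of Styx: samples this JVM's open file descriptor count.
public class FdCount {
    // Returns the number of open descriptors, or -1 when the platform
    // does not expose the Unix-specific operating system bean.
    static long openFileDescriptors() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        if (os instanceof UnixOperatingSystemMXBean) {
            return ((UnixOperatingSystemMXBean) os).getOpenFileDescriptorCount();
        }
        return -1;
    }

    public static void main(String[] args) throws InterruptedException {
        // Take a few samples; a steadily rising number would suggest a leak.
        for (int i = 0; i < 3; i++) {
            System.out.println("open fds: " + openFileDescriptors());
            Thread.sleep(500);
        }
    }
}
```

The same value is exposed as the `OpenFileDescriptorCount` attribute of the `java.lang:type=OperatingSystem` MBean, so a JMX client attached to the Styx process can read it without code changes. On Linux, counting the entries under /proc/<pid>/fd gives an equivalent view from outside the process.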

Detailed description

When no name servers are available, InetAddress.getByName() appears to block for 20 seconds, which blocks the Netty epoll event loop. Meanwhile, the health check monitor keeps creating more connections on subsequent polls. These get queued in the Netty executor queue behind the slow name resolution.

"Health-Check-Monitor-app-Client-Worker-0-Thread@5018" prio=5 tid=0x13 nid=NA runnable
  java.lang.Thread.State: RUNNABLE
      at java.net.Inet4AddressImpl.lookupAllHostAddr(Inet4AddressImpl.java:-1)
      at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
      at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
      at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
      at java.net.InetAddress.getAllByName(InetAddress.java:1193)
      at java.net.InetAddress.getAllByName(InetAddress.java:1127)
      at java.net.InetAddress.getByName(InetAddress.java:1077)
      at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
      at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
      at java.security.AccessController.doPrivileged(AccessController.java:-1)
      at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
      at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
      at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
      at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
      at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
      at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
      at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
      at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
      at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
      at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
      at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
      at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
      at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
      at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
      at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
      at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
      at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:915)
      at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:512)
      at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:423)
      at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:482)
      at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
      at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
      at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:309)
      at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
      at java.lang.Thread.run(Thread.java:748)
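
The queueing effect described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not Styx or Netty code: a single-threaded `ThreadPoolExecutor` stands in for the event loop, a latch stands in for the blocked `InetAddress.getByName()` call, and the class and method names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Simulation only: a single-threaded executor stands in for the Netty event loop,
// and a latch stands in for the slow name resolution.
public class EventLoopBacklog {
    // Submits one blocking "lookup" followed by `polls` further tasks, and
    // returns how many tasks are queued while the lookup is still blocked.
    static int backlogAfter(int polls) throws InterruptedException {
        ThreadPoolExecutor loop = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch lookupStarted = new CountDownLatch(1);
        CountDownLatch lookupDone = new CountDownLatch(1);
        try {
            loop.execute(() -> {
                lookupStarted.countDown();
                try {
                    lookupDone.await(); // stands in for the blocked name resolution
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            lookupStarted.await(); // the only worker thread is now occupied
            for (int i = 0; i < polls; i++) {
                loop.execute(() -> { /* stands in for a queued connection attempt */ });
            }
            return loop.getQueue().size();
        } finally {
            lookupDone.countDown(); // release the simulated lookup
            loop.shutdown();
            loop.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("queued behind lookup: " + backlogAfter(10));
    }
}
```

Every task submitted while the first one is blocked stays in the queue, because the single worker thread cannot pick it up. In the trace above, the resolve call runs from the channel registration callback (register0), which suggests each queued connection attempt may already hold a socket by the time it waits for resolution.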

Acceptance criteria

File descriptor count should remain flat.

mikkokar • May 03 '19 09:05

Thank you @mikkokar for creating this issue.

A couple of other observations:

  1. It appeared to leak file handles only when the application has more than one origin.
  2. Shortening the DNS timeout and reducing the number of retry attempts helped alleviate the issue.
  3. I had no luck reproducing it on macOS, so it may also be OS-specific.

Thanks, Xiuwen

Another thing to add: We saw this issue when only a subset of DNS servers was unavailable.

xiuwyang • May 05 '19 20:05

Revisit this issue after Netty upgrade: PR #484.

mikkokar • Oct 14 '19 10:10

Closing all issues over 3 years old. New issues can be created if problems are still occurring.

kvosper • Jan 11 '24 11:01