Ambry nodes get stuck in a tight syscall loop after running for a while
After running for a while, some of my Ambry nodes get caught in an unbounded loop and start making a massive number of syscalls (basically as many as they can until they hit the CPU ceiling). This happens on both frontend nodes and storage nodes. I should mention that we run Ambry on SmartOS (i.e. Solaris), so this could either be an issue specific to the Solaris JVM (and maybe others, like FreeBSD) or an issue in Ambry that simply doesn't surface (yet) on Linux.
Examining a frontend node in this broken state with truss (i.e. the Solaris equivalent of strace) shows the following:
845203/61: 0.2340 ioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730) = 1
845203/61: 0.2341 ioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730) = 1
845203/61: 0.2341 ioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730) = 1
845203/61: 0.2342 ioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730) = 1
845203/61: 0.2343 ioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730) = 1
845203/61: 0.2343 ioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730) = 1
845203/61: 0.2344 ioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730) = 1
845203/61: 0.2345 ioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730) = 1
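As can be seen above, the JVM is making a massive number of DP_POLL calls to /dev/poll, each of which returns 1 immediately. Judging by the timestamps, that is eight calls within about half a millisecond on this one thread.
By taking a Java thread dump, I can correlate the thread ID in the truss output above (61, which is 0x3d in hex) with the following Java thread (`nid=0x3d`):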
"RequestResponseHandlerThread-0" #22 daemon prio=5 os_prio=64 tid=0x0000000001ecb800 nid=0x3d runnable [0xffffbf7fceefd000]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.DevPollArrayWrapper.poll0(Native Method)
at sun.nio.ch.DevPollArrayWrapper.poll(DevPollArrayWrapper.java:223)
at sun.nio.ch.DevPollSelectorImpl.doSelect(DevPollSelectorImpl.java:98)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked <0x0000000770272010> (a sun.nio.ch.Util$3)
- locked <0x0000000770272000> (a java.util.Collections$UnmodifiableSet)
- locked <0x000000077025d610> (a sun.nio.ch.DevPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
at com.github.ambry.network.Selector.select(Selector.java:469)
at com.github.ambry.network.Selector.poll(Selector.java:322)
at com.github.ambry.network.NetworkClient.sendAndPoll(NetworkClient.java:107)
at com.github.ambry.router.NonBlockingRouter$OperationController.run(NonBlockingRouter.java:722)
at java.lang.Thread.run(Thread.java:748)
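For anyone trying to reproduce the correlation above: the thread ID printed by truss is decimal, while the `nid` field in the thread dump is the same ID in hexadecimal. A minimal sketch of the conversion follows. The class and method names (`NidLookup`, `lwpToNid`, `nidToLwp`) are made up for illustration and are not part of Ambry or the JDK tooling.

```java
public class NidLookup {

    // Convert a decimal thread (LWP) ID, as printed by truss,
    // into the hexadecimal "nid" form used in a Java thread dump.
    public static String lwpToNid(int lwpId) {
        return "0x" + Integer.toHexString(lwpId);
    }

    // Convert a thread dump "nid" value (e.g. "0x3d") back to the
    // decimal ID that truss prints. Integer.decode understands the 0x prefix.
    public static int nidToLwp(String nid) {
        return Integer.decode(nid);
    }

    public static void main(String[] args) {
        System.out.println(lwpToNid(61));      // prints 0x3d
        System.out.println(nidToLwp("0x3d"));  // prints 61
    }
}
```

With this, thread 61 from the truss output maps to `nid=0x3d`, which is the `RequestResponseHandlerThread-0` entry in the dump.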
I'm hoping you might have some insight (or a hunch) as to what is happening here.