Ambry nodes get stuck in a loop making a massive number of syscalls after a while

Open siepkes opened this issue 7 years ago • 0 comments

After running for a while, some of my Ambry nodes get caught in an unbounded loop and start making a massive number of syscalls (basically as many as they can until they hit the CPU ceiling). This happens on both frontend nodes and storage nodes. I should mention that we run Ambry on SmartOS (i.e. Solaris). So this could either be an issue specific to the Solaris JVM (and maybe others, like FreeBSD's) or an issue in Ambry that simply hasn't surfaced (yet) on Linux.

Examining a frontend node in this stuck state with truss (i.e. the Solaris equivalent of strace) shows the following:

845203/61:	 0.2340	ioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730)		= 1
845203/61:	 0.2341	ioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730)		= 1
845203/61:	 0.2341	ioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730)		= 1
845203/61:	 0.2342	ioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730)		= 1
845203/61:	 0.2343	ioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730)		= 1
845203/61:	 0.2343	ioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730)		= 1
845203/61:	 0.2344	ioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730)		= 1
845203/61:	 0.2345	ioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730)		= 1

As can be seen above, the JVM is making a massive number of DP_POLL ioctl calls to /dev/poll, each of which returns 1 immediately.
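To put a rough number on "massive", the call rate can be estimated from the truss excerpt itself. The sketch below is a hypothetical helper (`TrussRate` is not part of Ambry or truss) and assumes the second column is a timestamp in seconds relative to the start of the trace, as `truss -d` would print. With only 0.1 ms of timestamp resolution the result is coarse, but for the eight lines above it comes out at roughly 14,000 calls per second on this one thread.

```java
import java.util.List;

// Hypothetical helper for illustration only; not part of Ambry or truss.
final class TrussRate {

    // Assumes a line shaped like "845203/61:\t 0.2340\tioctl(...)", where the
    // second whitespace-separated field is a timestamp in seconds.
    static double timestampOf(String line) {
        String[] fields = line.trim().split("\\s+");
        return Double.parseDouble(fields[1]);
    }

    // Average call rate: number of intervals divided by the elapsed time
    // between the first and the last line.
    static double callsPerSecond(List<String> lines) {
        double first = timestampOf(lines.get(0));
        double last = timestampOf(lines.get(lines.size() - 1));
        return (lines.size() - 1) / (last - first);
    }

    public static void main(String[] args) {
        // The eight truss lines quoted in this issue.
        List<String> lines = List.of(
            "845203/61:\t 0.2340\tioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730)\t\t= 1",
            "845203/61:\t 0.2341\tioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730)\t\t= 1",
            "845203/61:\t 0.2341\tioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730)\t\t= 1",
            "845203/61:\t 0.2342\tioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730)\t\t= 1",
            "845203/61:\t 0.2343\tioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730)\t\t= 1",
            "845203/61:\t 0.2343\tioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730)\t\t= 1",
            "845203/61:\t 0.2344\tioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730)\t\t= 1",
            "845203/61:\t 0.2345\tioctl(22, DP_POLL, 0xFFFFBF7FCEEFD730)\t\t= 1");
        System.out.printf("approx. %.0f calls/s%n", callsPerSecond(lines));
    }
}
```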

By taking a Java thread dump, I can correlate the thread ID in the truss output above (61, which is 0x3d in hexadecimal) with the following Java thread (`nid=0x3d`):

"RequestResponseHandlerThread-0" #22 daemon prio=5 os_prio=64 tid=0x0000000001ecb800 nid=0x3d runnable [0xffffbf7fceefd000]
   java.lang.Thread.State: RUNNABLE
	at sun.nio.ch.DevPollArrayWrapper.poll0(Native Method)
	at sun.nio.ch.DevPollArrayWrapper.poll(DevPollArrayWrapper.java:223)
	at sun.nio.ch.DevPollSelectorImpl.doSelect(DevPollSelectorImpl.java:98)
	at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
	- locked <0x0000000770272010> (a sun.nio.ch.Util$3)
	- locked <0x0000000770272000> (a java.util.Collections$UnmodifiableSet)
	- locked <0x000000077025d610> (a sun.nio.ch.DevPollSelectorImpl)
	at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
	at com.github.ambry.network.Selector.select(Selector.java:469)
	at com.github.ambry.network.Selector.poll(Selector.java:322)
	at com.github.ambry.network.NetworkClient.sendAndPoll(NetworkClient.java:107)
	at com.github.ambry.router.NonBlockingRouter$OperationController.run(NonBlockingRouter.java:722)
	at java.lang.Thread.run(Thread.java:748)
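The correlation between the truss thread ID and the thread dump is just a decimal-to-hexadecimal conversion: truss prints the LWP ID in decimal after the slash (`845203/61`), while the thread dump prints it in hex as `nid`. A minimal sketch of that conversion, using a hypothetical helper class (`ThreadIdCorrelation` is not part of Ambry or the JDK):

```java
// Hypothetical helper for illustration only; not part of Ambry or the JDK.
final class ThreadIdCorrelation {

    // Convert a decimal LWP ID (the number after the slash in truss output)
    // into the nid notation used in a HotSpot thread dump.
    static String lwpToNid(int lwpId) {
        return "0x" + Integer.toHexString(lwpId);
    }

    // Convert a nid from a thread dump (e.g. "0x3d") back into the decimal
    // LWP ID that truss prints.
    static int nidToLwp(String nid) {
        return Integer.decode(nid);
    }

    public static void main(String[] args) {
        System.out.println(lwpToNid(61));      // prints 0x3d
        System.out.println(nidToLwp("0x3d"));  // prints 61
    }
}
```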

I'm hoping you guys might have some insight into (or a hunch about) what might be happening here.

siepkes, Sep 25 '18 10:09