jeromq
jeromq copied to clipboard
JDK epoll bug in 0.4.3
i use jeromq 0.4.3 with jdk 8.
sometimes Poller.run get 100% cpu usage without any message.
i add some debug info in the Poller and find that rc = selector.select(timeout); always get rc==1. so it can't hit maybeRebuildSelector method.
netty solve the issue by only use a counter, and ignore the 'rc'.
sorry for my poor english
in Poller.java
logger.debug("rc = {}", rc);
if (rc == 0) {
returnsImmediately = maybeRebuildSelector(returnsImmediately, timeout, start);
continue;
}
private int maybeRebuildSelector(int returnsImmediately, long timeout, long start) {
logger.error("Guess JDK epoll bug! returnsImmediately = {}, timeout = {}, start = {}", returnsImmediately,
timeout, start);
// Guess JDK epoll bug
if (timeout == 0 || System.currentTimeMillis() - start < timeout / 2) {
logger.error("returnsImmediately++");
returnsImmediately++;
} else {
logger.error("set returnsImmediately = 0");
returnsImmediately = 0;
}
if (returnsImmediately > 10) {
rebuildSelector();
returnsImmediately = 0;
}
return returnsImmediately;
}
the log is
2017-11-29 14:05:31.533 [iothread-2] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 14:05:31.533 [iothread-2] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 0, timeout = 0, start = 1511857684346
2017-11-29 14:05:31.533 [iothread-2] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 14:05:31.533 [iothread-2] DEBUG zmq.poll.Poller - rc = 1
2017-11-29 14:05:31.612 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 14:05:31.612 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 0, timeout = 0, start = 1511852348656
2017-11-29 14:05:31.612 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 14:05:31.612 [reaper-1] DEBUG zmq.poll.Poller - rc = 1
2017-11-29 14:05:31.612 [reaper-1] DEBUG zmq.poll.Poller - rc = 1
2017-11-29 14:05:31.612 [reaper-1] DEBUG zmq.poll.Poller - rc = 1
2017-11-29 14:05:31.612 [reaper-1] DEBUG zmq.poll.Poller - rc = 1
Hi! Thanks for reporting that, stacktrace is already cool (looks like the problem is located in the reaper), but we would need some more information to dig deeper on that topic. May you provide a minimal test case to demonstrate how it happens? Information about your OS, ... could be useful as well.
Regarding netty's tactic to detect and get rid of jdk epoll bug, is this the relevant commit, and if not could you provide some more color and insights?
I know this affects Android users as well. I was trying to narrow down a bug with @madrzak a little while ago.
@trevorbernard I'm a bit rusted, I cannot find ref for that. May you provide any pointers? Was it affecting pre 0.4.1 versions as well?
@fredoboulo we didn't create an issue, we were waiting until we narrowed it down more. But that's the same symptoms
It was definitely on 0.4.2, not sure if pre 0.4.1.
It occurred on Android after running for an extensive amount of time (>8 hours).
Poller would somehow end up in an infinite loop and eventually drain the battery. Here's the log:
11-08 09:25:53.697 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.697 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.697 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.697 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.697 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.698 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.698 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.698 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.698 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.698 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.698 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.698 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.699 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.699 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector
@fredoboulo My OS is windows 7. The steps to reproduce this issue is:
- run the application
- hibernate the OS for about an hour
- wakeup OS
- it may occur
I run 2 applications on the same computer, one is jeromq producer, the other is consumer. When the OS wakes up, the two apps may all crash.
here is the producer's log:
2017-11-29 17:46:35.595 [iothread-2] DEBUG zmq.poll.Poller - rc = 1
2017-11-29 17:46:35.565 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 0, timeout = 0, start = 1511936644295
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.596 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 1, timeout = 0, start = 1511948795596
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.596 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 2, timeout = 0, start = 1511948795596
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.596 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 3, timeout = 0, start = 1511948795596
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.596 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 4, timeout = 0, start = 1511948795596
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.596 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 5, timeout = 0, start = 1511948795596
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.596 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 6, timeout = 0, start = 1511948795596
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.596 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 7, timeout = 0, start = 1511948795596
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.596 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 8, timeout = 0, start = 1511948795596
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.596 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 9, timeout = 0, start = 1511948795596
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.596 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 10, timeout = 0, start = 1511948795596
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.597 [netty-listen-thread19] WARN i.n.channel.nio.NioEventLoop - Failed to re-register a Channel to the new Selector.
java.lang.ClassCastException: io.netty.channel.nio.SelectedSelectionKeySetSelector cannot be cast to java.nio.channels.spi.AbstractSelector
at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:212)
at io.netty.channel.nio.NioEventLoop.rebuildSelector0(NioEventLoop.java:347)
at io.netty.channel.nio.NioEventLoop.rebuildSelector(NioEventLoop.java:318)
at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:769)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:388)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at java.lang.Thread.run(Thread.java:748)
2017-11-29 17:46:35.598 [iothread-2] DEBUG zmq.poll.Poller - rc = 1
@madrzak, @ymmihw it seems that these are 2 different issues: one where rebuild selector has been performed after 8 hours and continuously rebuilds it; one where it never rebuilds selector. Both of these cases lead to 100% CPU usage. I suggest we fill 2 different issues unless you say otherwise.
@madrzak, @ymmihw did you have a try with latest snapshot? A fix occurred recently about infinite looping while sending messages.
@trevorbernard This issues still exists on Android using 0.4.3. I left a device running over the weekend and came in this morning to find the ZMQ connection had dropped and the poller was stuck in an infinite loop printing: "I/System.out: zmq.poll.Poller@c121d61 rebuilding selector"
I can't really provide any more information other than that unfortunately
@Nodstuff Can you provide the android sdk version and device information?
Is more information still needed on this issue? I'm experiencing this issue consistently with 0.4.3 and also 0.4.4-SNAPSHOT on Linux x64, HotSpot 8u181. (Is this possibly fixed in 0.5.0-SNAPSHOT? Is that one available somewhere to test?)
I am also experiencing this bug with JDK8 on Linux. Our stack looks as follows for the thread consuming 100% cpu
2018-12-07_18:03:45.24915 "iothread-2" #21 daemon prio=5 os_prio=0 tid=0x00007f3ae0cfd800 nid=0x4a3b runnable [0x00007f3afac07000]
2018-12-07_18:03:45.24915 java.lang.Thread.State: RUNNABLE
2018-12-07_18:03:45.24915 at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
2018-12-07_18:03:45.24915 at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
2018-12-07_18:03:45.24915 at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
2018-12-07_18:03:45.24915 at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
2018-12-07_18:03:45.24916 - locked <0x00000005eaef8698> (a sun.nio.ch.Util$3)
2018-12-07_18:03:45.24916 - locked <0x00000005eaef8650> (a java.util.Collections$UnmodifiableSet)
2018-12-07_18:03:45.24916 - locked <0x00000005eb156da8> (a sun.nio.ch.EPollSelectorImpl)
2018-12-07_18:03:45.24916 at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
2018-12-07_18:03:45.24916 at zmq.poll.Poller.run(Poller.java:234)
2018-12-07_18:03:45.24917 at java.lang.Thread.run(Thread.java:748)
Any ideas how we can mitigate it? We are using ZMQ to forward error/access logs to splunk. I dont see any error messages in the log files that could point to the root cause.
@fredoboulo can you provide more insight into this? This is causing us a lot of problems. I noticed that the code we have has the fix for the jdk epoll bug in maybeRebuildSelector but we are still hitting this issue.
@ferozed can you provide more info? Which version of jeromq are you using, JVM, OS, ... Which problem are you hitting when going 100%? Rebuilding the selector, never rebuilding it? @milosonator Sorry for answering so late, I was away from this project, mostly for professional reasons. We would welcome any insights about that, there is something wrong happening, but without code reproducing the problem, it will be hard to work on a solution.
@fredoboulo Thanks for your quick response. Our scenario is as follows:
- JeromQ: 0.4.3
- JRE: 8 ( 8u161 )
- OS: Linux generic-hostname 3.16.0-6-amd64 #1 SMP Debian 3.16.57-2 (2018-07-14) x86_64 GNU/Linux
The problem is since it happens only in production, and that too after running a long time, we have no way of connecting debugger to figure out whether it is due to rebuilding the selector or never rebuilding it.
I am not sure if our Poller has the Netty fix. if it doesnt, is there any risk in adding it anyway? Since it seems to have fixed the problem for netty people atleast?
I added a PR ( which is for an integration test I wrote). This test implements a custom server talking ZMTP3. And if I close the socket rudely under the client, then it goes into a infinite spin loop with 0.4.3. However, the laster 0.4.4-SNAPSHOT in master does not have that bug.
@fredoboulo Thanks for getting back on this. I've not worked on that code in a while, so I haven't been affected by this in a while. I'll make sure to (at least attempt to) make a small reproduction sample. (Maybe the latest JVM fixes the issue?)
@ferozed that seems to be the hard point here: no clear view of what's happening, especially if the problem occurs only after running a long time... Only a trace in System.out indicates if the selector is being rebuilt. 0.4.3 should have the jdk epoll bug fix (it has present in a different way since 2013?). Looking at netty's own fix, we might consider resetting the returnsImmediately counter when an event actually occurred, even if it would not be a silver bullet. Note that rebuilding the selector should re-register channels on the new one.
Does that mean that the latest 0.4.4-SNAPSHOT fixes your problem? I'm not sure to understand correctly what you were saying
EDIT: I forgot to thank you for the code in the PR :1st_place_medal:
@milosonator if you could provide such a code, that would definitely be helpful! I'm not sure if changing the JVM would improve the situation, but if so, that's an information worth sharing with the community.
I'm facing the rebuilding selector issue also on android device using jeromq 0.4.3. When running monkey test for a long period. Up to now, only the system.out log printed the infinite loop of "rebuilding selector", no clear reproduce method, and no idea how this issue happened
Most badly, the io-thread rebuilding selector deadlock will eat lots of CPU, then cause many ANR to those who using jeromq.
any one could help?
@wanghe The latest stable version is 0.5.0. Can you try using that and see if the problem still persists?
@trevorbernard
- Currently I'm working on reproduce the issue and try to print the function call stack trace, then to see what is going on and maybe find a fix solution.
- Option 2 is upgrade to 0.5.0 But it is hard to reproduce, even using option 2, it still have risk
Thanks a lot! HE
from android OS log, cannot find any clue:
03-07 04:44:16.704 D/ZMQHelper( 736): onMsgReceived: TOPIC: IPC_SYSTEM_SETTINGS, PAYLOAD: 000000000400000000 03-07 04:44:16.704 E/ZMQHelper( 736): parsePayloadValue: len = 4, val = 0 03-07 04:44:16.704 D/ZMQHelper( 736): onMsgReceived: tag: 0, val: 0 03-07 04:44:16.704 D/DisplayPowerController( 736): updateClusterLightSensor: displayId = 2, lux = 0.0 03-07 04:44:16.704 D/ZMQHelper( 736): onMsgReceived: TOPIC: IPC_SYSTEM_SETTINGS, PAYLOAD: 03000000040000000A 03-07 04:44:16.704 E/ZMQHelper( 736): parsePayloadValue: len = 4, val = 10 03-07 04:44:16.704 D/ZMQHelper( 736): onMsgReceived: tag: 3, val: 10 03-07 04:44:16.702 I/chatty ( 736): uid=1000(system) InputDispatcher identical 1 line 03-07 04:44:16.705 D/InputDispatcher( 736): Found new focused window : name='Window{347ebc2 u0 com.android.certinstaller/com.android.certinstaller.CertInstallerMain}', displayId=0 03-07 04:44:16.707 I/zygote64(25177): After code cache collection, code=48KB, data=54KB 03-07 04:44:16.707 I/zygote64(25177): Increasing code cache capacity to 256KB 03-07 04:44:16.711 I/System.out( 1416): zmq.poll.Poller@543e4b0 rebuilding selector 03-07 04:44:16.712 W/System.err( 1416): java.lang.Exception 03-07 04:44:16.712 W/System.err( 1416): at zmq.poll.Poller.rebuildSelector(Poller.java:313) 03-07 04:44:16.712 W/System.err( 1416): at zmq.poll.Poller.maybeRebuildSelector(Poller.java:300) 03-07 04:44:16.712 W/System.err( 1416): at zmq.poll.Poller.run(Poller.java:249) 03-07 04:44:16.713 W/System.err( 1416): at java.lang.Thread.run(Thread.java:764) 03-07 04:44:16.713 I/System.out( 2463): zmq.poll.Poller@14ffb5d rebuilding selector 03-07 04:44:16.722 W/System.err( 2463): java.lang.Exception 03-07 04:44:16.723 W/System.err( 2463): at zmq.poll.Poller.rebuildSelector(Poller.java:313) 03-07 04:44:16.723 W/System.err( 2463): at zmq.poll.Poller.maybeRebuildSelector(Poller.java:300) 03-07 04:44:16.723 W/System.err( 2463): at zmq.poll.Poller.run(Poller.java:249) 03-07 04:44:16.723 W/System.err( 2463): at java.lang.Thread.run(Thread.java:764) 03-07 04:44:16.728 I/System.out( 3492): zmq.poll.Poller@32a03d8 rebuilding selector 03-07 04:44:16.728 I/System.out( 1824): zmq.poll.Poller@bfdec9d rebuilding selector
currently, the issue is: even we rebuild the selector, the selector abnormal return 0 will continue exist, so the deadloop raise.
any idea to recovery from the trap? restart the JVM?