jeromq icon indicating copy to clipboard operation
jeromq copied to clipboard

JDK epoll bug in 0.4.3

Open ymmihw opened this issue 7 years ago • 24 comments

i use jeromq 0.4.3 with jdk 8. sometimes Poller.run get 100% cpu usage without any message. i add some debug info in the Poller and find that rc = selector.select(timeout); always get rc==1. so it can't hit maybeRebuildSelector method.

netty solve the issue by only use a counter, and ignore the 'rc'.

sorry for my poor english

ymmihw avatar Nov 27 '17 10:11 ymmihw

in Poller.java

            logger.debug("rc = {}", rc);
            if (rc == 0) {
                returnsImmediately = maybeRebuildSelector(returnsImmediately, timeout, start);
                continue;
            }
    private int maybeRebuildSelector(int returnsImmediately, long timeout, long start) {
        logger.error("Guess JDK epoll bug! returnsImmediately = {}, timeout = {}, start = {}", returnsImmediately,
                timeout, start);
        // Guess JDK epoll bug
        if (timeout == 0 || System.currentTimeMillis() - start < timeout / 2) {
            logger.error("returnsImmediately++");
            returnsImmediately++;
        } else {
            logger.error("set returnsImmediately = 0");
            returnsImmediately = 0;
        }

        if (returnsImmediately > 10) {
            rebuildSelector();
            returnsImmediately = 0;
        }
        return returnsImmediately;
    }

the log is

2017-11-29 14:05:31.533 [iothread-2] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 14:05:31.533 [iothread-2] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 0, timeout = 0, start = 1511857684346
2017-11-29 14:05:31.533 [iothread-2] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 14:05:31.533 [iothread-2] DEBUG zmq.poll.Poller - rc = 1
2017-11-29 14:05:31.612 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 14:05:31.612 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 0, timeout = 0, start = 1511852348656
2017-11-29 14:05:31.612 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 14:05:31.612 [reaper-1] DEBUG zmq.poll.Poller - rc = 1
2017-11-29 14:05:31.612 [reaper-1] DEBUG zmq.poll.Poller - rc = 1
2017-11-29 14:05:31.612 [reaper-1] DEBUG zmq.poll.Poller - rc = 1
2017-11-29 14:05:31.612 [reaper-1] DEBUG zmq.poll.Poller - rc = 1

ymmihw avatar Nov 29 '17 06:11 ymmihw

Hi! Thanks for reporting that, stacktrace is already cool (looks like the problem is located in the reaper), but we would need some more information to dig deeper on that topic. May you provide a minimal test case to demonstrate how it happens? Information about your OS, ... could be useful as well.

Regarding netty's tactic to detect and get rid of jdk epoll bug, is this the relevant commit, and if not could you provide some more color and insights?

fredoboulo avatar Nov 29 '17 20:11 fredoboulo

I know this affects Android users as well. I was trying to narrow down a bug with @madrzak a little while ago.

trevorbernard avatar Nov 29 '17 21:11 trevorbernard

@trevorbernard I'm a bit rusted, I cannot find ref for that. May you provide any pointers? Was it affecting pre 0.4.1 versions as well?

fredoboulo avatar Nov 29 '17 21:11 fredoboulo

@fredoboulo we didn't create an issue, we were waiting until we narrowed it down more. But that's the same symptoms

trevorbernard avatar Nov 29 '17 23:11 trevorbernard

It was definitely on 0.4.2, not sure if pre 0.4.1.

It occurred on Android after running for an extensive amount of time (>8 hours).

Poller would somehow end up in an infinite loop and eventually drain the battery. Here's the log:

11-08 09:25:53.697 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.697 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.697 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.697 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.697 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.698 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.698 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.698 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.698 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.698 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.698 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.698 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.699 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector 11-08 09:25:53.699 I/System.out: zmq.poll.Poller@e92c35e rebuilding selector

madrzak avatar Nov 29 '17 23:11 madrzak

@fredoboulo My OS is windows 7. The steps to reproduce this issue is:

  1. run the application
  2. hibernate the OS for about an hour
  3. wakeup OS
  4. it may occur

I run 2 applications on the same computer, one is jeromq producer, the other is consumer. When the OS wakes up, the two apps may all crash.

here is the producer's log:

2017-11-29 17:46:35.595 [iothread-2] DEBUG zmq.poll.Poller - rc = 1
2017-11-29 17:46:35.565 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 0, timeout = 0, start = 1511936644295
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.596 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 1, timeout = 0, start = 1511948795596
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.596 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 2, timeout = 0, start = 1511948795596
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.596 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 3, timeout = 0, start = 1511948795596
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.596 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 4, timeout = 0, start = 1511948795596
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.596 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 5, timeout = 0, start = 1511948795596
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.596 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 6, timeout = 0, start = 1511948795596
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.596 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 7, timeout = 0, start = 1511948795596
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.596 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 8, timeout = 0, start = 1511948795596
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.596 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 9, timeout = 0, start = 1511948795596
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.596 [reaper-1] DEBUG zmq.poll.Poller - rc = 0
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - Guess JDK epoll bug! returnsImmediately = 10, timeout = 0, start = 1511948795596
2017-11-29 17:46:35.596 [reaper-1] ERROR zmq.poll.Poller - returnsImmediately++
2017-11-29 17:46:35.597 [netty-listen-thread19] WARN  i.n.channel.nio.NioEventLoop - Failed to re-register a Channel to the new Selector.
java.lang.ClassCastException: io.netty.channel.nio.SelectedSelectionKeySetSelector cannot be cast to java.nio.channels.spi.AbstractSelector
	at java.nio.channels.spi.AbstractSelectableChannel.register(AbstractSelectableChannel.java:212)
	at io.netty.channel.nio.NioEventLoop.rebuildSelector0(NioEventLoop.java:347)
	at io.netty.channel.nio.NioEventLoop.rebuildSelector(NioEventLoop.java:318)
	at io.netty.channel.nio.NioEventLoop.select(NioEventLoop.java:769)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:388)
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
	at java.lang.Thread.run(Thread.java:748)
2017-11-29 17:46:35.598 [iothread-2] DEBUG zmq.poll.Poller - rc = 1

ymmihw avatar Nov 30 '17 01:11 ymmihw

@madrzak, @ymmihw it seems that these are 2 different issues: one where rebuild selector has been performed after 8 hours and continuously rebuilds it; one where it never rebuilds selector. Both of these cases lead to 100% CPU usage. I suggest we fill 2 different issues unless you say otherwise.

fredoboulo avatar Nov 30 '17 19:11 fredoboulo

@madrzak, @ymmihw did you have a try with latest snapshot? A fix occurred recently about infinite looping while sending messages.

fredoboulo avatar Apr 02 '18 19:04 fredoboulo

@trevorbernard This issues still exists on Android using 0.4.3. I left a device running over the weekend and came in this morning to find the ZMQ connection had dropped and the poller was stuck in an infinite loop printing: "I/System.out: zmq.poll.Poller@c121d61 rebuilding selector"

I can't really provide any more information other than that unfortunately

Nodstuff avatar Oct 15 '18 08:10 Nodstuff

@Nodstuff Can you provide the android sdk version and device information?

trevorbernard avatar Oct 15 '18 16:10 trevorbernard

Is more information still needed on this issue? I'm experiencing this issue consistently with 0.4.3 and also 0.4.4-SNAPSHOT on Linux x64, HotSpot 8u181. (Is this possibly fixed in 0.5.0-SNAPSHOT? Is that one available somewhere to test?)

milosonator avatar Nov 26 '18 17:11 milosonator

I am also experiencing this bug with JDK8 on Linux. Our stack looks as follows for the thread consuming 100% cpu

2018-12-07_18:03:45.24915 "iothread-2" #21 daemon prio=5 os_prio=0 tid=0x00007f3ae0cfd800 nid=0x4a3b runnable [0x00007f3afac07000]
2018-12-07_18:03:45.24915    java.lang.Thread.State: RUNNABLE
2018-12-07_18:03:45.24915       at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
2018-12-07_18:03:45.24915       at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
2018-12-07_18:03:45.24915       at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:93)
2018-12-07_18:03:45.24915       at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
2018-12-07_18:03:45.24916       - locked <0x00000005eaef8698> (a sun.nio.ch.Util$3)
2018-12-07_18:03:45.24916       - locked <0x00000005eaef8650> (a java.util.Collections$UnmodifiableSet)
2018-12-07_18:03:45.24916       - locked <0x00000005eb156da8> (a sun.nio.ch.EPollSelectorImpl)
2018-12-07_18:03:45.24916       at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
2018-12-07_18:03:45.24916       at zmq.poll.Poller.run(Poller.java:234)
2018-12-07_18:03:45.24917       at java.lang.Thread.run(Thread.java:748)

Any ideas how we can mitigate it? We are using ZMQ to forward error/access logs to splunk. I dont see any error messages in the log files that could point to the root cause.

ferozed avatar Dec 08 '18 17:12 ferozed

@fredoboulo can you provide more insight into this? This is causing us a lot of problems. I noticed that the code we have has the fix for the jdk epoll bug in maybeRebuildSelector but we are still hitting this issue.

ferozed avatar Feb 04 '19 18:02 ferozed

@ferozed can you provide more info? Which version of jeromq are you using, JVM, OS, ... Which problem are you hitting when going 100%? Rebuilding the selector, never rebuilding it? @milosonator Sorry for answering so late, I was away from this project, mostly for professional reasons. We would welcome any insights about that, there is something wrong happening, but without code reproducing the problem, it will be hard to work on a solution.

fredoboulo avatar Feb 06 '19 00:02 fredoboulo

@fredoboulo Thanks for your quick response. Our scenario is as follows:

  • JeromQ: 0.4.3
  • JRE: 8 ( 8u161 )
  • OS: Linux generic-hostname 3.16.0-6-amd64 #1 SMP Debian 3.16.57-2 (2018-07-14) x86_64 GNU/Linux

The problem is since it happens only in production, and that too after running a long time, we have no way of connecting debugger to figure out whether it is due to rebuilding the selector or never rebuilding it.

I am not sure if our Poller has the Netty fix. if it doesnt, is there any risk in adding it anyway? Since it seems to have fixed the problem for netty people atleast?

I added a PR ( which is for an integration test I wrote). This test implements a custom server talking ZMTP3. And if I close the socket rudely under the client, then it goes into a infinite spin loop with 0.4.3. However, the laster 0.4.4-SNAPSHOT in master does not have that bug.

ferozed avatar Feb 06 '19 01:02 ferozed

@fredoboulo Thanks for getting back on this. I've not worked on that code in a while, so I haven't been affected by this in a while. I'll make sure to (at least attempt to) make a small reproduction sample. (Maybe the latest JVM fixes the issue?)

milosonator avatar Feb 06 '19 16:02 milosonator

@ferozed that seems to be the hard point here: no clear view of what's happening, especially if the problem occurs only after running a long time... Only a trace in System.out indicates if the selector is being rebuilt. 0.4.3 should have the jdk epoll bug fix (it has present in a different way since 2013?). Looking at netty's own fix, we might consider resetting the returnsImmediately counter when an event actually occurred, even if it would not be a silver bullet. Note that rebuilding the selector should re-register channels on the new one.

Does that mean that the latest 0.4.4-SNAPSHOT fixes your problem? I'm not sure to understand correctly what you were saying

EDIT: I forgot to thank you for the code in the PR :1st_place_medal:

fredoboulo avatar Feb 06 '19 22:02 fredoboulo

@milosonator if you could provide such a code, that would definitely be helpful! I'm not sure if changing the JVM would improve the situation, but if so, that's an information worth sharing with the community.

fredoboulo avatar Feb 06 '19 23:02 fredoboulo

I'm facing the rebuilding selector issue also on android device using jeromq 0.4.3. When running monkey test for a long period. Up to now, only the system.out log printed the infinite loop of "rebuilding selector", no clear reproduce method, and no idea how this issue happened

Most badly, the io-thread rebuilding selector deadlock will eat lots of CPU, then cause many ANR to those who using jeromq.

any one could help?

wanghe avatar Mar 05 '19 10:03 wanghe

@wanghe The latest stable version is 0.5.0. Can you try using that and see if the problem still persists?

trevorbernard avatar Mar 05 '19 13:03 trevorbernard

@trevorbernard

  1. Currently I'm working on reproduce the issue and try to print the function call stack trace, then to see what is going on and maybe find a fix solution.
  2. Option 2 is upgrade to 0.5.0 But it is hard to reproduce, even using option 2, it still have risk

Thanks a lot! HE

wanghe avatar Mar 05 '19 15:03 wanghe

from android OS log, cannot find any clue:

03-07 04:44:16.704 D/ZMQHelper( 736): onMsgReceived: TOPIC: IPC_SYSTEM_SETTINGS, PAYLOAD: 000000000400000000 03-07 04:44:16.704 E/ZMQHelper( 736): parsePayloadValue: len = 4, val = 0 03-07 04:44:16.704 D/ZMQHelper( 736): onMsgReceived: tag: 0, val: 0 03-07 04:44:16.704 D/DisplayPowerController( 736): updateClusterLightSensor: displayId = 2, lux = 0.0 03-07 04:44:16.704 D/ZMQHelper( 736): onMsgReceived: TOPIC: IPC_SYSTEM_SETTINGS, PAYLOAD: 03000000040000000A 03-07 04:44:16.704 E/ZMQHelper( 736): parsePayloadValue: len = 4, val = 10 03-07 04:44:16.704 D/ZMQHelper( 736): onMsgReceived: tag: 3, val: 10 03-07 04:44:16.702 I/chatty ( 736): uid=1000(system) InputDispatcher identical 1 line 03-07 04:44:16.705 D/InputDispatcher( 736): Found new focused window : name='Window{347ebc2 u0 com.android.certinstaller/com.android.certinstaller.CertInstallerMain}', displayId=0 03-07 04:44:16.707 I/zygote64(25177): After code cache collection, code=48KB, data=54KB 03-07 04:44:16.707 I/zygote64(25177): Increasing code cache capacity to 256KB 03-07 04:44:16.711 I/System.out( 1416): zmq.poll.Poller@543e4b0 rebuilding selector 03-07 04:44:16.712 W/System.err( 1416): java.lang.Exception 03-07 04:44:16.712 W/System.err( 1416): at zmq.poll.Poller.rebuildSelector(Poller.java:313) 03-07 04:44:16.712 W/System.err( 1416): at zmq.poll.Poller.maybeRebuildSelector(Poller.java:300) 03-07 04:44:16.712 W/System.err( 1416): at zmq.poll.Poller.run(Poller.java:249) 03-07 04:44:16.713 W/System.err( 1416): at java.lang.Thread.run(Thread.java:764) 03-07 04:44:16.713 I/System.out( 2463): zmq.poll.Poller@14ffb5d rebuilding selector 03-07 04:44:16.722 W/System.err( 2463): java.lang.Exception 03-07 04:44:16.723 W/System.err( 2463): at zmq.poll.Poller.rebuildSelector(Poller.java:313) 03-07 04:44:16.723 W/System.err( 2463): at zmq.poll.Poller.maybeRebuildSelector(Poller.java:300) 03-07 04:44:16.723 W/System.err( 2463): at zmq.poll.Poller.run(Poller.java:249) 03-07 04:44:16.723 W/System.err( 2463): at java.lang.Thread.run(Thread.java:764) 03-07 04:44:16.728 I/System.out( 3492): zmq.poll.Poller@32a03d8 rebuilding selector 03-07 04:44:16.728 I/System.out( 1824): zmq.poll.Poller@bfdec9d rebuilding selector

wanghe avatar Mar 07 '19 06:03 wanghe

currently, the issue is: even we rebuild the selector, the selector abnormal return 0 will continue exist, so the deadloop raise.

any idea to recovery from the trap? restart the JVM?

wanghe avatar Mar 08 '19 08:03 wanghe