
Simple Request/Response server side hangs at rc = selector.select(0); in Signaler.waitEvent()

Open ordersinrandom opened this issue 9 years ago • 17 comments

I have written a pair of simple request/response programs to load-test the performance of JeroMQ. I compiled them with JDK 1.8 on 64-bit Windows 7.

If I just start the server and launch a client with two threads (i.e. first parameter 2, second parameter, say, 50000 for the _TestRequestResponseClient1_ class), the server _TestRequestResponseServer1_ occasionally waits forever at the Java NIO call selector.select(0) in JeroMQ's internal _Signaler_ class.

Alternatively, if I start two separate single-threaded clients, it fails in the same way.

Once the server hits this issue, it does not recover without a server-side restart. If I restart all the clients, the server remains stuck in the same place, not detecting any incoming network traffic and not responding to any client. This renders JeroMQ unusable unless we can guarantee a single-client, single-server setup.

This only happens on version 0.3.5; with the older version 0.3.4 it works perfectly. Perhaps some refactoring bug was introduced in the new release?

I also tried rewriting the server in C# using NetMQ and connecting to it with the same Java client, and there was no issue. So I am fairly sure this is some new hidden server-side NIO bug in JeroMQ 0.3.5.

    public boolean waitEvent(long timeout)
    {
        int rc = 0;

        try {
            if (timeout == 0) {
                // waitEvent(0) is called every read/send of SocketBase
                // instant readiness is not strictly required
                // On the other hand, we can save lots of system call and increase performance
                return rcursor < wcursor.get();

            }
            else if (timeout < 0) {
                rc = selector.select(0);
            }
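For context, the fast path in the excerpt above can be modeled with plain JDK primitives. The sketch below is a simplified, self-contained model (the class and method names are hypothetical; this is not JeroMQ's actual `Signaler`). It shows how `waitEvent(0)` can answer readiness with a cursor comparison instead of a system call, and why a sender that advances `wcursor` without also waking the selector would leave a reader blocked in `selector.select(0)` forever, which matches the symptom reported here:

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified model of the Signaler cursor fast path.
// Hypothetical names, loosely following the excerpt above;
// not the real JeroMQ implementation.
public class SignalerModel {
    // wcursor: events signaled by writers; rcursor: events consumed by the reader.
    private final AtomicLong wcursor = new AtomicLong();
    private long rcursor;

    // Equivalent of waitEvent(0): readiness is just a cursor comparison,
    // avoiding a selector.select() system call entirely.
    public boolean hasEvent() {
        return rcursor < wcursor.get();
    }

    public void send() {
        wcursor.incrementAndGet();
        // The real Signaler must ALSO write a byte to its pipe here so a
        // reader parked in selector.select(0) wakes up; if that wakeup is
        // ever lost, the reader blocks indefinitely.
    }

    public void recv() {
        rcursor++;
    }

    public static void main(String[] args) {
        SignalerModel s = new SignalerModel();
        System.out.println(s.hasEvent()); // false: nothing pending
        s.send();
        System.out.println(s.hasEvent()); // true: wcursor is ahead of rcursor
        s.recv();
        System.out.println(s.hasEvent()); // false again after consuming
    }
}
```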

_Server_

package com.mqtest;

import org.zeromq.ZMQ;
import org.zeromq.ZMQ.Context;
import org.zeromq.ZMQ.Socket;

public class TestRequestResponseServer1 {

    public static void main(String[] args) {

        int currentServCount = 0;
        try (Context context = ZMQ.context(1)) {

            try (Socket responder = context.socket(ZMQ.REP)) {

                String address = "tcp://localhost:5555";
                responder.bind(address);
                System.out.println("Server bound at " + address);

                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        String incomingMessage = responder.recvStr();
                        responder.send("Server Replied [" + currentServCount
                                + "/" + Thread.currentThread().getId()
                                + "] of " + incomingMessage);
                        currentServCount++;

                        if (currentServCount % 1000 == 0)
                            System.out.println("Served " + currentServCount);
                    } catch (Exception e2) {
                        System.out.println(">>>>>>>>>>got exception "
                                + e2.getMessage());
                    }

                }
                System.out.println("server quitting.. interrupted");
            }
        }
    }
}

_Client_

package com.mqtest;

import org.zeromq.ZMQ;
import org.zeromq.ZMQ.Context;
import org.zeromq.ZMQ.Socket;

public class TestRequestResponseClient1 {

    public static void main(String[] args) throws Exception {

        int threadCount = Integer.parseInt(args[0]);

        int loopCount = Integer.parseInt(args[1]);

        for (int k=0;k<threadCount;k++) {
            Thread t=new Thread(() -> {
                try (Context context = ZMQ.context(10);
                        Socket socket = context.socket(ZMQ.REQ)) {

                    socket.connect("tcp://localhost:5555");
                    for (int i=0;i<loopCount;i++) {
                        long tid = Thread.currentThread().getId();
                        String msg = "hello"+i;
                        System.out.println(tid+" sending "+msg);
                        socket.send(msg);
                        System.out.println(tid+" waiting response");
                        String s = socket.recvStr();
                        System.out.println(tid+" client received ["+s+"]");
                    }
                }
            });
            t.start();
        }

    }

}

_C# Server_

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using NetMQ;
using NetMQ.Sockets;

namespace TestServer1
{
    class TestServer
    {
        static void Main(string[] args)
        {
            int currentServCount = 0;
            using (NetMQContext ctx = NetMQContext.Create())
            {
                using (ResponseSocket server = ctx.CreateResponseSocket())
                {
                    server.Bind("tcp://localhost:5555");

                    while (true)
                    {
                        string msg = server.ReceiveString();
                        server.Send("Ack[" + msg + "]");
                        currentServCount++;

                        if (currentServCount % 1000 == 0)
                            Console.WriteLine("served " + currentServCount);
                    }
                }
            }
        }
    }
}

ordersinrandom avatar Dec 01 '15 09:12 ordersinrandom

Thanks for the report. I wasn't able to reproduce the issue on my VMware Windows 7 32-bit instance. Could you please attach a heap dump taken with jcmd <pid> GC.heap_dump?

miniway avatar Dec 07 '15 03:12 miniway

Not sure if this is related, but we are experiencing something similar.

We have a ZMQ.PUSH socket that from time to time blocks in send(). This is the thread info I extracted from JConsole. This is a main thread in the app and should be RUNNABLE. I've changed the names to protect the innocent :):

Name: UpdatesThread Thread-6
State: RUNNABLE
Total blocked: 611  Total waited: 1,496

Stack trace: 
sun.nio.ch.WindowsSelectorImpl$SubSelector.poll0(Native Method)
sun.nio.ch.WindowsSelectorImpl$SubSelector.poll(WindowsSelectorImpl.java:296)
sun.nio.ch.WindowsSelectorImpl$SubSelector.access$400(WindowsSelectorImpl.java:278)
sun.nio.ch.WindowsSelectorImpl.doSelect(WindowsSelectorImpl.java:159)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
   - locked sun.nio.ch.Util$2@2ac28cb3
   - locked java.util.Collections$UnmodifiableSet@5de5a3d
   - locked sun.nio.ch.WindowsSelectorImpl@4fc49f29
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
zmq.Signaler.waitEvent(Signaler.java:148)
zmq.Mailbox.recv(Mailbox.java:105)
zmq.SocketBase.processCommands(SocketBase.java:828)
zmq.SocketBase.send(SocketBase.java:666)
org.zeromq.ZMQ$Socket.send(ZMQ.java:1302)
org.zeromq.ZMQ$Socket.send(ZMQ.java:1276)
com.xxx.yyy.zzz.updates.UpdatesListener$UpdatesThread.lambda$2(UpdatesListener.java:152)
com.xxx.yyy.zzz.updates.UpdatesListener$UpdatesThread$$Lambda$7/1076839376.accept(Unknown Source)
java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
com.xxx.yyy.zzz.updates.UpdatesListener$UpdatesThread.run(UpdatesListener.java:15

This does not always occur, just from time to time, and it appeared after we updated the JeroMQ jar from 0.3.4 to 0.3.5. When we revert to 0.3.4 we have no issues.

I'm on Windows 8, x64 CPU, JDK 1.8.0_65. On Linux this bug does not seem to reproduce, or at least we weren't able to reproduce it; I'm not sure whether this is a Windows-only thing, but I thought I'd mention it.

BogdanEM avatar Jan 28 '16 17:01 BogdanEM

I think I am seeing the same issue. With 0.3.5, a simple request/reply pattern hangs indefinitely in the listening socket on the first request. After I downgraded to 0.3.4, everything works perfectly.

My environment is: Lubuntu 15.10 running in VirtualBox on Windows 10, Oracle JDK 1.8.0_91.

stephanfr avatar Jul 13 '16 23:07 stephanfr

0.3.5 seems to be unstable in many ways. I'll look into the changes made after 0.3.4. Thanks for testing and reporting.

miniway avatar Jul 14 '16 00:07 miniway

Hey guys, I see v0.3.6 was released at the end of September. By any chance does it fix this issue too?

BogdanEM avatar Nov 18 '16 14:11 BogdanEM

I am hitting a similar issue. When I call Context.term() (or .close()), it hangs forever with the main thread in the stack state below. I had two sockets allocated from the context. I tried closing the sockets and the context in different orders, and tried adding a brief sleep before closing the context. Nothing made the term() method return.

This is reproducible 100% of the time, and has forced us to omit the term() call, which exposes a memory leak. (In fact I started working on this after the Java process hit an OOM and I found thousands of context objects in the heap dump.)

I would like to know if there is an upcoming fix, or at least a reliable workaround, to shut down JeroMQ cleanly.

Main thread stack trace:

    "Test worker" #11 prio=5 os_prio=31 tid=0x00007fabc9ab0800 nid=0x4c03 runnable [0x000070000fb71000]
       java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method)
        at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:198)
        at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:117)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <0x000000077bde7ad8> (a sun.nio.ch.Util$3)
        - locked <0x000000077bde7ac8> (a java.util.Collections$UnmodifiableSet)
        - locked <0x000000077bde7928> (a sun.nio.ch.KQueueSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at zmq.Signaler.waitEvent(Signaler.java:148)
        at zmq.Mailbox.recv(Mailbox.java:105)
        at zmq.Ctx.terminate(Ctx.java:197)
        at org.zeromq.ZMQ$Context.term(ZMQ.java:322)
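The trace above is parked in `Selector.select()` with no timeout, which only returns when a registered channel becomes ready or `wakeup()` is called. The self-contained `java.nio` sketch below (JDK classes only, not JeroMQ code) demonstrates the pipe-plus-selector pattern those `zmq.Signaler` frames correspond to: the byte written to the pipe's sink is what releases the reader, so if that write never happens, select(0) blocks forever, which is consistent with the hang in Ctx.terminate:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// Minimal demonstration of the pipe-plus-selector wakeup mechanism.
public class PipeSelectDemo {
    public static void main(String[] args) throws IOException {
        Pipe pipe = Pipe.open();
        pipe.source().configureBlocking(false);

        Selector selector = Selector.open();
        pipe.source().register(selector, SelectionKey.OP_READ);

        // Writer side: this one byte is what releases a reader parked in
        // select(0). If it is ever skipped or lost, select(0) never returns.
        pipe.sink().write(ByteBuffer.wrap(new byte[] { 1 }));

        // Reader side: returns promptly because the pipe is already readable.
        int ready = selector.select(0);
        System.out.println("ready channels: " + ready); // prints "ready channels: 1"

        selector.close();
        pipe.sink().close();
        pipe.source().close();
    }
}
```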

lightningrob avatar Feb 11 '17 01:02 lightningrob

Action needed: implement the problem described above as a test case.

daveyarwood avatar Feb 11 '17 16:02 daveyarwood

The test with corresponding code was added in PR https://github.com/zeromq/jeromq/pull/435

fredoboulo avatar Aug 29 '17 19:08 fredoboulo

@ordersinrandom @BogdanEM @stephanfr @lightningrob Are you still experiencing this issue using the latest version of JeroMQ?

daveyarwood avatar Sep 18 '17 01:09 daveyarwood

We haven't updated to the latest. We are currently in the process of removing JeroMQ from our project.

BogdanEM avatar Sep 18 '17 06:09 BogdanEM

@daveyarwood we have tried the latest JeroMQ, 0.4.3, and we are still experiencing this issue under heavy-frequency usage. In fact 0.4.3 seems slightly worse than 0.3.5: the hang used to happen after 1-2 weeks in a test environment with heavy usage; with 0.4.3 it happens after 4 hours.

The short-term solution is to go back to 0.3.4, which also has the issue but is much better than 0.3.5. Later, as a long-term solution, we will most likely remove all JeroMQ usage from our codebase.

KjellKod avatar Feb 14 '18 16:02 KjellKod

Interesting that this issue seems to have been introduced between 0.3.4 and 0.3.5.

daveyarwood avatar Feb 14 '18 16:02 daveyarwood

@daveyarwood : the issue is there in 0.3.4 for sure. It's just that it is much, much worse in 0.3.5

KjellKod avatar Feb 14 '18 17:02 KjellKod

... and even worse in 0.4.3

KjellKod avatar Feb 14 '18 17:02 KjellKod

@KjellKod Can you describe your test environment? Windows? Linux? Java version(s)?

colini avatar Feb 14 '18 17:02 colini

Linux, java 1.8.0_77

KjellKod avatar Feb 14 '18 17:02 KjellKod

The test environment is basically a playback system pushing real data that was previously recorded on a customer system that had this issue.

Not sure what you want to know here. It definitely seems like more of a livelock than a deadlock: we sometimes see tiny amounts of work being done once it gets into this state, but it never truly recovers.

KjellKod avatar Feb 14 '18 17:02 KjellKod