Simple Request/Response server side hangs at rc = selector.select(0); in Signaler.waitEvent()
I have written a pair of simple Request / Response programs to load test the performance of JeroMQ, compiled with JDK 1.8 on 64-bit Windows 7.
If I just start the server and launch a client using two threads (i.e. first parameter 2, second parameter, say, 50000 for the _TestRequestResponseClient1_ class), the server _TestRequestResponseServer1_ occasionally waits forever at the Java NIO call selector.select(0) in jeromq's internal _Signaler_ class.
Alternatively, if I start two separate single-threaded clients, it fails the same way.
Once the server hits this issue, it does not go away without a server-side restart. If I just restart all the clients, the server stays stuck at the same place, not detecting any incoming network signal and not responding to any client. This renders JeroMQ unusable unless we can guarantee a single-client, single-server setup.
This only happens on version 0.3.5; with the older version 0.3.4 it works perfectly. Perhaps some refactoring bug was introduced in the new release?
I also tried rewriting the server in C# using NetMQ and connecting to it with the same Java client, and there is no issue. Thus I am quite sure this is some new, hidden server-side NIO bug in JeroMQ 0.3.5.
public boolean waitEvent(long timeout)
{
    int rc = 0;
    try {
        if (timeout == 0) {
            // waitEvent(0) is called every read/send of SocketBase
            // instant readiness is not strictly required
            // On the other hand, we can save lots of system call and increase performance
            return rcursor < wcursor.get();
        }
        else if (timeout < 0) {
            rc = selector.select(0);
        }
        // ... (snippet truncated)
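The select(0) call at the heart of this report is plain Java NIO, so its blocking behavior can be illustrated without jeromq at all. Below is a minimal, self-contained sketch (not jeromq code; the Pipe merely stands in for the channel Signaler registers internally): a positive timeout bounds the wait and returns 0 when nothing is ready, whereas select(0) would block until some registered channel becomes ready.

```java
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

public class SelectTimeoutDemo {
    public static void main(String[] args) throws Exception {
        Selector selector = Selector.open();
        Pipe pipe = Pipe.open();
        // A channel must be non-blocking to be registered with a Selector.
        pipe.source().configureBlocking(false);
        pipe.source().register(selector, SelectionKey.OP_READ);

        // Nothing has been written yet: select(100) waits ~100 ms, returns 0.
        // select(0) here would block indefinitely instead.
        int readyBefore = selector.select(100);
        System.out.println("ready before write: " + readyBefore);

        // Writing a byte makes the source channel readable, so select returns 1.
        pipe.sink().write(ByteBuffer.wrap(new byte[] { 1 }));
        int readyAfter = selector.select(100);
        System.out.println("ready after write: " + readyAfter);

        selector.close();
        pipe.sink().close();
        pipe.source().close();
    }
}
```

The reported hang corresponds to the select(0) case: if the internal wakeup byte is lost or never written, that call never returns.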
_Server_
package com.mqtest;

import org.zeromq.ZMQ;
import org.zeromq.ZMQ.Context;
import org.zeromq.ZMQ.Socket;

public class TestRequestResponseServer1 {
    public static void main(String[] args) {
        int currentServCount = 0;
        try (Context context = ZMQ.context(1)) {
            try (Socket responder = context.socket(ZMQ.REP)) {
                String address = "tcp://localhost:5555";
                responder.bind(address);
                System.out.println("Server bound at " + address);
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        String incomingMessage = responder.recvStr();
                        responder.send("Server Replied [" + currentServCount
                                + "/" + Thread.currentThread().getId()
                                + "] of " + incomingMessage);
                        currentServCount++;
                        if (currentServCount % 1000 == 0)
                            System.out.println("Served " + currentServCount);
                    } catch (Exception e2) {
                        System.out.println(">>>>>>>>>>got exception "
                                + e2.getMessage());
                    }
                }
                System.out.println("server quitting.. interrupted");
                responder.close();
            }
        }
    }
}
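Not part of the original report, but a common mitigation worth noting: setting a receive timeout keeps a server loop from parking forever inside recvStr(), because the call returns null after the timeout instead of blocking. This does not fix the internal Signaler bug described above; it only keeps the application loop responsive. A hedged sketch (the port and the timeout value are arbitrary choices, not from the report):

```java
import org.zeromq.ZMQ;

public class RecvTimeoutSketch {
    public static void main(String[] args) {
        try (ZMQ.Context context = ZMQ.context(1);
             ZMQ.Socket responder = context.socket(ZMQ.REP)) {
            // Make recvStr() return null after 200 ms instead of blocking forever.
            responder.setReceiveTimeOut(200);
            responder.bind("tcp://*:5557"); // placeholder port for this sketch
            // No client is connected, so this times out and returns null;
            // a real server loop would check for null and continue.
            String msg = responder.recvStr();
            System.out.println("recvStr returned: " + msg);
        }
    }
}
```

In the server above, the while loop could check the null return to observe interruption or do housekeeping between requests.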
_Client_
package com.mqtest;

import org.zeromq.ZMQ;
import org.zeromq.ZMQ.Context;
import org.zeromq.ZMQ.Socket;

public class TestRequestResponseClient1 {
    public static void main(String[] args) throws Exception {
        int threadCount = Integer.valueOf(args[0]).intValue();
        int loopCount = Integer.valueOf(args[1]).intValue();
        for (int k = 0; k < threadCount; k++) {
            Thread t = new Thread(() -> {
                try (Context context = ZMQ.context(10);
                     Socket socket = context.socket(ZMQ.REQ)) {
                    socket.connect("tcp://localhost:5555");
                    for (int i = 0; i < loopCount; i++) {
                        long tid = Thread.currentThread().getId();
                        String msg = "hello" + i;
                        System.out.println(tid + " sending " + msg);
                        socket.send(msg);
                        System.out.println(tid + " waiting response");
                        String s = socket.recvStr();
                        System.out.println(tid + " client received [" + s + "]");
                    }
                }
            });
            t.start();
        }
    }
}
_C# Server_
using System;
using NetMQ;
using NetMQ.Sockets;

namespace TestServer1
{
    class TestServer
    {
        static void Main(string[] args)
        {
            int currentServCount = 0;
            using (NetMQContext ctx = NetMQContext.Create())
            {
                using (ResponseSocket server = ctx.CreateResponseSocket())
                {
                    server.Bind("tcp://localhost:5555");
                    while (true)
                    {
                        string msg = server.ReceiveString();
                        server.Send("Ack[" + msg + "]");
                        currentServCount++;
                        if (currentServCount % 1000 == 0)
                            Console.WriteLine("served " + currentServCount);
                    }
                }
            }
        }
    }
}
Thanks for the report. I wasn't able to reproduce the issue on my VMware Windows 7 32-bit instance. Could you please attach a heap dump, taken with jcmd <pid> GC.heap_dump?
Not sure if this is related, but we are experiencing something similar.
We have a ZMQ.PUSH socket that from time to time blocks in send(). This is the thread info I extracted from JConsole; this is a main thread in the app and should be running. I've changed the names to protect the innocent :):
Name: UpdatesThread Thread-6
State: RUNNABLE
Total blocked: 611 Total waited: 1,496
Stack trace:
sun.nio.ch.WindowsSelectorImpl$SubSelector.poll0(Native Method)
sun.nio.ch.WindowsSelectorImpl$SubSelector.poll(WindowsSelectorImpl.java:296)
sun.nio.ch.WindowsSelectorImpl$SubSelector.access$400(WindowsSelectorImpl.java:278)
sun.nio.ch.WindowsSelectorImpl.doSelect(WindowsSelectorImpl.java:159)
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
- locked sun.nio.ch.Util$2@2ac28cb3
- locked java.util.Collections$UnmodifiableSet@5de5a3d
- locked sun.nio.ch.WindowsSelectorImpl@4fc49f29
sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
zmq.Signaler.waitEvent(Signaler.java:148)
zmq.Mailbox.recv(Mailbox.java:105)
zmq.SocketBase.processCommands(SocketBase.java:828)
zmq.SocketBase.send(SocketBase.java:666)
org.zeromq.ZMQ$Socket.send(ZMQ.java:1302)
org.zeromq.ZMQ$Socket.send(ZMQ.java:1276)
com.xxx.yyy.zzz.updates.UpdatesListener$UpdatesThread.lambda$2(UpdatesListener.java:152)
com.xxx.yyy.zzz.updates.UpdatesListener$UpdatesThread$$Lambda$7/1076839376.accept(Unknown Source)
java.util.stream.ForEachOps$ForEachOp$OfRef.accept(ForEachOps.java:184)
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1374)
java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151)
java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174)
java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418)
com.xxx.yyy.zzz.updates.UpdatesListener$UpdatesThread.run(UpdatesListener.java:15
This does not always occur, just from time to time, and it appeared after we updated the JeroMQ jar from 0.3.4 to 0.3.5. When changing back to 0.3.4, we have no issues.
I'm on Windows 8, x64 CPU, JDK 1.8.0_65. On Linux this bug does not seem to reproduce, or at least we weren't able to reproduce it; not sure if this is a Windows-only thing, but I thought I'd mention it.
I think I am seeing the same issue. With 0.3.5, a simple request/reply pattern hangs indefinitely on the listening socket on the first request; after I downgraded to 0.3.4, everything works perfectly.
My environment is: Lubuntu 15.10 in VirtualBox on Windows 10, Oracle JDK 1.8.0_91.
0.3.5 seems to be unstable in many ways. I'll look into the changes after 0.3.4. Thanks for the testing and reporting.
Hey guys, I see v0.3.6 was released at the end of September. By any chance, does it fix this issue too?
I am hitting a similar issue. When I call Context.term() (or .close()), it hangs forever with the main thread in the stack state below. I had two sockets allocated from the context. I tried closing the sockets and context in different orders, and tried adding a brief sleep before closing the context. Nothing made term() return.
This is reproducible 100% of the time, and has forced us to omit the term() call, which exposes a memory leak. (In fact I started working on this after the java process hit OOM and I found thousands of context objects in the heap dump.)
I would like to know if there is an upcoming fix, or at least a reliable workaround to shut down JZMQ cleanly.
Main thread stack trace:

"Test worker" #11 prio=5 os_prio=31 tid=0x00007fabc9ab0800 nid=0x4c03 runnable [0x000070000fb71000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.KQueueArrayWrapper.kevent0(Native Method)
        at sun.nio.ch.KQueueArrayWrapper.poll(KQueueArrayWrapper.java:198)
        at sun.nio.ch.KQueueSelectorImpl.doSelect(KQueueSelectorImpl.java:117)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
        - locked <0x000000077bde7ad8> (a sun.nio.ch.Util$3)
        - locked <0x000000077bde7ac8> (a java.util.Collections$UnmodifiableSet)
        - locked <0x000000077bde7928> (a sun.nio.ch.KQueueSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
        at zmq.Signaler.waitEvent(Signaler.java:148)
        at zmq.Mailbox.recv(Mailbox.java:105)
        at zmq.Ctx.terminate(Ctx.java:197)
        at org.zeromq.ZMQ$Context.term(ZMQ.java:322)
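One commonly recommended shutdown pattern, offered here as a sketch rather than a confirmed fix for the term() hang above: set linger to 0 on every socket so close() drops any unsent messages, close all sockets, and only then call term(). The endpoint and message below are placeholders invented for this sketch; if the internal Signaler bug triggers, even this ordering may not help.

```java
import org.zeromq.ZMQ;

public class CleanShutdownSketch {
    public static void main(String[] args) {
        ZMQ.Context context = ZMQ.context(1);
        ZMQ.Socket push = context.socket(ZMQ.PUSH);
        push.setLinger(0);                    // drop unsent messages on close
        push.connect("tcp://localhost:5558"); // placeholder endpoint, no peer
        push.send("queued but never delivered"); // queued on the pipe, not sent
        push.close();                         // close every socket first...
        context.term();                       // ...so term() has no live sockets to wait on
        System.out.println("terminated cleanly");
    }
}
```

Without setLinger(0), the default linger would make term() wait for the queued message to be delivered, which never happens when no peer exists.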
Action needed: implement the problem described above as a test case.
The test with corresponding code was added in PR https://github.com/zeromq/jeromq/pull/435
@ordersinrandom @BogdanEM @stephanfr @lightningrob Are you still experiencing this issue using the latest version of JeroMQ?
We haven't updated to the latest. We are currently in the process of removing JeroMQ from our project.
@daveyarwood we have tried the latest JeroMQ, 0.4.3, and we are still experiencing this issue under heavy, high-frequency usage. In fact, 0.4.3 seems slightly worse than 0.3.5: the hang used to happen after 1-2 weeks in a test environment with heavy usage; with 0.4.3 it happens after 4 hours.
Our short-term solution is to go back to 0.3.4, which also has an issue but is much better than 0.3.5; longer term, we will most likely remove all JeroMQ usage from our codebase.
Interesting that this issue seems to have been introduced between 0.3.4 and 0.3.5.
@daveyarwood : the issue is there in 0.3.4 for sure. It's just that it is much, much worse in 0.3.5
... and even worse in 0.4.3
@KjellKod Can you describe your test environment? Windows? Linux? Java version(s)?
Linux, java 1.8.0_77
The test environment basically uses a playback system to push real data that was previously recorded on a customer system that had this issue.
Not sure what else you want to know here. It definitely seems like a livelock rather than a deadlock: we sometimes see tiny amounts of work being done once it gets into this state, but it never truly recovers.