jeromq
Background IOThread appears to be stuck in handle_connect
I am seeing an issue where a background zmq iothread appears to get stuck, which prevents subsequent zmq requests from being sent. Our program is currently set up with one single context. Using that context, we create multiple servers and clients for communication with other processes. Intermittently this problem occurs and all zmq communication in/out of our program stops working. Looking at the thread dump of the program in this state, I believe the iothread is stuck.
Before the problem occurs, while the program is idle, the iothread looks something like the following as it waits for events:
--- Java Stack for iothread-2:
at java/net/SelectorImpl.selectInternal(J)I (:82:0x150 <JIT>)
at java/net/SelectorImpl.select(J)I (:0x00000056 <JIT>)
at zmq/Poller.run()V (:180:0x1fc <JIT>)
at java/lang/Thread.run()V (:11:0x22 <JIT>)
When the problem occurs, the iothread has a stack trace which appears to be processing a connect event, but it is stuck in the accept() method on the socket and never advances.
--- Java Stack for iothread-2:
at java/net/PlainSocketImpl.accept(Ljava/net/SocketImpl;)V (:5:0x1c <JIT>)
at java/net/ServerSocket.implAccept(Ljava/net/Socket;)V (:35:0x8d <JIT>)
at java/net/ServerSocketChannelImpl$ServerSocketAdapter.accept(Ljava/net/Socket;Ljava/nio/channels/SocketChannel;)Ljava/net/Socket; (:17:0x76 <JIT>)
at java/net/ServerSocketChannelImpl$ServerSocketAdapter.access$000(Ljava/net/ServerSocketChannelImpl$ServerSocketAdapter;Ljava/net/Socket;Ljava/nio/channels/SocketChannel;)Ljava/net/Socket; (:0x00000014 <JIT>)
at java/net/ServerSocketChannelImpl.accept()Ljava/nio/channels/SocketChannel; (:89:0x121 <JIT>)
at java/net/PipeImpl$SourceChannelImpl.accept()V (:11:0x1c <JIT>)
at java/net/PipeImpl$1.run()Ljava/lang/Void; (:18:0x39 <JIT>)
at java/net/PipeImpl$1.run()Ljava/lang/Object; (:0x00000017 <JIT>)
at java/security/AccessController.callPrivilegedExceptionAction(Ljava/security/PrivilegedExceptionAction;)Ljava/lang/Object; (:0x00000018 <JIT>)
at java/security/AccessController.doPrivileged(Ljava/security/PrivilegedExceptionAction;)Ljava/lang/Object; (:0x0000000B <JIT>)
at java/net/PipeImpl.
Due to this, all ZMQ communication in or out via clients/servers under this context fails with timeouts. Are you aware of, or do you have any insight into, why this might be happening? We are currently using jeromq version 0.3.4.
Version 0.3.4 is very old (May 2014)! Can you try the latest version and see if the issue is still there?
I tried updating to version 0.5.1 and was able to reproduce the problem again.
Here is the new stack trace for the iothread:
--- Java Stack for iothread-2:
at java/net/PlainSocketImpl.accept(Ljava/net/SocketImpl;)V (:5:0x1c <JIT>)
at java/net/ServerSocket.implAccept(Ljava/net/Socket;)V (:35:0x8d <JIT>)
at java/net/ServerSocketChannelImpl$ServerSocketAdapter.accept(Ljava/net/Socket;Ljava/nio/channels/SocketChannel;)Ljava/net/Socket; (:17:0x76 <JIT>)
at java/net/ServerSocketChannelImpl$ServerSocketAdapter.access$000(Ljava/net/ServerSocketChannelImpl$ServerSocketAdapter;Ljava/net/Socket;Ljava/nio/channels/SocketChannel;)Ljava/net/Socket; (:0x00000014 <JIT>)
at java/net/ServerSocketChannelImpl.accept()Ljava/nio/channels/SocketChannel; (:89:0x121 <JIT>)
at java/net/PipeImpl$SourceChannelImpl.accept()V (:11:0x1c <JIT>)
at java/net/PipeImpl$1.run()Ljava/lang/Void; (:18:0x39 <JIT>)
at java/net/PipeImpl$1.run()Ljava/lang/Object; (:0x00000017 <JIT>)
at java/security/AccessController.callPrivilegedExceptionAction(Ljava/security/PrivilegedExceptionAction;)Ljava/lang/Object; (:0x00000018 <JIT>)
at java/security/AccessController.doPrivileged(Ljava/security/PrivilegedExceptionAction;)Ljava/lang/Object; (:0x0000000B <JIT>)
at java/net/PipeImpl.
I poked around a bit to see if I could understand why this would happen, and I don't, sorry. Hopefully someone else can weigh in?
Can you provide simple code that reproduces the problem? Some details about your environment (OS, Java version, ...) could be helpful as well.
Linux OS, Java version 1.7. I have not yet been able to isolate a test client that reliably recreates the problem. I can only recreate it within our entire system by running a particular workflow over and over until it occurs.
Our system uses multiple zmq clients for messaging to other processes. It also has a handful of zmq servers created to listen for messages from other processes in the system. Both the clients and the servers are created using the same zmq context. We also have some other servers running that use plain java ServerSockets and do not use zmq at all.
The particular workflow I am using to recreate this involves some messages going out over the zmq clients as well as some messages coming in through the zmq servers. In addition, a separate non-zmq server is created to process some http requests. This server binds a ServerSocket to an arbitrary port, then starts a thread to listen for connections. When it receives one, it spins up a new thread to process the http request. At the end of the workflow the server is shut down.
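To make the lifecycle concrete, here is a rough sketch of what that non-zmq server does (class and method names are hypothetical, not from our code base, and the sketch uses lambdas for brevity even though our system targets Java 1.7):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class HttpWorker {
    private final ServerSocket serverSocket;
    private final Thread acceptThread;

    public HttpWorker() throws IOException {
        serverSocket = new ServerSocket(0); // port 0 = pick an arbitrary free port
        acceptThread = new Thread(() -> {
            try {
                while (true) {
                    Socket s = serverSocket.accept();
                    new Thread(() -> handle(s)).start(); // one thread per request
                }
            } catch (IOException e) {
                // Expected path: close() below unblocks accept() with a
                // SocketException and the accept loop exits.
            }
        });
        acceptThread.start();
    }

    private void handle(Socket s) {
        // ... process the http request, then close the connection
        try { s.close(); } catch (IOException ignored) { }
    }

    public int port() {
        return serverSocket.getLocalPort();
    }

    public void shutdown() throws IOException, InterruptedException {
        serverSocket.close(); // should force the blocked accept() to throw
        acceptThread.join();
    }
}
```

The workflow repeatedly constructs one of these, uses it, and calls shutdown() at the end.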
I added some logging to the non-zmq server code to log the local and remote ports of the sockets returned from its serverSocket.accept() method. I found that right at the time the problem occurs, serverSocket.accept() somehow returns a socket with an entirely different local port than the one the serverSocket was bound to. Digging through the heap dump, it looks like the stuck zmq iothread is holding a socket with the same local port as the mysterious socket returned from the accept() method in the non-zmq server. So did the non-zmq server somehow "steal" the connection that the zmq iothread was expecting? I don't know exactly how to explain this yet. Note that this mysterious socket is returned from serverSocket.accept() right at the moment we are calling close() on the serverSocket. Normally that would cause the thread blocking on accept() to throw a SocketException, but that is not happening here.
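For reference, the check I added around accept() looks roughly like this (checkedAccept is a hypothetical name; the idea is just that every Socket handed back by accept() should share the ServerSocket's local port, so logging both ports flags a "stolen" connection the moment it appears):

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;

public class AcceptPortCheck {
    public static Socket checkedAccept(ServerSocket server) throws IOException {
        Socket s = server.accept();
        System.out.println("accepted local=" + s.getLocalPort()
                + " remote=" + s.getPort()
                + " (bound to " + server.getLocalPort() + ")");
        if (s.getLocalPort() != server.getLocalPort()) {
            // Should be impossible, but this mismatch is exactly what was
            // observed right as close() raced with accept().
            System.err.println("accept() returned a socket for foreign port "
                    + s.getLocalPort());
        }
        return s;
    }
}
```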
That would be hard without an example in hand...
Do I understand correctly that there are zmq-to-zmq connections and non-zmq-to-zmq connections? More in-depth information about your topology would be nice.
Are you using docker?
Apologies for not having an example; I have been unable to reproduce this with an isolated test up to this point.
As for your question, we have zmq-to-zmq connections and non-zmq-to-non-zmq connections in our system. It appears there is a race condition such that when the java/net/ServerSocket is closed, the thread blocking on ServerSocket.accept() returns a Socket intended for a different local port rather than throwing a SocketException as it should.
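A minimal illustration of the behavior we expect but intermittently do not see: closing a ServerSocket while another thread is blocked in accept() should make that accept() throw a SocketException, never hand back a Socket belonging to a different listener. Sketched as a small self-contained check:

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.SocketException;

public class CloseDuringAccept {
    public static boolean closeUnblocksAccept() throws Exception {
        final ServerSocket server = new ServerSocket(0);
        final boolean[] threw = { false };
        Thread acceptor = new Thread(() -> {
            try {
                server.accept(); // blocks; no client ever connects
            } catch (SocketException e) {
                threw[0] = true; // expected: accept() aborts with "Socket closed"
            } catch (IOException ignored) {
            }
        });
        acceptor.start();
        Thread.sleep(200); // give the acceptor time to block in accept()
        server.close();    // the race in question happens around here
        acceptor.join();
        return threw[0];
    }
}
```

In our failure the accept() somehow returns a live socket for another port instead of taking the SocketException path.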
We are not using docker.
I will see if I can put together an isolated program that reproduces the problem to share with you. For now, as a workaround, I have changed our code to avoid repeatedly creating, starting, and stopping the non-zmq server, to reduce the chance of this issue occurring in our system.