NullPointerException: Cannot invoke "zmq.IMailbox.send(zmq.Command)" because "this.slots[tid]" is null

Open inad9300 opened this issue 10 months ago • 4 comments

With version 0.5.4, I interrupted a thread that was running an open subscription and then called ZMonitor.close() and ZContext.close(). This produced the following NPE (the first exception is included for context):

Exception in thread "Thread-244" org.zeromq.ZMQException: Errno 4 : Interrupted function
        at org.zeromq.ZMQ$Socket.mayRaise(ZMQ.java:3732)
        at org.zeromq.ZMQ$Socket.recv(ZMQ.java:3530)
        at org.zeromq.ZMQ$Socket.recv(ZMQ.java:3502)
        ...
        at java.base/java.lang.Thread.run(Thread.java:840)

...

java.lang.NullPointerException: Cannot invoke "zmq.IMailbox.send(zmq.Command)" because "this.slots[tid]" is null
        at zmq.Ctx.sendCommand(Ctx.java:615)
        at zmq.ZObject.sendCommand(ZObject.java:410)
        at zmq.ZObject.sendPipeTermAck(ZObject.java:260)
        at zmq.pipe.Pipe.processPipeTermAck(Pipe.java:421)
        at zmq.ZObject.processCommand(ZObject.java:91)
        at zmq.Command.process(Command.java:79)
        at zmq.SocketBase.processCommands(SocketBase.java:1198)
        at zmq.SocketBase.inEvent(SocketBase.java:1365)
        at zmq.poll.Poller.run(Poller.java:276)
        at java.base/java.lang.Thread.run(Thread.java:840)

Note that this exception is rare: it has shown up only after many similar executions of the same code.

inad9300 avatar Mar 26 '24 15:03 inad9300

Did you try with release 0.6.0?

fbacchella avatar Mar 26 '24 16:03 fbacchella

I confirm this exception can occur in 0.6.0 (this happens sometimes in a scenario like the one described in https://github.com/zeromq/jeromq/issues/984; both issues may be due to the same underlying problem):

Exception in thread "ZMonitor-Sub[56]" java.lang.NullPointerException: Cannot invoke "zmq.IMailbox.send(zmq.Command)" because "this.slots[tid]" is null
        at zmq.Ctx.sendCommand(Ctx.java:662)
        at zmq.ZObject.sendCommand(ZObject.java:410)
        at zmq.ZObject.sendReapAck(ZObject.java:290)
        at zmq.SocketBase.processCommands(SocketBase.java:1183)
        at zmq.SocketBase.send(SocketBase.java:854)
        at zmq.SocketBase.send(SocketBase.java:792)
        at org.zeromq.ZMQ$Socket.send(ZMQ.java:3445)
        at org.zeromq.ZMQ$Socket.send(ZMQ.java:3359)
        at org.zeromq.ZStar$Plateau.run(ZStar.java:503)
        at org.zeromq.ZThread$ShimThread.run(ZThread.java:57)

inad9300 avatar Mar 26 '24 18:03 inad9300

I had an NPE with version 0.5.4 at the same line as in the OP, but with a different stack trace. It turned out that, due to a race condition, I was calling CancellationToken::cancel after the socket-owning thread had closed the socket, so the fault was in my code after all (OTOH, should cancel() on a closed socket really throw an NPE?).

That said, I'm not sure the cancel() code really is correct. The real issue here is that access to slots[tid] is not synchronized properly AFAICS. I guess that's the main reason why the documentation clearly says that a socket should only ever be used by the thread that created it, but the cancellation token deliberately breaks that thread boundary and therefore requires proper synchronization.

pmconrad avatar Jul 16 '24 15:07 pmconrad

If that's the case, then cancel() usage should be discouraged and deprecated for the reasons you described. It would be better to use a pattern that's officially supported, e.g. sending a shutdown command to the socket/thread via the same socket or a different command channel.

trevorbernard avatar Jul 16 '24 17:07 trevorbernard
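
For illustration, the shutdown-command pattern suggested above can be sketched without any JeroMQ calls. This is a library-agnostic model, not JeroMQ API: the BlockingQueue stands in for the command channel, and the names ShutdownCommandSketch, Worker, STOP and runAndStop are hypothetical. In JeroMQ the inbox would presumably map to a control socket polled alongside the data socket, but that mapping is an assumption and is not shown here. The point of the sketch is that no other thread ever touches or closes the worker's resource; other threads only send it a message.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class ShutdownCommandSketch {
    // Hypothetical sentinel standing in for a "shutdown" control message.
    public static final String STOP = "\u0000STOP";

    /** Worker that owns its resource; only this thread ever closes it. */
    public static final class Worker implements Runnable {
        public final BlockingQueue<String> inbox = new LinkedBlockingQueue<>();
        public final List<String> handled = new ArrayList<>();
        public volatile boolean closedByOwner = false;

        @Override
        public void run() {
            try {
                while (true) {
                    String msg = inbox.take();      // blocks, like a recv()
                    if (STOP.equals(msg)) {
                        break;                      // orderly shutdown request
                    }
                    handled.add(msg);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve interrupt status
            } finally {
                // This is where the owning thread would close its own socket.
                closedByOwner = true;
            }
        }
    }

    /** Starts a worker, feeds it messages, then asks it to stop and waits. */
    public static Worker runAndStop(String... messages) {
        Worker worker = new Worker();
        Thread thread = new Thread(worker, "worker");
        thread.start();
        try {
            for (String m : messages) {
                worker.inbox.put(m);
            }
            worker.inbox.put(STOP);                 // request shutdown, no interrupt
            thread.join();                          // wait until the owner has cleaned up
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while stopping worker", e);
        }
        return worker;
    }

    public static void main(String[] args) {
        Worker w = runAndStop("a", "b");
        System.out.println(w.handled + " closedByOwner=" + w.closedByOwner);
    }
}
```

Because the caller joins the worker before doing anything else, any shared context would only be torn down after the owning thread has finished its own cleanup, which avoids the close-versus-cancel ordering problem described in the comments above.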