How to properly handle ZError from createSocket(int type) in Ctx.java

Open trumpetmonk opened this issue 7 years ago • 8 comments

I'm going to keep this intentionally very generic because I am looking for generic advice.

Our program has a ZMQ context. When createSocket(int type) is called for the first time on that context, the code sets the starting variable to false and does some set-up work for the context. An array of mailboxes is created, and the reaper thread and the ioThreads for the context are instantiated. During this instantiation, a java.net.SocketTimeoutException (a subclass of java.io.IOException) occurs. The ZMQ code catches this exception in the constructor for Signaler and rethrows it as a ZError.IOException.

However, because the context has had its starting variable set to false, the context never gets set up with any empty slots (and potentially no reaper or io threads depending on where the exception was thrown). This effectively makes the context unusable.

Question is in the title. What is the recommended course of action for handling this exception?

trumpetmonk avatar Sep 28 '18 17:09 trumpetmonk

If I understand correctly, the exception is thrown when instantiating either the reaper or the IO threads. If that's the case, I fail to see how the context can be of any use, as these critical components cannot perform correctly.

Can you provide a stacktrace, so we can have a better look at the situation? Which version of jeromq are you using?

fredoboulo avatar Sep 29 '18 07:09 fredoboulo

We are currently using version 3.4. I know it is a little old at this point, but I was not part of that decision process. Here is the relevant part of the stack trace:

zmq.ZError$IOException: java.net.SocketTimeoutException: socket timeout
    at zmq.Signaler.<init>(Unknown Source, bco=217)
    at zmq.Mailbox.<init>(Unknown Source, bco=107)
    at zmq.IOThread.<init>(Unknown Source, bco=168)
    at zmq.Ctx.create_socket(Unknown Source, bco=405)
    at org.zeromq.ZMQ$Socket.<init>(Unknown Source, bco=47)
    ...[Our application code calling new ZMQ.Socket(ctx, REQ)]

This exception was occurring frequently enough that we tried catching it and retrying the socket creation. In some cases, this worked fine: the socket would get created and our message(s) could be sent. In the worst case, however, the exception occurred on the very first socket created on a context, so the context never got set up correctly (as in the above trace). Retrying socket creation in this case leads to an EMFILE illegal state exception, thrown by the context's check on emptySlots, which never got set up because the function bailed out with the exception before it reached that point.
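One possible way to cope with the worst case described above is to treat a context as disposable until its first socket has been created: if that first call fails, discard the context and start over with a fresh one instead of retrying on the broken one. The sketch below shows only that control flow. FreshContextRetry, Result, and the three callbacks are hypothetical names, not JeroMQ API. RuntimeException stands in for ZError.IOException, which is assumed here to be unchecked. Whether a half-initialised context can be terminated cleanly is not established in this thread, so failures while discarding are swallowed.

```java
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical helper (not part of JeroMQ): retries first-socket creation on a fresh context. */
final class FreshContextRetry {
    private FreshContextRetry() {}

    /** Pairs a context with the first socket that was successfully created on it. */
    static final class Result<C, S> {
        final C context;
        final S socket;

        Result(C context, S socket) {
            this.context = context;
            this.socket = socket;
        }
    }

    /**
     * Creates a context and its first socket, making up to maxAttempts attempts.
     * On failure the context is discarded and a new one is created, because a
     * context whose first socket creation failed may be left unusable.
     */
    static <C, S> Result<C, S> open(Supplier<C> newContext,
                                    Function<C, S> firstSocket,
                                    Consumer<C> discard,
                                    int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            C ctx = newContext.get();
            try {
                return new Result<>(ctx, firstSocket.apply(ctx));
            } catch (RuntimeException e) {
                // Assumption: real code would catch only the library's I/O error type.
                last = e;
                try {
                    discard.accept(ctx);
                } catch (RuntimeException ignored) {
                    // The context may be too broken to shut down; move on to a new one.
                }
            }
        }
        throw last; // every attempt failed; surface the most recent error
    }
}
```

In application code the three callbacks would wrap whatever the application already does to create a context, create a socket on it, and terminate the context. Once the first socket exists, later socket-creation failures can be retried on the same context, as described above.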

Although the stack trace ends with the zmq.Signaler, going deeper I could see the following: at SelectorProvider.provider().openSelector(); at Selector.open();

However, the implementation of openSelector() is a black box to me and I cannot see our JVM's implementation of that function. We are currently in communication with our JVM provider as well with this issue to try and understand WHY the exception is being thrown to begin with.
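To help separate a JVM problem from a JeroMQ problem, the selector-opening step from the trace can be exercised on its own with nothing but the standard library. The probe below is a minimal sketch: SelectorProbe and countFailures are hypothetical names, and the attempt count is arbitrary. It repeatedly calls Selector.open(), which delegates to SelectorProvider.provider().openSelector(), and reports any IOException (SocketTimeoutException is a subclass) together with how long the failed call took.

```java
import java.io.IOException;
import java.nio.channels.Selector;

/** Hypothetical stand-alone probe: opens and closes selectors without involving JeroMQ. */
final class SelectorProbe {
    private SelectorProbe() {}

    /** Opens and closes a Selector the given number of times and returns how many attempts failed. */
    static int countFailures(int attempts) {
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            long start = System.nanoTime();
            try (Selector selector = Selector.open()) {
                // Opened successfully; try-with-resources closes it again.
            } catch (IOException e) {
                failures++;
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                System.err.println("attempt " + i + " failed after " + elapsedMs + " ms: " + e);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        int attempts = 1000; // arbitrary; raise it if the failure is rare
        int failures = countFailures(attempts);
        System.out.println(failures + " of " + attempts + " selector opens failed");
    }
}
```

If this probe fails on the affected JVM, the failure can be reported to the JVM provider without any reference to JeroMQ.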

In the meantime, is there a recommended way to handle this exception or is the system designed to fail (crash) at this point?

trumpetmonk avatar Oct 01 '18 15:10 trumpetmonk

Creating and using a selector in the Signaler class is a critical step in the whole system. If the failure can happen for the first socket, it could just as well happen in the IOThread or Reaper signalers. In any case, I do not see how the library is supposed to recover from such an error. Others may have a better opinion on the topic.

Which entity is providing this JVM? I have never encountered such an error with Oracle or OpenJDK JVMs.

fredoboulo avatar Oct 06 '18 10:10 fredoboulo

> We are currently using version 3.4.

@trumpetmonk A number of bugs have been fixed since then. Is the issue you're seeing present with the latest version of JeroMQ?

daveyarwood avatar Oct 28 '18 02:10 daveyarwood

So, there were a couple of interesting things going on.

  1. Digging through the code, I found that there were multiple contexts created at bootup. This was a fault in our application and refactoring the code to create and use a single context (as per your recommendations) made a big difference in reducing (eliminating? Not enough data yet to prove this) the failure to create the context resources.

  2. We still occasionally see socket timeout exceptions when attempting to create subsequent sockets on a context. It is fundamentally the same error and stack trace as I've mentioned above, but because the context has already initialized its key resources, I am able to handle this exception gracefully in my application. I believe this is merely a workaround, though.
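The single-context refactoring from item 1 can be enforced structurally so that no code path creates a second context by accident. The sketch below uses the initialization-on-demand holder idiom, which relies on JVM class initialization to construct the instance lazily and exactly once, even with concurrent callers. SharedContext and its nested Context class are hypothetical: Context is only a placeholder for whatever context type the application actually uses, and the CREATED counter exists purely to make the single construction observable.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical holder giving the whole application one shared context instance. */
final class SharedContext {
    private SharedContext() {}

    /** Placeholder for the application's real messaging context type. */
    static final class Context {
        /** Counts constructions so that accidental duplicates would be visible. */
        static final AtomicInteger CREATED = new AtomicInteger();

        private Context() {
            CREATED.incrementAndGet();
        }
    }

    /** Loaded on first call to get(); the JVM guarantees INSTANCE is built once. */
    private static final class Holder {
        static final Context INSTANCE = new Context();
    }

    /** Returns the process-wide context, creating it on first use. */
    static Context get() {
        return Holder.INSTANCE;
    }
}
```

Every component then calls SharedContext.get() instead of constructing its own context, so the expensive and failure-prone set-up runs only once per process.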

As for updating the version: I looked through some of the change sets surrounding socket creation on the context and saw several changes related to locks and concurrency, especially with regard to creating some of the underlying resources. I have reason to believe that these changes may positively affect our program. However, there was some difficulty going from 3.4 to 4.x, and we could only reasonably go to 3.6 at this time. Our organization is looking into this avenue further.

I will update this as I continue to learn more.

trumpetmonk avatar Oct 30 '18 15:10 trumpetmonk

FWIW, a new version, 0.5.0, should be out soon, which will hopefully fix whatever is keeping you from upgrading beyond 0.3.6.

daveyarwood avatar Oct 30 '18 15:10 daveyarwood

I will keep an eye out for it and we'll reassess upgrading during our next release cycle (we're deep into one currently, so massive changes at this point are typically not great).

Thank you for all the assistance.

trumpetmonk avatar Nov 02 '18 13:11 trumpetmonk

Totally understandable. Thanks for your patience!

daveyarwood avatar Nov 02 '18 14:11 daveyarwood