WolframClientForPython

socket failures that take hours to heal

Open stopatz opened this issue 3 years ago • 0 comments

I use a Wolfram session to compute the integrand in the Vegas algorithm in Python.

I use MPI to start one session per core on a high-performance cluster.

Before I start a session, I want to kill any stray Mathematica processes, so I use the kernelcontroller module as follows:

controller = kernelcontroller.WolframKernelController(kernel='path', kernel_loglevel=1)

controller._kernel_stop()

Now, if I wait 10 minutes after this clean-up, my actual code

with WolframLanguageSession('path') as session:...

works fine most of the time.
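For reference, the two-step routine described above (clean-up, wait, then open a session) could be written as one function. This is only a minimal sketch under stated assumptions: the controller and session calls are the ones quoted in this post, `'path'` remains the same placeholder, and the names `cleanup_then_run`, `make_controller`, `open_session` and `sleep` are hypothetical helpers introduced here so the flow can be exercised without a kernel. They are not part of wolframclient.

```python
import time


def cleanup_then_run(kernel_path, work, wait_seconds=600.0,
                     make_controller=None, open_session=None, sleep=time.sleep):
    """Mirror the routine from the post: stop, wait, then open a session.

    `make_controller` and `open_session` are hypothetical injection points
    (not part of wolframclient); when omitted, the real classes are used.
    """
    if make_controller is None or open_session is None:
        # Lazy import: wolframclient is only needed when the real client is used.
        from wolframclient.evaluation import WolframLanguageSession
        from wolframclient.evaluation.kernel import kernelcontroller

        if make_controller is None:
            def make_controller(path):
                return kernelcontroller.WolframKernelController(
                    kernel=path, kernel_loglevel=1)
        if open_session is None:
            open_session = WolframLanguageSession

    # Step 1: the clean-up call quoted in the post. `_kernel_stop` is a
    # private method, so its behaviour may differ between versions.
    controller = make_controller(kernel_path)
    controller._kernel_stop()

    # Step 2: wait (10 minutes in the post), then run the work in a session.
    sleep(wait_seconds)
    with open_session(kernel_path) as session:
        return work(session)
```

With the real client, `work` would be a function that takes the open session and evaluates the integrand, and each MPI rank would call `cleanup_then_run('path', work)` once.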

But at seemingly random times, I get socket failures when I run the two-step process (cleanup, then run session), with multiple instances of the following error message:

    Socket exception: Failed to read any message from socket tcp://127.0.0.1:39237 after 20.0 seconds and 199 retries.
    Failed to start.
    Traceback (most recent call last):
      File "/home/sjsuh/anaconda3/lib/python3.9/site-packages/wolframclient/evaluation/kernel/kernelcontroller.py", line 435, in _kernel_start
        response = self.kernel_socket_in.recv_abortable(
      File "/home/sjsuh/anaconda3/lib/python3.9/site-packages/wolframclient/evaluation/kernel/zmqsocket.py", line 53, in recv_abortable
        raise SocketOperationTimeout(
    wolframclient.evaluation.kernel.zmqsocket.SocketOperationTimeout: Failed to read any message from socket tcp://127.0.0.1:39237 after 20.0 seconds and 199 retries.

Once this happens, I have to wait around 3 hours before my routine will run again. If I retry any sooner, the socket failure persists.

So my questions are: i) is there a better way to kill stray processes than the one I have used, ii) why am I getting the socket failures, and iii) is there a way to heal the socket failures faster?

stopatz, Apr 02 '22 07:04