ZMQ freezes on first command after session is created

Open CSchoel opened this issue 5 years ago • 7 comments

Sometimes (quite rarely), the first call to sendExpression() after an OMCSession is created freezes.

Stacktrace of InterruptException (after CTRL-C):

 [1] wait(::FileWatching._FDWatcher; readable::Bool, writable::Bool) at /build/julia/src/julia-1.5.0/usr/share/julia/stdlib/v1.5/FileWatching/src/FileWatching.jl:529
 [2] wait at /home/cslz90/.julia/packages/ZMQ/R3wSD/src/socket.jl:52 [inlined]
 [3] _recv!(::ZMQ.Socket, ::ZMQ.Message) at /home/cslz90/.julia/packages/ZMQ/R3wSD/src/comm.jl:75
 [4] recv at /home/cslz90/.julia/packages/ZMQ/R3wSD/src/comm.jl:94 [inlined]
 [5] sendExpression(::OMJulia.OMCSession, ::String) at /home/cslz90/.julia/packages/OMJulia/ZLXEs/src/OMJulia.jl:1014
 [6] setupOMCSession(::String, ::String; quiet::Bool, checkunits::Bool) at /home/cslz90/.julia/packages/ModelicaScriptingTools/G5LLK/src/ModelicaScriptingTools.jl:374

setupOMCSession is my own code. It contains the following relevant lines; the second one is the line that shows up in the stacktrace:

omc = OMCSession()
sendExpression(omc, "cd(\"$(moescape(outdir))\")")

This happens with the release version 0.1.0 of OMJulia. I believe I have also encountered it with the current version from the master branch in the past, but I cannot confirm that, since I switched back to the official released version some time ago.

I will try to introduce a sleep of 100 ms between the creation of the session and the first sendExpression() and report back whether this workaround is successful.

CSchoel avatar Sep 08 '20 10:09 CSchoel

One additional note: Together with #32, one might get the impression that any sendExpression() call might freeze. However, across several hundred test runs over the last months, I never encountered a freeze between individual simulations, only at the very start or at the end of the pipeline.

CSchoel avatar Sep 08 '20 10:09 CSchoel

Update: I gradually increased the sleep from 100 ms to 500 ms, but still got occasional freezes. My next best guess is this suggestion from a related issue in ZMQ.jl: https://github.com/JuliaInterop/ZMQ.jl/issues/87#issuecomment-131153884

function avoidStartupFreeze(omc::OMCSession)
    status = :started
    timeout = 0.1
    while status != :received
        # send a simple command to OMC
        send(omc.socket, "getVersion()")
        # use julia task to allow recv to run into a timeout
        c = Channel()
        @async put!(c, (recv(omc.socket), :received))
        @async (sleep(timeout); put!(c, (nothing, :timedout)))
        data, status = take!(c)
        if status == :timedout
            @warn("getVersion() timed out in avoidStartupFreeze")
        end
    end
end

This sends getVersion() to the OMC until an answer is received in less than 100 ms. I am not sure whether this (rather crude) timeout mechanism will work when ZMQ actually freezes, since the issue is not reliably reproducible. I will report back when I encounter a case where the warning message is issued.
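The race between a blocking call and a timer described above can be tried out without ZMQ or OMC. The following is only a minimal sketch under stated assumptions: `with_timeout` is a hypothetical helper name, the anonymous functions are stand-ins for `recv(omc.socket)`, and the buffered `Channel{Any}(2)` is an assumption made here so that the task that loses the race can still complete its `put!` instead of blocking forever. The helper only stops waiting; a call that timed out keeps running in the background.

```julia
# Hypothetical helper: run `f()` in a task and wait at most `timeout` seconds.
# Returns (value, :received) if `f` finishes in time, (nothing, :timedout) otherwise.
function with_timeout(f, timeout::Real)
    # Capacity 2 lets both tasks finish their put! even though only one result is taken.
    c = Channel{Any}(2)
    @async put!(c, (f(), :received))
    @async (sleep(timeout); put!(c, (nothing, :timedout)))
    return take!(c)
end

# Stand-ins for a blocking call that answers quickly and one that stalls.
fast_result = with_timeout(() -> "reply", 1.0)
slow_result = with_timeout(() -> (sleep(2.0); "too late"), 0.1)
```

In the function above, the same pattern would wrap `recv(omc.socket)` with `timeout = 0.1`.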

CSchoel avatar Nov 03 '20 16:11 CSchoel

Update can be found here: https://github.com/THM-MoTE/ModelicaScriptingTools.jl/issues/9

The solution avoids freezes, but ZMQ crashes with a ZMQ.StateError.

CSchoel avatar Nov 12 '20 17:11 CSchoel

Another update: I have now improved the function avoidStartupFreeze to a point where it simply discards the whole OMCSession and creates a new one when a timeout is detected.

function avoidStartupFreeze(omc::OMCSession)
    function reconnect(omc::OMCSession)
        try
            send(omc.socket, "quit()")
        catch
            # ignore errors while shutting down the old session
        end
        return OMCSession()
    end
    status = :started
    timeout = 0.1
    while status != :received
        # send a simple command to OMC
        send(omc.socket, "getVersion()")
        # use julia task to allow recv to run into a timeout
        # idea from https://github.com/JuliaInterop/ZMQ.jl/issues/87#issuecomment-131153884
        c = Channel()
        @async put!(c, (recv(omc.socket), :received))
        @async (sleep(timeout); put!(c, (nothing, :timedout)))
        data, status = take!(c)
        if status == :timedout
            omc = reconnect(omc)
        end
    end
    return omc
end

So far this works great, although it is more of a workaround than a solution.

CSchoel avatar Nov 19 '20 16:11 CSchoel

@CSchoel thank you for that workaround. I also observed the startup freeze, but additionally have problems when running thousands of simulations in a row - at some point the communication fails.

ghost avatar Mar 31 '21 09:03 ghost

@DarkVador42 you're welcome. I am happy that it could be of help to someone else. :smile:

Is your error by any chance related to a ZMQ.StateError? This is the only additional problem that I encountered with this method and it only occurs during the creation of an OMCSession instance. I use a very crude solution for this which just recreates the session until there is no error and up until now it works. :shrug:
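For reference, the "recreate until there is no error" idea could look roughly like the sketch below. This is not the code from ModelicaScriptingTools: `create_with_retry`, `max_attempts`, and `delay` are hypothetical names chosen for illustration, and `flaky_create` is a stand-in for `OMCSession()` so that the example runs without OMC installed. A real version might only retry on `ZMQ.StateError` instead of on every exception.

```julia
# Hypothetical helper: call `create()` until it succeeds or `max_attempts` is reached.
function create_with_retry(create; max_attempts::Integer=5, delay::Real=0.1)
    max_attempts >= 1 || throw(ArgumentError("max_attempts must be at least 1"))
    for attempt in 1:max_attempts
        try
            return create()
        catch err
            # Give up and propagate the last error once all attempts are used.
            attempt == max_attempts && rethrow()
            @warn "Session creation failed, retrying" attempt err
            sleep(delay)
        end
    end
end

# Stand-in for OMCSession(): fails twice, then succeeds.
calls = Ref(0)
flaky_create() = (calls[] += 1) < 3 ? error("simulated startup failure") : :session
session = create_with_retry(flaky_create; delay=0.01)
```

With OMJulia loaded, the call would presumably be `omc = create_with_retry(() -> OMCSession())`.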

CSchoel avatar Apr 09 '21 17:04 CSchoel

@CSchoel, yes, it also happens regularly when I create the OMCSession. Apart from that it also froze when I had thousands of model calls, where it was trapped inside a "wait" function of ZMQ - I cannot be more specific here, since I was not able to reproduce the error...

ghost avatar Apr 12 '21 07:04 ghost