jpype icon indicating copy to clipboard operation
jpype copied to clipboard

JVM shutdown differences after 0.7.5

Open max3-2 opened this issue 4 years ago • 18 comments

Hi,

I'm using a JVM to work with another commercial application. The interfacing is baed on https://github.com/John-Hennig/MPh As stated in an issue there, since versions >0.7.5 the shutdownJVM() method hangs indefinitely after meaningful communication using the JVM. A quick description can be found here https://github.com/John-Hennig/MPh/issues/15 a longer discussion here https://github.com/John-Hennig/MPh/issues/1

I'm starting this here to "simply" ask if there was a major change between 0.7.5. and >1.0 (I assume that with the version skip) which influenced shutdownJVM(), especially regarding timeouts and / or graceful quits vs. terminations. Any idea would help right now!

Thanks!

max3-2 avatar Feb 11 '21 15:02 max3-2

Prior to 0.7.5 the shutdown method in JPype was a forced terminate of the JVM. There were no clean up routines executed at all. This prevented tools such as database flushes or code coverage tools from working properly.

Around 0.7.5 we reworked the shutdown to be correct. It now waits for all non daemon threads to terminate and calls all Java atexit routines. Unfortunately, this causes an issue with any tool that has problems with the normal shutdown sequence. Some tools create non-daemon threads without proper shutdown routines, others have deadlocks on processes termination, or in some cases a Python implementation of a Java class may be stuck on an IO call (Java can't propagate its shutdown to Python blocking code) .

Unfortunately, none of these are necessarily JPype bugs. They are bugs that existing in the code before that were being hidden by the brutal JPype shutdown routine. The old behavior can be gotten either by calling Java exit or Java terminate depending on where the deadlock is.

Thrameos avatar Feb 11 '21 17:02 Thrameos

Thanks for the quick reply - this actually explains it. The 3rd party software actually seems to rely on a System.exit() call which points to some threads not exiting well on their side. I am on to them but thats a commercial package and thus more complicated.

How would one call java exit or java terminate. Currently, in the abovementioned package this is implemented in a way that actually exits the interpreter (python). Is there a way to force the JVM down without quitting the interpreter, as shutdownJVM would do?

max3-2 avatar Feb 11 '21 17:02 max3-2

That is a good question. JPype pretty much marries the two processes together so the concept of continuing after the shutdown is usually less stable that one would hope. If I had my druthers I would never have exposed a shutdown hook in the first place. It was an endless source of bugs as the partially dead Java classes serve as landmines against continued operations.

I think that most of the available calls from within Java tend to drag the process down with it. You may be able to add a Java atexit routine which calls "Runtime.halt()". The shutdown sequence would start from within Python it hits the halt and the call should return control back to Python. But this is a very untested path.

https://www.baeldung.com/java-runtime-halt-vs-system-exit

Thrameos avatar Feb 11 '21 17:02 Thrameos

Sadly, halt does not return to the interpreter - it pretty much behaves like exit.

Let me think out loud here: Would it be possible to expose the old method using a flag of some sorts to get upgrade support and still have the old shutdown?

Background: I do know this is not good practice. But honestly, from my discussions with the commercial software they won't budge. At least not in a fairly acceptable timeframe.

This could be a debug mode or an argument to shutdownJVM(). Since usually this is not even exposed, exposing it using atexit would mean the user at least knows what he (she...) is doing. Also, since this is explicitly something atexit, landmining subsequent code would not be a big issue. Maybe there are other commercial packages (or will be) which behave in a similar way

max3-2 avatar Feb 12 '21 09:02 max3-2

I took the freedom and built a working version. Not pretty but a start. If you want I can do a PR and / or edit further

https://github.com/max3-2/jpype/tree/max3-2-enablingDebugShutdownFlag

max3-2 avatar Feb 12 '21 12:02 max3-2

So lets look at the options we can have on shutdown.

  1. Simply disabling the call to shutdown by adding a flag to cause shutdown to be a nop. Safe because then Python will terminate when it calls exit on the process.
  2. Attempting to unload without cleanup routines. Here we just skip Destroy call. The issue with this one is since there are still active threads running when we unload anything can happen as the call stack for the live threads disappears so instant crash. Worse since Java owns all the memory in C++ that means all Python object which Java referenced and the whole C++ layer leak.
  3. (old method) Attempting to ask JPype to remove those resources from a live JVM and then skip the destroy. This does prevent the leaking but not the race conditions. And it introduces a new set of race conditions as those resources may disappear while Java calls are still taking place.
  4. (current method) Jpype asks the JVM to destroy itself. All non-daemon threads are terminated so the majority of the active hooks are killed. We then clean up resources. Once the JVM is dead we unload it. (but this still isn't enough)
  5. JPype calls exit at the end of shutdown and is guaranteed never to return to Python. Shutdown now always terminates the process. This can be called just before the resources are killed. (safe because there are no race conditions) But if Python set an exit code we will lose it.

The only one that wont lead to a crash is option 1 and option 5. Option 4 is safer, but as you will see from the other issues (#934) it still isn't enough. The problem is although we got rid of the non-daemon threads, daemon threads can still cause a race condition with the resource deletion. It is pretty easy to demonstrate this. Create thread as a daemon that installs a @JImplements (called a proxy in Java) which is constantly calling back and forth to Java. When the shutdown goes, we will get to the point that the JVM says it is safe to proceed because nothing but non-daemon threads remain alive. Here is a race condition, we are removing the resources but that call may trigger. If it uses the resources that are being destroyed then it either crashes or tries to recreate the dead resource (which will free a resource twice) leading to crash, or it fails which throws an exception. If the exception manages to out race the collapsing building all is good. If it doesn't then Java is unloaded from memory, then the stack unwinds back into the Java thread which no longer exists leading to (you guessed it) crash.

So here is the crux of the problem. Java really doesn't have a safe way to unmarry itself from Python. If you use any code that links them together other than pure calling one way (Python to Java), then there are race conditions. The only way to avoid it is for either Python to call exit() without shutting down Java or Java calls exit without Python completing shutdown. Nothing in-between works without a race condition of one kind or another. Neither Python nor Java have cooperative ways of terminating all thread that have both Java and Python working together.

If we are going to add an option we need to have the user pick their poison from these three.

  1. Don't shutdown JVM and let Python control the exit.
  2. Attempt to leave Python alive after shutdown (which always has issues).
  3. Shutdown kills the process and never returns to Python.

I am guessing number 1 is the only thing that works for you.

Thrameos avatar Feb 12 '21 13:02 Thrameos

I think it has to be an option set before calling shutdown rather than to shutdown itself as shutdown can be reached either manually or as part of terminate.

Thrameos avatar Feb 12 '21 13:02 Thrameos

I do see this is more complicated. I don't think it will be perfect but keeping the current solution does work in most of the cases. The additional solution would be a failsafe with all the risks left to the user. This failsafe as in option one seems to be what I have done.

I think it has to be an option set before calling shutdown rather than to shutdown itself as shutdown can be reached either manually or as part of terminate.

I thought about that too, e.g. adding a config variable or so. However with the solution in my fork, the old, not so clean way is well hidden: The user has to register its python atexit overloading shutdownJVM() deliberately.

Edit: On second read, I see that you actually suggest skipping any shutdown at all - so not what I suggest in my fork. I do not have enough understanding of java to decide what its best here. I offer my help in implementing any solution but rely on your expertise here

max3-2 avatar Feb 12 '21 15:02 max3-2

Can you look over #937 to see if one of those options deals with this issue?

Thrameos avatar Feb 12 '21 18:02 Thrameos

First of all thanks for the work! I am always positively surprised and impressed with the development effort specifically on python packages!

I will but I guess it will be Monday since it’s evening / weekend here in Europe an I can only access the 3rd party from my work laptop which I left at the office. Maybe @John-Henning has the capabilities to run a quick Test Else we wait :)

max3-2 avatar Feb 12 '21 19:02 max3-2

No urgency on this one. I am trying to hit three birds with one stone on this by exposing the process as much as I can. So until I hear back from a few of them it will be hard to know which options are useful. Ultimately some of those options will likely be dropped. I am also working on a few other possible tweaks as there are still cases which such as race conditions that would be nice to fix (or at least try to mitigate).

Thrameos avatar Feb 12 '21 19:02 Thrameos

Well I managed my VPN to get going so I was able to run a quick test here. I unregistered any custom python atexit methods there were in MPh and run tests on your commit. I was able to confirm the following:

  1. Any method that quits or anytime I used SIGINT to stop python did not leave any traces in memory which I could find afterwards
  2. As expected by @john-henning its the destroy which causes the issue, thus setting this to False works as wanted
  3. Any other flags have no negative impact on quitting as far as i can see from my mediocre debugging
  4. Setting onexit to False overrides the behaviour described in 2) and quits with any setting on the other flags

Maybe something interesting (at least for me at this point): At the point where the python interpreter hangs, killing the java process via kill from the OS will remove java, but python still is locked. Only a SIGINT or killing python itself will free it once it is locked in the destroyJVM process.

Shutting down the JVM and keep on running python is nothing I have tested so far in detail. With very simple scripts, I do not see issued right away (using True everywhere but destroy, as described in 2) ). However I have not worked out a tester for such cases.

All the above is tested on UNIX (macOS10.15).

max3-2 avatar Feb 12 '21 21:02 max3-2

I can confirm that. If we use jpype.config.destroy_jvm = False, that seems to fix our issue. In that all tests pass without any shutdown delays. Details in John-Hennig/MPh#15.

Thanks for all your help on this, Thrameos.

john-hen avatar Feb 12 '21 22:02 john-hen

Can I (humbly) ask whats the progress on this one especially regrading #937 seems to be a valid fix..? While there are workarounds, none of them are very elegant.

max3-2 avatar May 04 '21 18:05 max3-2

I guess we need a 1.3 release. @marscher is this a good time?

Thrameos avatar May 07 '21 16:05 Thrameos

sounds good to me.

marscher avatar May 07 '21 16:05 marscher

I merged in the back log of merge requests. The azure system is giving me issues this morning so the state is less quality that I like, but I can resolve any remaining issues when I start the release as usual. I have one partially complete patch for forward declarations. I am going to put some time in and see if it can be completed for 1.3 else I push it to 1.4.

Thrameos avatar May 07 '21 16:05 Thrameos

Thank you so much for the work!

max3-2 avatar May 10 '21 06:05 max3-2