
libpython-clj occasionally blocks when using the Python simpletransformers library

Open behrica opened this issue 3 years ago • 35 comments

It is a kind of follow-up to #194.

Similar code, same Python library, "simpletransformers".

But since I see a deadlock rather than a crash, I think it is better to open a new issue. Maybe they are related.

I have made a minimal reproducible scenario here, including the data.

https://github.com/behrica/libpython-clj-219

behrica avatar Sep 25 '22 09:09 behrica

I would say it hangs in about 1 out of 10 runs.

I think I can increase the likelihood of hanging by enabling the model's progress printing: https://github.com/behrica/libpython-clj-219/blob/65fcfb50fe1368b165e8fa8a1903f404b8271989/train.clj#L27

behrica avatar Sep 25 '22 09:09 behrica

Not sure whether using CUDA or not (https://github.com/behrica/libpython-clj-219/blob/65fcfb50fe1368b165e8fa8a1903f404b8271989/train.clj#L26) makes a difference; I saw the issue both with and without GPU usage.

behrica avatar Sep 25 '22 09:09 behrica

I did not have the impression that changing the "amount" of training data had an influence. https://github.com/behrica/libpython-clj-219/blob/65fcfb50fe1368b165e8fa8a1903f404b8271989/train.clj#L19

It never hangs in "the middle of the training" but at the "beginning", it seems.

behrica avatar Sep 25 '22 09:09 behrica

Running run.sh should reproduce the deadlock reliably, I think: https://github.com/behrica/libpython-clj-219/blob/main/run.sh

behrica avatar Sep 25 '22 09:09 behrica

I added a script which counts how often the invocation hangs. It uses the Linux timeout command and logs the exit codes to a file. Exit code 0 means the training worked; exit code 137 means it timed out after 60 seconds (for me it always finishes in around 30 s when it does not hang).

https://github.com/behrica/libpython-clj-219/blob/main/countHang.sh
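
For readers who don't want to open the repo, the idea of such a counting script can be sketched in Python (hypothetical re-sketch, not the actual countHang.sh; the command and numbers are illustrative):

```python
import subprocess

def count_hangs(cmd, runs=10, timeout_s=60):
    """Run `cmd` repeatedly and count runs that hang or fail."""
    fails = 0
    for i in range(runs):
        try:
            # subprocess.run kills the child when the timeout expires
            rc = subprocess.run(cmd, timeout=timeout_s).returncode
        except subprocess.TimeoutExpired:
            # the shell script logs 137 here: 128 + SIGKILL(9) from `timeout -s KILL`
            rc = 137
        if rc != 0:
            fails += 1
        print(f"run {i}: exit={rc}")
    return fails

# e.g. count_hangs(["./run.sh"]) against the reproduction repo
```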

behrica avatar Sep 25 '22 10:09 behrica

With that I typically see 1 out of 10 runs failing.

behrica avatar Sep 25 '22 10:09 behrica

The issue is far less frequent when I disable "io-redirect" in libpython-clj: with it enabled, 16 / 100 runs failed; with it disabled, 1 / 100 failed.

I always suspected that the issue is somehow related to whether and how the Python processes print to the console.

behrica avatar Sep 25 '22 12:09 behrica

If torch is using the multithreading module, that will definitely cause it. We know that in embedded mode we see those error messages from it…

cnuernber avatar Sep 25 '22 13:09 cnuernber

But is there anything we can do about it? Is the io-redirect maybe "too simple"? I saw some blog posts about redirecting stdout from C extensions, and they are more involved than what we do. https://eli.thegreenplace.net/2015/redirecting-all-kinds-of-stdout-in-python/
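
For context, the approach in that post works at the file-descriptor level rather than swapping the sys.stdout object, so writes coming from C code are captured too. A minimal sketch of the technique (hypothetical illustration, not libpython-clj's actual implementation):

```python
import os
import sys

def capture_fd_stdout(fn):
    """Run fn() while fd 1 itself points at a pipe, so output from
    C extensions (which bypass sys.stdout) is captured as well."""
    saved_fd = os.dup(1)          # keep a handle to the real stdout
    r, w = os.pipe()
    try:
        os.dup2(w, 1)             # fd 1 now points at the pipe
        fn()
        sys.stdout.flush()        # push any Python-level buffering to fd 1
    finally:
        os.close(w)
        os.dup2(saved_fd, 1)      # restore the real stdout
        os.close(saved_fd)
    data = os.read(r, 65536)
    os.close(r)
    return data

out = capture_fd_stdout(lambda: os.write(1, b"from C level\n"))
# out == b"from C level\n"
```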

behrica avatar Sep 25 '22 14:09 behrica

I think redirecting stdout is a red herring; turning off torch multithreading, I think, is key.
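
For anyone following along, limiting threading in a torch-based run can be sketched like this (these are standard OpenMP/MKL environment knobs and the torch API, not libpython-clj settings; the env vars must be set before torch is imported):

```python
import os

# Must be set before `import torch` to take effect.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

# If torch is installed, its intra-op thread count can also be pinned:
# import torch
# torch.set_num_threads(1)
```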

cnuernber avatar Sep 25 '22 20:09 cnuernber

If we want to be bold we could perhaps provide our own implementation of the threading module.

cnuernber avatar Sep 25 '22 20:09 cnuernber

Actually, in embedded mode with manual gil locking you should not see any issues like this. The multithreading module works in embedded mode. And you are right, the io redirection we do is at the Python library level, not the C level. But again, I don't think that is why things are hanging, although it could be if a buffer is getting full and never flushed.

Remember that in embedded mode you don't need to acquire/release the GIL at all. The Python layer above you has done all of this, so really, in embedded mode the manual gil pathways should be the norm.

cnuernber avatar Sep 25 '22 20:09 cnuernber

Ok, I'll try that.

I disabled torch multithreading as well; it seems to cause issues even in some Python environments, like Colab.

What is the proper way to use the manual gil pathway from Clojure? Calling the java_api, or just setting the property to enable it?

behrica avatar Sep 25 '22 20:09 behrica

From pure Clojure: you have to have the manual_gil property set when the python.ffi namespace is first compiled. The only reliable way I found is to start the VM with the property enabled. Then just use the ffi namespace functions instead of the java api functions. The initialization is the same either way.

Be sure to check the variable ffi/manual-gil and make sure it is true. That tells you how the namespace was compiled.

cnuernber avatar Sep 25 '22 20:09 cnuernber

Ok, I am not sure if I'm doing the right thing. I am not in a REPL situation; I load my Clojure file like this now:

export _JAVA_OPTIONS="-Dlibpython_clj.manual_gil=true"
python -u -c 'from clojurebridge import cljbridge;cljbridge.load_clojure_file(clj_file="src/train.clj",mvn_local_repo="/home/carsten/.m2/repository")'

The train.clj looks now like this:

(ns train
  (:require
   [libpython-clj2.python.ffi :as ffi]
   [libpython-clj2.embedded :as embedded]))


(println "-------- manual-gil: " ffi/manual-gil)     ;; prints "true"
(embedded/initialize!)
(ffi/lock-gil)

;;do my training  with simpletransformer
;; disable all multiprocessor as much as I can via model_args


(ffi/unlock-gil)
(println "training finished")

The training has started, let's see.

behrica avatar Sep 25 '22 21:09 behrica

You do not need the lock/unlock pairs. When embedded the enclosing python already owns the gil.

cnuernber avatar Sep 25 '22 21:09 cnuernber

Hmm. Without the (ffi/lock-gil) I get a crash:

WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x0000555f60e8bcd7, pid=85946, tid=85946
#
# JRE version: OpenJDK Runtime Environment (11.0.16.1+1) (build 11.0.16.1+1)
# Java VM: OpenJDK 64-Bit Server VM (11.0.16.1+1, mixed mode, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C  [python+0x12dcd7]  PyEval_GetGlobals+0x7
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/lib/systemd/systemd-coredump %P %u %g %s %t %c %h" (or dumping to /home/carsten/Dropbox/sources/predict_disaster_tweet/core.85946)
#
# An error report file with more information is saved as:
# /home/carsten/Dropbox/sources/predict_disaster_tweet/hs_err_pid85946.log
#
# If you would like to submit a bug report, please visit:
#   https://bugreport.java.com/bugreport/crash.jsp

behrica avatar Sep 25 '22 22:09 behrica

If I leave it in, training starts and finishes, but the JVM process never ends. It hangs forever on (ffi/unlock-gil).

behrica avatar Sep 25 '22 22:09 behrica

ffi/unlock-gil takes the return value of ffi/lock-gil.

cnuernber avatar Sep 25 '22 22:09 cnuernber

The crash means that lock-gil must be necessary. That is really confusing -- does the code in clojurebridge unlock the gil?

cnuernber avatar Sep 25 '22 22:09 cnuernber

It also means our exception handling is poor as exceptions escaping the context are quietly swallowed.

cnuernber avatar Sep 25 '22 22:09 cnuernber

My call to "initialize" created the problem (as it was already initialized in clojurebridge). But I do need to lock / unlock; then it works.

(ns train
  (:require
   [libpython-clj2.python.ffi :as ffi]
   [libpython-clj2.embedded :as embedded]))


(println "-------- manual-gil: " ffi/manual-gil)     ;; prints "true"

(def x (ffi/lock-gil))

;;do my training  with simpletransformer
;; disable all multiprocessor as much as I can via model_args


(ffi/unlock-gil x)
(println "training finished")

behrica avatar Sep 25 '22 22:09 behrica

????

If I write (ffi/unlock-gil) without the old state, there is no compile error, but it hangs... How can that be? The function has arity 1...

behrica avatar Sep 25 '22 23:09 behrica

I only get the exception in the REPL, not when executing the clj file:

 Unhandled clojure.lang.ArityException
   Wrong number of args (0) passed to: libpython-clj2.python.ffi/unlock-gil

                  AFn.java:  429  clojure.lang.AFn/throwArity
                  AFn.java:   28  clojure.lang.AFn/invoke
                      REPL:   94  train/eval35092
                      REPL:   94  train/eval35092
             Compiler.java: 7181  clojure.lang.Compiler/eval
             Compiler.java: 7136  clojure.lang.Compiler/eval
                  core.clj: 3202  clojure.core/eval
                  core.clj: 3198  clojure.core/eval
    interruptible_eval.clj:   87  nrepl.middleware.interruptible-eval/evaluate/fn/fn
                  AFn.java:  152  clojure.lang.AFn/applyToHelper
                  AFn.java:  144  clojure.lang.AFn/applyTo
                  core.clj:  667  clojure.core/apply

behrica avatar Sep 25 '22 23:09 behrica

Right - we need to fix that, but that is a separate issue. The exception somehow escapes the clojurebridge context and isn't reported.
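
A minimal Python sketch of that failure mode (hypothetical illustration, not the actual libpython-clj internals): if an exception fires between acquiring and releasing a lock and is silently swallowed, the lock stays held and the next acquire blocks forever.

```python
import threading

lock = threading.Lock()   # stand-in for the GIL
lock.acquire()
try:
    raise TypeError("wrong number of args")   # like the ArityException
except TypeError:
    pass   # swallowed: reported nowhere, and lock.release() is never reached

# The next acquirer now blocks; with a timeout we can observe the "hang":
got_it = lock.acquire(timeout=0.1)
print(got_it)   # False
```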

cnuernber avatar Sep 25 '22 23:09 cnuernber

Yes... it seems so.

behrica avatar Sep 25 '22 23:09 behrica

Ok, my training now works with a small model and little data; I will redo it with the large model and the full data.

behrica avatar Sep 25 '22 23:09 behrica

I can confirm the stability for larger data as well.

behrica avatar Sep 26 '22 07:09 behrica

I adapted the code here: https://github.com/behrica/libpython-clj-219

to use "embedded" mode, and it indeed runs stably: 0 / 100 failures.

I will repeat it with more data and a larger model, but it looks promising already.

behrica avatar Sep 26 '22 07:09 behrica

I opened 2 issues in clojurebridge which should improve the experience of using "embedded libpython-clj with manual GIL management": https://github.com/behrica/clojurebridge/issues/4 https://github.com/behrica/clojurebridge/issues/2

behrica avatar Sep 26 '22 09:09 behrica