Fix memory leak in pods

arichiardi opened this issue 8 years ago · 50 comments

This was particularly evident when doing boot watch test and has been fixed thanks to Micha's intuition of adding a clojure.core/shutdown-agents in boot.pod/destroy-pod.

Fixes https://github.com/adzerk-oss/boot-test/issues/26

arichiardi avatar Jun 23 '16 22:06 arichiardi

This still doesn't feel right, please don't merge it.

I tracked the call to the Clojure agent pool down to here.

More to come :smile:

arichiardi avatar Jun 26 '16 00:06 arichiardi

The call to shutdown-agents is actually already done in the .close method on the pod, which is a call to this.

Having spent a lot of time profiling for memory leaks with pods (see https://github.com/boot-clj/boot/issues/268), my own experience (once those fixes were made) has been that either your own code or libraries are creating resources that need to be cleaned up. core.async, for example, creates a global executor service that must be shut down, otherwise you'll see a class leak at the very least, since Clojure itself cannot be GC'd.

We use this as our custom destroy-pod function:

(defn destroy-pod
  [^ClojureRuntimeShim pod] ; org.projectodd.shimdandy.ClojureRuntimeShim
  (when pod
    (boot.pod/with-eval-in pod
      ;; core.async installs a global executor service when required; shut it down
      (when (find-ns 'clojure.core.async.impl.exec.threadpool)
        (let [e (resolve 'clojure.core.async.impl.exec.threadpool/the-executor)]
          (.shutdownNow ^java.util.concurrent.ExecutorService (var-get e))))

      (when (find-ns 'clojure.core.async.impl.timers)
        ;; This will cause the thread to go into a terminated state (and print a ThreadDeath error)
        (let [t (resolve 'clojure.core.async.impl.timers/timeout-daemon)]
          (try (.interrupt ^Thread (var-get t))
               (catch Throwable t))))

      ;; release Datomic's resources if it was loaded in this pod
      (when (find-ns 'datomic.api)
        (let [f (var-get (resolve 'datomic.api/shutdown))]
          (f false))))

    (.close pod)

    (.close ^java.net.URLClassLoader (.getClassLoader pod))))

stephenbrady avatar Jun 26 '16 02:06 stephenbrady

Oh, I am using core.async in my app, let me try that!

arichiardi avatar Jun 26 '16 02:06 arichiardi

Same applies to thread pools, I guess... which is maybe why you have the Datomic shutdown...

arichiardi avatar Jun 26 '16 02:06 arichiardi

Yep, any resource at all will do it. It's unfortunate that a decent number of libraries create resources for you by virtue of requiring them and don't properly expose those resources to be managed.

stephenbrady avatar Jun 26 '16 02:06 stephenbrady

Yes, unfortunate. I avoided starting the db and added your destroy fn, but the memory still grows and grows... I am using the HikariCP pool, which I now need to shut down properly, I guess. Btw, thanks for the hint.
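For the record, shutting down the Hikari pool should just be a matter of calling .close on the datasource from the pod destroy fn; roughly this (just a sketch, assuming the pool lives in a hypothetical my.app.db/datasource var):

(boot.pod/with-eval-in pod
  ;; my.app.db/datasource is hypothetical; adjust to wherever the pool actually lives
  (when (find-ns 'my.app.db)
    (when-let [ds (some-> (resolve 'my.app.db/datasource) var-get)]
      ;; HikariDataSource implements Closeable; closing it stops the pool threads
      (.close ^com.zaxxer.hikari.HikariDataSource ds))))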

arichiardi avatar Jun 26 '16 02:06 arichiardi

Sure thing. Good luck. I spent way too many hours chasing this; I feel the pain.

stephenbrady avatar Jun 26 '16 02:06 stephenbrady

It is still very weird that the biggest memory consumption happens in clojure-agent-send-off-pool-1...

arichiardi avatar Jun 26 '16 02:06 arichiardi

Ah well, not weird at all, as we are running inside a future... https://github.com/arichiardi/boot/blob/fix-pod-leak/boot/core/src/boot/core.clj#L945

arichiardi avatar Jun 26 '16 02:06 arichiardi

Yep, the symptom, not the cause.

stephenbrady avatar Jun 26 '16 02:06 stephenbrady

So basically every defstate I have is not an atom, so it holds a reference to whatever it is, right? So every defstate should actually be a nillable atom...

arichiardi avatar Jun 26 '16 02:06 arichiardi

Seems like you're using mount. I'm only somewhat familiar with mount, but I thought it offered an API to terminate the resource held by the defstate var. Otherwise, you can just alter-var-root and set the var to nil (or explicitly close the resource held by the var, if it has a close method)... or do the equivalent with an atom.
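For illustration, the mount route would be to put the release logic in :stop; a minimal sketch with a thread pool standing in for whatever resource you actually hold:

(require '[mount.core :refer [defstate]])

;; any pooled resource works the same way; the :stop form runs on (mount.core/stop),
;; releasing the resource so nothing pins the pod's classloader
(defstate worker-pool
  :start (java.util.concurrent.Executors/newFixedThreadPool 4)
  :stop  (.shutdownNow ^java.util.concurrent.ExecutorService worker-pool))

Without mount, the equivalent is an alter-var-root that closes the resource and sets the var to nil.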

stephenbrady avatar Jun 26 '16 03:06 stephenbrady

Yes, it does provide that. On stop I will nil the atom(s) that I am going to use instead of holding bare references... Of course this is only a leak during recurring testing; potentially all apps with state leak in this way...

arichiardi avatar Jun 26 '16 03:06 arichiardi

This also means that each boot task that wants to be used in a watch should provide a way to hook up a destroy function...

arichiardi avatar Jun 26 '16 04:06 arichiardi

About this, maybe we should enable custom code in destroy-pod, kind of like cleanup but for a specific pod?

arichiardi avatar Jun 29 '16 17:06 arichiardi

Perhaps JMX can be used to find leaks automatically?

The pod-pool function already accepts a :destroy option that can be used to finalize things in the pod, and if you're calling destroy-pod manually you can evaluate any code you like in the pod before you do that, no?
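Something along these lines (just a sketch; the cleanup body is whatever your libraries need):

(require '[boot.core :as core]
         '[boot.pod  :as pod])

(def pool
  (pod/pod-pool
    (core/get-env)
    ;; :destroy receives the pod being retired, so cleanup can run before it goes away
    :destroy (fn [p]
               (pod/with-eval-in p
                 (when (find-ns 'clojure.core.async.impl.exec.threadpool)
                   (.shutdownNow
                     ^java.util.concurrent.ExecutorService
                     (var-get (resolve 'clojure.core.async.impl.exec.threadpool/the-executor))))))))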

micha avatar Jun 29 '16 18:06 micha

Ah, maybe something like a shutdown hook that can be installed in the pod to perform cleanup...destroy-pod would call those before finishing?

micha avatar Jun 29 '16 18:06 micha

Yes, you are right, there are already many hooks, but the problem arises when a task (boot-test) is executing in a pod and I don't have a way to execute custom code at destroy time...

arichiardi avatar Jun 29 '16 18:06 arichiardi

Yeah exactly

arichiardi avatar Jun 29 '16 18:06 arichiardi

makes sense to me

micha avatar Jun 29 '16 18:06 micha

Maybe instead of an object here https://github.com/arichiardi/boot/blob/fix-pod-leak/boot/base/src/main/java/boot/App.java#L326 we pass a (thread-safe) container of functions? I was checking where/when it would be best to add this.

arichiardi avatar Jun 29 '16 18:06 arichiardi

Or an atom in boot.pod that you can conj thunks onto. Then destroy-pod could do something like

(with-eval-in p (doseq [x @boot.pod/destroy-hooks] (x)))
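Fleshed out a bit, just to sketch the idea (none of this exists in boot.pod yet):

;; proposed: an atom in boot.pod holding cleanup thunks
(def destroy-hooks (atom []))

;; a task that loads core.async into a pod registers its cleanup from inside the pod
(with-eval-in p
  (swap! boot.pod/destroy-hooks conj
         (fn []
           (.shutdownNow
             ^java.util.concurrent.ExecutorService
             (var-get (resolve 'clojure.core.async.impl.exec.threadpool/the-executor))))))

;; and destroy-pod drains the hooks before closing the pod
(with-eval-in p
  (doseq [hook @boot.pod/destroy-hooks] (hook)))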

micha avatar Jun 29 '16 18:06 micha

Yes that would work too

arichiardi avatar Jun 29 '16 18:06 arichiardi

Any objection to having a

(defn destroy-pod
  "Closes open resources held by the pod, making the pod eligible for GC."
  [pod]
  (when pod
    (*destroy-pod-hook*) ;; dynamic; the root binding would be a no-op fn
    (.close pod)
    (.. pod getClassLoader close)))

instead?

The caller would need to do (simplified):

(boot
  (binding [boot.pod/*destroy-pod-hook* my-cleaning-fn]
    (boot-test/test)))

arichiardi avatar Jun 29 '16 20:06 arichiardi

The latter also means that all the code inside the binding will "see" the destroy function, i.e. all the pods created inside the binding will call that function when destroyed. This is not surprising behavior; on the contrary, giving super fine-grained control (i.e. one custom destroy hook per pod) is not really a common use case imho.

arichiardi avatar Jun 29 '16 20:06 arichiardi

Possibly... although it seems like there is a more elegant solution somewhere, maybe?

Another option to consider is reaping defunct pods out-of-band. A separate thread could periodically iterate over the pods in boot.pod/pods and, if they've been destroyed, perform actions on them to force them to die.

I would really prefer some way to use reflection or JMX debugging APIs or something like that to find threads that need to be shut down, automatically, without the need for hooks.
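Something like enumerating the live threads and keeping the ones whose context classloader is the pod's could be a start (just a sketch; the context classloader check alone probably isn't bulletproof):

(defn pod-threads
  "Best-effort list of live threads belonging to the given pod, judged
  by their context classloader."
  [pod]
  (let [cl (.getClassLoader pod)]
    (filter #(identical? cl (.getContextClassLoader ^Thread %))
            (.keySet (Thread/getAllStackTraces)))))

;; destroy-pod could then interrupt whatever the hooks missed:
;; (doseq [^Thread t (pod-threads pod)] (.interrupt t))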

I would also like to find out which things can prevent a pod from being eligible for GC.

micha avatar Jun 29 '16 20:06 micha

Regarding the *hooks* vs. pod/hooks, it might be a "bothism"; we might want both. The *hooks* thing is good when you need to inject finalization procedures into code that you don't control (like your boot-test/test example). The pod/hooks approach, on the other hand, allows the code that creates the pod to also install the finalization code, which isn't really feasible with the dynamic vars way.

micha avatar Jun 29 '16 20:06 micha

Good points in both your comments. I was already working on this, but then stopped. In the particular problem I am facing, I cannot know which pod boot.test is creating, and unless boot.test/test itself provides a hook in its params, I cannot execute any cleanup code.

I could create a task that does what you said above, though: I am in a watch, and I could scan the pods, see the ones that have been destroyed (how to identify them?), and just remove their references / step heavily on them so that they really die :smile:

arichiardi avatar Jun 29 '16 20:06 arichiardi

Also, an atom probably would not solve my problem because, again, how can I know which pod boot.test/test used?

arichiardi avatar Jun 29 '16 20:06 arichiardi

Yes, that's why maybe we need both. I mean if I make a task that creates a pod and my task is loading core.async in the pod, then my task should also ensure it gets cleaned up. How would that work with the dynamic var?

micha avatar Jun 29 '16 20:06 micha