SyliusResourceBundle Netty resource leak detector too unreliable?

Netty resource leak detector too unreliable?

Open DerGuteMoritz opened this issue 1 year ago • 19 comments

As per this comment, I get the impression that Netty's builtin leak detector is too unreliable to be useful. For example, when introducing an intionally leaking test case like this:

(deftest test-leak-in-raw-stream-handler
  (with-raw-handler basic-handler
    (let [resp @(http-put (str "http://localhost:" port "/string")
                          ;; NOTE: The request handler doesn't consume
                          ;; this body. As per the :raw-stream?
                          ;; contract, this leads to a buffer leak.
                          ;; However, Netty's leak detector in
                          ;; paranoid mode still doesn't pick it up
                          ;; most of the time.
                          {:body "Hello, world!"})]
      (is (= 200 (:status resp)))
      (is (= "String!" (bs/to-string (:body resp)))))))

Then lein test will happily pass without reporting any leaks even when passing -Dio.netty.leakDetection.level=paranoid. Only when repeatedly running the same case (about 10-20x) via a REPL, the detector eventually notices it. It seems that there is some threshold even in paranoid mode which needs to be reached before the detector fires. So far, I was unable to find any indication of that in the docs, though. On the contrary, it states there:

PARANOID - Same with ADVANCED except that it's for every single buffer. Useful for automated testing phase.

Which seems to confirm that we're using it correctly here. Any ideas?

Aug 05 '22 12:08 DerGuteMoritz

It's useful and reliable on PARANOID mode but you will see the traces only when the GC is triggered. Some objects are eligible for being GC but still get a .refcount of 1.

Aug 05 '22 13:08 arnaudgeiser

@arnaudgeiser Ah, right, forgot to mention in the ticket description that even adding an explicit (System/gc) to the end of the test doesn't help either. I also tried -XX:+UseSerialGC assuming that this one is a bit more deterministic but no dice either. Perhaps there is some other parameter one could tweak to make this work a bit more reliably? :thinking:

Aug 05 '22 13:08 DerGuteMoritz

Porting my comment here:

FTR: I noticed this when trying to add a test case with an intentionally throwing handler: AFAICT, in the :raw-stream? true case, such an unhandled exception would always lead to a leak since the default error handler doesn't care about releasing the bytebufs in the request body stream. This is of course an orthogonal problem which I will file a separate issue about. But not quite sure yet how to go about this with the leak detector being so unreliable disappointed Any hints appreciated.

This is unfortunately too late for Aleph to take care of this. When using :raw-stream?, you have to be aware that some of your Netty ByteBuf on your body (manifold.stream) will never reach your application code in case of Exception. That's your responsibility to catch it, close the stream and drain it to apply .release.

This would be a breaking change with huge consequences. As soon as you are using io.netty.buffer.PooledByteBufAllocator (and thus io.netty.buffer.PooledByteBuf) the last call to .release will bring back the ByteBuf to the pool which can be reused by something else. In the worst-case scenario, we might end up calling .release twice on the same ByteBuf where the second call could release something that does not belong to us anymore.

To me, the only thing we can do here is better documentation.

Aug 06 '22 12:08 arnaudgeiser

@arnaudgeiser Ah, right, forgot to mention in the ticket description that even adding an explicit (System/gc) to the end of the test doesn't help either. I also tried -XX:+UseSerialGC assuming that this one is a bit more deterministic but no dice either. Perhaps there is some other parameter one could tweak to make this work a bit more reliably? thinking

Maybe using -XX:MaxDirectMemorySize=<size> with a small number will log more reliably. I would be really interested to know under which condition the leak detector decides to log. ~~On my mind, on PARANOID mode, it happened always when a io.netty.buffers.ByteBuf got GCed, but it seems I was wrong here.~~

Aug 06 '22 12:08 arnaudgeiser

Actually, it might be partially related to my first comment. By default Netty uses a PooledByteBufAllocator which doesn't make the ByteBuf eligible for being GCed as it got recycled.

By using -Dio.netty.allocator.type=unpooled it should be more reliable. And now, if you:

Run your code
Run the GC
Run your code again

You should always see the traces on 3.

Some pointers to dig into the rabbit hole :

[1] : https://github.com/netty/netty/blob/c8c45cfa4c9632578a772ebcad1edfae431b1011/buffer/src/main/java/io/netty/buffer/AbstractByteBufAllocator.java#L50-L59 [2] : https://github.com/netty/netty/blob/c8c45cfa4c9632578a772ebcad1edfae431b1011/common/src/main/java/io/netty/util/ResourceLeakDetector.java#L293-L311

Aug 06 '22 13:08 arnaudgeiser

@arnaudgeiser Thanks for digging into this with me :pray: Your strategy works indeed when running tests from within a REPL. However, I think this is also somewhat accidental: I've pushed a follow-up commit to the one linked in the description which:

Duplicates the leak test case as test-leak-in-raw-stream-handler2
Adds a :leak tag to both of them and a matching Leiningen test selector
Runs (System/gc) at the end of each test case
Adds -Dio.netty.allocator.type=unpooled and -Dio.netty.leakDetection.level=paranoid to the default :jvm-opts

Running lein test :leak still gives no warning although internally, it should do "the same thing" (modulo all NREPL overhead, of course).

I also tried adding -Dio.netty.noPreferDirect=true option to the mix. AFAIUI, this disables direct buffers in favor of heap allocated buffers which should be even closer to the GC. But that makes no difference either. Seems to me like the GC needs to reach some threshold to actually reclaim these buffer objects ...

WDYT?

Aug 08 '22 11:08 DerGuteMoritz

Ouch, just figured it out! The :test profile has this option:

:jvm-opts ["-Dorg.slf4j.simpleLogger.defaultLogLevel=off"]

Which of course also silences the resource leak detector's output :facepalm: Setting it to error makes lein test :leak produce a leak detection indeed :sweat_drops: Progess! If we can figure out how to make it work a single test case run, we could come up with a custom leak detector which we could block on until either all buffers are released or a leak is detected. The latter case could then even make the causing test fail.

Aug 08 '22 12:08 DerGuteMoritz

AFAIK, the only place where the ResourceLeakDetector is instantiated is here [1] I don't know if providing a custom implementation is the way to go. Maybe just using something like Mockito#spy might be enough for what you are trying to achieve. A little bit of reflection to spy+replace the ResourceLeakDetector and a java.util.concurrent.CountDownLatch#await (with a timeout to avoid blocking the test) might do the trick. [2] [3]

After a second thought, it might not be totally lost for Aleph to handle the .release of the Netty ByteBuf in case of unconsumed body. Instead of using raw-stream? true we could consider something like raw-stream? :close-release which should close the body first and then drain the stream to apply release.

WDYT?

[1] : https://github.com/netty/netty/blob/c8c45cfa4c9632578a772ebcad1edfae431b1011/buffer/src/main/java/io/netty/buffer/AbstractByteBuf.java#L69 [2] : https://github.com/mockito/mockito/issues/1089#issuecomment-733787423 [3] : https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/concurrent/CountDownLatch.html#await(long,java.util.concurrent.TimeUnit)

Aug 08 '22 14:08 arnaudgeiser

AFAIK, the only place where the ResourceLeakDetector is instantiated is here [1] I don't know if providing a custom implementation is the way to go. Maybe just using something like Mockito#spy might be enough for what you are trying to achieve. A little bit of reflection to spy+replace the ResourceLeakDetector and a java.util.concurrent.CountDownLatch#await (with a timeout to avoid blocking the test) might do the trick. [2] [3]

Oof, what a :rabbit: :hole: I now know more about the inner workings of ResourceLeakDetector than I would ever have wanted :grimacing: Turns out that it's not that easy: The ResourceLeakDetector instance is informed only about allocation (via track()) and then returns a ResourceLeak instance (via DefaultResourceLeak). This is a WeakReference of the tracked object (i.e. ByteBuf in our case) which can be enriched with callsites along the way (via record()). When the ByteBuf is properly released, it will close the leak instance which in turn removes itself directly from the ResourceLeakDetector's internal allLeaks map. In other words: The ResourceLeakDetector itself doesn't get notified about properly released buffers, only leaked ones, so I don't think Mockito#spy would work here. But since I have little experience with that kind of thing, I'd be interested in seeing how to make this work.

Side note: The leak detector reports detected leaks by polling the ReferenceQueue attached to all WeakReferences it hands out) which it only does when track() is called. In other words: To eagerly trigger the detector, one would have to allocate some tracked buffers. To that end, I've experimented with injecting thousands of allocations followed by System/gc at the end of the leaky test case but to no avail. I am still not sure why this didn't work...

In the meantime, I came up with a terrible, I mean interesting hack. Firstly, I noticed we could just apply the "double test run pattern" via a fixture like so: https://github.com/bevuta/aleph/commit/fc2883de926a86d63a7c917337d191a11c355136

Next, I went down the aforementioned rabbithole and came up with a custom leak detector which just intercepts all reports and makes the tests fail via an assertion induced in the fixture: https://github.com/bevuta/aleph/commit/f85d82cfa1dc420e847d42d02d13de9a1866bab7

Now when running the HTTP tests like so: lein with-profiles +eager-leak-detection test aleph.http-test, they fail with

Ran 27 tests containing 638 assertions.
1 failures, 0 errors.

And a somewhat unreadable assertion error which contains the leak report which would usually be logged.

My head is spinning too much now to assess whether this is a viable approach :sweat_smile: Lemme know what you think.

After a second thought, it might not be totally lost for Aleph to handle the .release of the Netty ByteBuf in case of unconsumed body. Instead of using raw-stream? true we could consider something like raw-stream? :close-release which should close the body first and then drain the stream to apply release.

Ah right, this was the original issue which spawned my unhappiness regarding the leak detector ... Let's discuss this in a separate ticket, though!

Aug 08 '22 17:08 DerGuteMoritz

In other words: The ResourceLeakDetector itself doesn't get notified about properly released buffers, only leaked ones

Definitely, my approach was limited to leak detection.

Next, I went down the aforementioned rabbithole and came up with a custom leak detector which just intercepts all reports and makes the tests fail via an assertion induced in the fixture: https://github.com/bevuta/aleph/commit/f85d82cfa1dc420e847d42d02d13de9a1866bab7

Sounds like you find out the easiest approach there. I wasn't aware of the following property io.netty.customResourceLeakDetector and the fact it would have been so simple to implement ResourceLeakDetector (reportTracedLeak, reportUntracedLeak). Let's try to leverage this if we can, great work!

In the meantime, I came up with a terrible, I mean interesting hack. Firstly, I noticed we could just apply the "double test run pattern" via a fixture like so: https://github.com/bevuta/aleph/commit/fc2883de926a86d63a7c917337d191a11c355136

While perfectly valid, I don't know if it's worth being too smart here. Maybe calling the function twice in the test itself (with the System/gc in between) can be enough.

Aug 08 '22 17:08 arnaudgeiser

After a second thought, it might not be totally lost for Aleph to handle the .release of the Netty ByteBuf in case of unconsumed body. Instead of using raw-stream? true we could consider something like raw-stream? :close-release which should close the body first and then drain the stream to apply release.

Ah right, this was the original issue which spawned my unhappiness regarding the leak detector ... Let's discuss this in a separate ticket, though!

Whether or not there is a follow-up, thanks for digging into this. I don't know if a lot of users are aware of raw-stream? and the fact you can end up having some ByteBuf that won't be release if the body is not consumed.

That's not a bug (nor a feature) but a little bit unexpected. :)

Aug 08 '22 17:08 arnaudgeiser

Finally figured out a reliable approach which doesn't require double test run: https://github.com/clj-commons/aleph/commit/0f40c87cea36ca6337e495f3e86314f3523c806f :tada: Tried my best to explain it in the commit message, hope it makes sense! It only adds about 50s extra runtime per test case on my machine. As you can see, there are still some small TODOs but overall, I think this is approaching PR territory :smile: Let me know what you think!

Aug 08 '22 21:08 DerGuteMoritz

Yes, it makes sense to me, I'm just wondering whether this approach is really that reliable.

It only adds about 50s extra runtime per test case on my machine.

50 seconds per test case? I guess I'm inferring the wrong unit here, right?

What I don't really get is in which tests do we want to apply this approach?

To prove we are leaking memory when using raw-stream?
To prove we are not leaking memory when not using raw-stream?

Aug 09 '22 20:08 arnaudgeiser

Yes, it makes sense to me, I'm just wondering whether this approach is really that reliable.

So far it's the most reliable I was able to come up with: In all my attempts so far it always detects the leak in the intentionally leaky test case (which wasn't even true for the "double run" approach) and it never yielded any false positives either. So it's definitely a vast improvement over the status quo.

That being said, thinking about it again now, there is a theoretical chance for the GC to mark the probe as garbage before marking any leaked buffers from the test run as garbage (even though the probe is always leaked after the run). Placing another System/gc call before leaking the probe should lessen that chance even more but of course, GCs not having to be 100% timely, it might still happen.

It only adds about 50s extra runtime per test case on my machine.

50 seconds per test case? I guess I'm inferring the wrong unit here, right?

Whoops, that should have said 50ms :smile: It had gotten a bit late yesterday.

What I don't really get is in which tests do we want to apply this approach?

To prove we are leaking memory when using raw-stream?

Yes, this was my main motivation to come up with this: I wanted a failing test which demonstrates that we're currently leaking when a raw-stream? handler throws before having processed the request body. If we find a way to handle this gracefully, the test could stick around as a regression test. Otherwise, it would serve to assert that we expect a leak in this case.

To prove we are not leaking memory when not using raw-stream?

Yes, as well, and more generally: that we're not leaking memory somewhere within the bowels of Aleph itself. So it would sense to just enable it for the whole test suite IMHO.

Aug 09 '22 22:08 DerGuteMoritz

So it would sense to just enable it for the whole test suite IMHO.

Or at least those test namespaces which currently already contain (netty/leak-detector-level! :paranoid) at the top.

Aug 09 '22 22:08 DerGuteMoritz

Fine for me, let's see how it goes.

Aug 10 '22 07:08 arnaudgeiser

I have no comment, but I'm eagerly awaiting the future results with a bag of popcorn. 😄

Aug 10 '22 10:08 KingMob

Any updates?

Sep 08 '22 06:09 KingMob

Will get back to it soon, have some more pressing issues to tend to at the moment :smile:

Sep 08 '22 11:09 DerGuteMoritz

(Accidentally closed this issue while commenting)

@DerGuteMoritz : Do you think we can close this issue as it was initially a question?

Jan 27 '23 07:01 arnaudgeiser

SyliusResourceBundle SyliusResourceBundle copied to clipboard

Netty resource leak detector too unreliable?

SyliusResourceBundle
SyliusResourceBundle copied to clipboard