Netty resource leak detector too unreliable?
As per this comment, I get the impression that Netty's built-in leak detector is too unreliable to be useful. For example, when introducing an intentionally leaking test case like this:
```clojure
(deftest test-leak-in-raw-stream-handler
  (with-raw-handler basic-handler
    (let [resp @(http-put (str "http://localhost:" port "/string")
                          ;; NOTE: The request handler doesn't consume
                          ;; this body. As per the :raw-stream?
                          ;; contract, this leads to a buffer leak.
                          ;; However, Netty's leak detector in
                          ;; paranoid mode still doesn't pick it up
                          ;; most of the time.
                          {:body "Hello, world!"})]
      (is (= 200 (:status resp)))
      (is (= "String!" (bs/to-string (:body resp)))))))
```
Then `lein test` will happily pass without reporting any leaks, even when passing `-Dio.netty.leakDetection.level=paranoid`. Only when repeatedly running the same case (about 10-20x) via a REPL does the detector eventually notice it. It seems that there is some threshold, even in paranoid mode, which needs to be reached before the detector fires. So far, I was unable to find any indication of that in the docs, though. On the contrary, they state:
> PARANOID - Same with ADVANCED except that it's for every single buffer. Useful for automated testing phase.
Which seems to confirm that we're using it correctly here. Any ideas?
It's useful and reliable in `PARANOID` mode, but you will only see the traces when the GC is triggered. Some objects are eligible for GC but still have a `.refCnt` of 1.
@arnaudgeiser Ah, right, I forgot to mention in the ticket description that even adding an explicit `(System/gc)` at the end of the test doesn't help. I also tried `-XX:+UseSerialGC`, assuming that this one is a bit more deterministic, but no dice either. Perhaps there is some other parameter one could tweak to make this work a bit more reliably? :thinking:
Porting my comment here:
FTR: I noticed this when trying to add a test case with an intentionally throwing handler: AFAICT, in the `:raw-stream? true` case, such an unhandled exception would always lead to a leak, since the default error handler doesn't take care of releasing the ByteBufs in the request body stream. This is of course an orthogonal problem which I will file a separate issue about. But I'm not quite sure yet how to go about it with the leak detector being so unreliable :disappointed: Any hints appreciated.
Unfortunately, it's too late for Aleph to take care of this. When using `:raw-stream?`, you have to be aware that some of the Netty ByteBufs on your body (a `manifold.stream`) will never reach your application code in case of an exception. It's your responsibility to catch it, close the stream, and drain it to apply `.release`. Changing this would be a breaking change with huge consequences.
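For illustration, a minimal sketch of what that responsibility could look like on the application side. The wrapper name is made up, and it assumes the `:raw-stream?` body is a manifold stream of `ByteBuf`s:

```clojure
(require '[manifold.stream :as s])
(import '(io.netty.util ReferenceCountUtil))

;; Hypothetical middleware: if the wrapped :raw-stream? handler throws,
;; close the body stream and drain any already-buffered ByteBufs so
;; each one still gets its .release call.
(defn wrap-release-on-error [handler]
  (fn [{:keys [body] :as req}]
    (try
      (handler req)
      (catch Throwable e
        (when (s/stream? body)
          (s/close! body)
          (s/consume #(ReferenceCountUtil/release %) body))
        (throw e)))))
```

`ReferenceCountUtil/release` is used here so that non-`ByteBuf` messages are tolerated as well; draining after `close!` consumes whatever was already buffered.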
As soon as you are using `io.netty.buffer.PooledByteBufAllocator` (and thus `io.netty.buffer.PooledByteBuf`), the last call to `.release` returns the ByteBuf to the pool, where it can be reused by something else. In the worst-case scenario, we might end up calling `.release` twice on the same ByteBuf, where the second call could release something that no longer belongs to us.
To me, the only thing we can do here is better documentation.
> @arnaudgeiser Ah, right, I forgot to mention in the ticket description that even adding an explicit `(System/gc)` at the end of the test doesn't help. I also tried `-XX:+UseSerialGC`, assuming that this one is a bit more deterministic, but no dice either. Perhaps there is some other parameter one could tweak to make this work a bit more reliably? :thinking:
Maybe using `-XX:MaxDirectMemorySize=<size>` with a small value will make it log more reliably. I would be really interested to know under which conditions the leak detector decides to log.
~~To my mind, in `PARANOID` mode it would always happen when an `io.netty.buffer.ByteBuf` got GCed, but it seems I was wrong here.~~
Actually, it might be partially related to my first comment. By default, Netty uses a `PooledByteBufAllocator`, which doesn't make the ByteBufs eligible for GC since they get recycled. By using `-Dio.netty.allocator.type=unpooled`, it should be more reliable.
And now, if you:

1. Run your code
2. Run the GC
3. Run your code again

you should always see the traces on step 3.
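Sketched as a tiny helper (assuming `-Dio.netty.allocator.type=unpooled` and paranoid leak detection are already set via JVM options; the function name is made up):

```clojure
;; Run → GC → run again: the second run's tracked allocations give the
;; leak detector a chance to notice the now-collected ByteBufs leaked
;; by the first run and log their traces.
(defn provoke-leak-report [leaky-fn]
  (leaky-fn)   ;; 1. run the code that (potentially) leaks
  (System/gc)  ;; 2. let the leaked ByteBufs become collectable
  (leaky-fn))  ;; 3. run again; traces should show up here
```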
Some pointers to dig into the rabbit hole:

[1] : https://github.com/netty/netty/blob/c8c45cfa4c9632578a772ebcad1edfae431b1011/buffer/src/main/java/io/netty/buffer/AbstractByteBufAllocator.java#L50-L59
[2] : https://github.com/netty/netty/blob/c8c45cfa4c9632578a772ebcad1edfae431b1011/common/src/main/java/io/netty/util/ResourceLeakDetector.java#L293-L311
@arnaudgeiser Thanks for digging into this with me :pray: Your strategy does indeed work when running the tests from within a REPL. However, I think this is also somewhat accidental: I've pushed a follow-up commit to the one linked in the description which:
- Duplicates the leak test case as `test-leak-in-raw-stream-handler2`
- Adds a `:leak` tag to both of them and a matching Leiningen test selector
- Runs `(System/gc)` at the end of each test case
- Adds `-Dio.netty.allocator.type=unpooled` and `-Dio.netty.leakDetection.level=paranoid` to the default `:jvm-opts`
Running `lein test :leak` still gives no warning, although internally it should do "the same thing" (modulo all the nREPL overhead, of course).
I also tried adding the `-Dio.netty.noPreferDirect=true` option to the mix. AFAIUI, this disables direct buffers in favor of heap-allocated buffers, which should be even closer to the GC. But that makes no difference either. It seems to me that the GC needs to reach some threshold before it actually reclaims these buffer objects ...

WDYT?
Ouch, just figured it out! The `:test` profile has this option:

```clojure
:jvm-opts ["-Dorg.slf4j.simpleLogger.defaultLogLevel=off"]
```

Which of course also silences the resource leak detector's output :facepalm: Setting it to `error` makes `lein test :leak` indeed produce a leak detection :sweat_drops: Progress! If we can figure out how to make it work in a single test case run, we could come up with a custom leak detector which we could block on until either all buffers are released or a leak is detected. The latter case could then even make the causing test fail.
AFAIK, the only place where the `ResourceLeakDetector` is instantiated is here [1]. I don't know if providing a custom implementation is the way to go. Maybe just using something like `Mockito#spy` might be enough for what you are trying to achieve. A little bit of reflection to spy+replace the `ResourceLeakDetector` and a `java.util.concurrent.CountDownLatch#await` (with a timeout to avoid blocking the test) might do the trick. [2] [3]
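A rough sketch of the latch part of that idea (names are hypothetical, and the wiring of the spied/replaced detector is left out):

```clojure
(import '(java.util.concurrent CountDownLatch TimeUnit))

;; Latch the (spied/replaced) ResourceLeakDetector would count down
;; whenever it reports a leak.
(def leak-latch (CountDownLatch. 1))

;; Block until a leak is reported, giving up after 5 seconds so the
;; test can't hang forever; returns true if the latch reached zero.
(defn await-leak-report []
  (.await leak-latch 5 TimeUnit/SECONDS))
```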
On second thought, it might not be totally lost for Aleph to handle the `.release` of the Netty ByteBufs in case of an unconsumed body. Instead of using `raw-stream? true`, we could consider something like `raw-stream? :close-release`, which would close the body first and then drain the stream to apply `release`.

WDYT?
[1] : https://github.com/netty/netty/blob/c8c45cfa4c9632578a772ebcad1edfae431b1011/buffer/src/main/java/io/netty/buffer/AbstractByteBuf.java#L69
[2] : https://github.com/mockito/mockito/issues/1089#issuecomment-733787423
[3] : https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/concurrent/CountDownLatch.html#await(long,java.util.concurrent.TimeUnit)
> AFAIK, the only place where the `ResourceLeakDetector` is instantiated is here [1]. I don't know if providing a custom implementation is the way to go. Maybe just using something like `Mockito#spy` might be enough for what you are trying to achieve. A little bit of reflection to spy+replace the `ResourceLeakDetector` and a `java.util.concurrent.CountDownLatch#await` (with a timeout to avoid blocking the test) might do the trick. [2] [3]
Oof, what a :rabbit: :hole: I now know more about the inner workings of `ResourceLeakDetector` than I would ever have wanted :grimacing: Turns out that it's not that easy: The `ResourceLeakDetector` instance is informed only about allocations (via `track()`) and then returns a `ResourceLeak` instance (a `DefaultResourceLeak`). This is a `WeakReference` to the tracked object (i.e. the `ByteBuf` in our case) which can be enriched with call sites along the way (via `record()`). When the `ByteBuf` is properly released, it closes the leak instance, which in turn removes itself directly from the `ResourceLeakDetector`'s internal `allLeaks` map. In other words: The `ResourceLeakDetector` itself doesn't get notified about properly released buffers, only leaked ones, so I don't think `Mockito#spy` would work here. But since I have little experience with that kind of thing, I'd be interested in seeing how to make it work.
Side note: The leak detector reports detected leaks by polling the `ReferenceQueue` attached to all the `WeakReference`s it hands out, which it only does when `track()` is called. In other words: To eagerly trigger the detector, one would have to allocate some tracked buffers. To that end, I've experimented with injecting thousands of allocations followed by a `(System/gc)` at the end of the leaky test case, but to no avail. I'm still not sure why this didn't work ...
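That attempt looked roughly like this (a sketch; in paranoid mode every allocation is tracked, so the immediately-released throwaway buffers should each trigger a poll of the `ReferenceQueue`):

```clojure
(import '(io.netty.buffer UnpooledByteBufAllocator))

;; Inject a batch of tracked allocations after a GC: each allocation
;; goes through track(), which polls the detector's ReferenceQueue and
;; should report pending leaks. In practice this didn't surface them.
(defn poke-leak-detector []
  (System/gc)
  (dotimes [_ 1000]
    (.release (.heapBuffer UnpooledByteBufAllocator/DEFAULT))))
```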
In the meantime, I came up with a terrible, I mean interesting, hack. Firstly, I noticed we could just apply the "double test run" pattern via a fixture like so: https://github.com/bevuta/aleph/commit/fc2883de926a86d63a7c917337d191a11c355136

Next, I went down the aforementioned rabbit hole and came up with a custom leak detector which simply intercepts all reports and makes the tests fail via an assertion raised in the fixture: https://github.com/bevuta/aleph/commit/f85d82cfa1dc420e847d42d02d13de9a1866bab7
Now when running the HTTP tests like so: `lein with-profiles +eager-leak-detection test aleph.http-test`, they fail with

```
Ran 27 tests containing 638 assertions.
1 failures, 0 errors.
```

and a somewhat unreadable assertion error which contains the leak report that would usually be logged.
My head is spinning too much now to assess whether this is a viable approach :sweat_smile: Lemme know what you think.
> On second thought, it might not be totally lost for Aleph to handle the `.release` of the Netty ByteBufs in case of an unconsumed body. Instead of using `raw-stream? true`, we could consider something like `raw-stream? :close-release`, which would close the body first and then drain the stream to apply `release`.
Ah right, this was the original issue which spawned my unhappiness regarding the leak detector ... Let's discuss this in a separate ticket, though!
> In other words: The `ResourceLeakDetector` itself doesn't get notified about properly released buffers, only leaked ones
Definitely, my approach was limited to leak detection.
> Next, I went down the aforementioned rabbit hole and came up with a custom leak detector which simply intercepts all reports and makes the tests fail via an assertion raised in the fixture: https://github.com/bevuta/aleph/commit/f85d82cfa1dc420e847d42d02d13de9a1866bab7
Sounds like you found the easiest approach there. I wasn't aware of the `io.netty.customResourceLeakDetector` property, nor of the fact that it would have been so simple to implement a `ResourceLeakDetector` (`reportTracedLeak`, `reportUntracedLeak`). Let's try to leverage this if we can, great work!
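For reference, a heavily hedged sketch of how such a detector could be wired up from Clojure (the namespace and class names are made up; the linked commit is the authoritative version). Netty instantiates the class named in `io.netty.customResourceLeakDetector` reflectively, so it needs a constructor matching what the factory looks for:

```clojure
(ns aleph.test.leak-detector
  (:gen-class
    :name aleph.test.CollectingLeakDetector
    :extends io.netty.util.ResourceLeakDetector
    ;; Netty's DefaultResourceLeakDetectorFactory looks for a
    ;; (Class, int) or (Class, int, long) constructor
    :constructors {[Class int] [Class int]}))

;; Collect leak reports instead of logging them, so a test fixture
;; can assert on them afterwards.
(defonce reports (atom []))

(defn -reportTracedLeak [_this resource-type records]
  (swap! reports conj (str resource-type \newline records)))

(defn -reportUntracedLeak [_this resource-type]
  (swap! reports conj resource-type))
```

The class would then be activated with `-Dio.netty.customResourceLeakDetector=aleph.test.CollectingLeakDetector` in the test profile's `:jvm-opts`.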
> In the meantime, I came up with a terrible, I mean interesting, hack. Firstly, I noticed we could just apply the "double test run" pattern via a fixture like so: https://github.com/bevuta/aleph/commit/fc2883de926a86d63a7c917337d191a11c355136
While perfectly valid, I don't know if it's worth being too smart here. Maybe calling the function twice in the test itself (with the `(System/gc)` in between) would be enough.
> On second thought, it might not be totally lost for Aleph to handle the `.release` of the Netty ByteBufs in case of an unconsumed body. Instead of using `raw-stream? true`, we could consider something like `raw-stream? :close-release`, which would close the body first and then drain the stream to apply `release`.
>
> Ah right, this was the original issue which spawned my unhappiness regarding the leak detector ... Let's discuss this in a separate ticket, though!
Whether or not there is a follow-up, thanks for digging into this. I don't know if a lot of users are aware of `raw-stream?` and the fact that you can end up with some ByteBufs that won't be released if the body is not consumed. That's not a bug (nor a feature), but it's a little bit unexpected. :)
Finally figured out a reliable approach which doesn't require double test run: https://github.com/clj-commons/aleph/commit/0f40c87cea36ca6337e495f3e86314f3523c806f :tada: Tried my best to explain it in the commit message, hope it makes sense! It only adds about 50s extra runtime per test case on my machine. As you can see, there are still some small TODOs but overall, I think this is approaching PR territory :smile: Let me know what you think!
Yes, it makes sense to me, I'm just wondering whether this approach is really that reliable.
> It only adds about 50s extra runtime per test case on my machine.

50 seconds per test case? I guess I'm inferring the wrong unit here, right?
What I don't really get is in which tests we want to apply this approach:

- To prove we are leaking memory when using `raw-stream?`
- To prove we are not leaking memory when not using `raw-stream?`
> Yes, it makes sense to me, I'm just wondering whether this approach is really that reliable.
So far, it's the most reliable approach I was able to come up with: In all my attempts so far, it always detects the leak in the intentionally leaky test case (which wasn't even true for the "double run" approach), and it has never yielded any false positives either. So it's definitely a vast improvement over the status quo.

That being said, thinking about it again now, there is a theoretical chance for the GC to mark the probe as garbage before marking any leaked buffers from the test run as garbage (even though the probe is always leaked after the run). Placing another `(System/gc)` call before leaking the probe should lessen that chance even more, but of course, GCs not having to be 100% timely, it might still happen.
> > It only adds about 50s extra runtime per test case on my machine.
>
> 50 seconds per test case? I guess I'm inferring the wrong unit here, right?
Whoops, that should have said 50ms :smile: It had gotten a bit late yesterday.
> What I don't really get is in which tests we want to apply this approach:
>
> - To prove we are leaking memory when using `raw-stream?`
Yes, this was my main motivation to come up with this: I wanted a failing test which demonstrates that we're currently leaking when a `raw-stream?` handler throws before having processed the request body. If we find a way to handle this gracefully, the test could stick around as a regression test. Otherwise, it would serve to assert that we expect a leak in this case.
> - To prove we are not leaking memory when not using `raw-stream?`
Yes, as well, and more generally: that we're not leaking memory somewhere within the bowels of Aleph itself. So it would make sense to just enable it for the whole test suite IMHO.
> So it would make sense to just enable it for the whole test suite IMHO.
Or at least those test namespaces which currently already contain (netty/leak-detector-level! :paranoid)
at the top.
Fine for me, let's see how it goes.
I have no comment, but I'm eagerly awaiting the future results with a bag of popcorn. 😄
Any updates?
Will get back to it soon, have some more pressing issues to tend to at the moment :smile:
(Accidentally closed this issue while commenting)
@DerGuteMoritz : Do you think we can close this issue as it was initially a question?