pulsar icon indicating copy to clipboard operation
pulsar copied to clipboard

[improve][misc][WIP] Detect "double release" and "use after release" bugs with recycled objects

Open lhotari opened this issue 1 year ago • 9 comments

Motivation

In Pulsar, users have reported issues that could be caused by "double release" or "use after release" bugs with recycled objects.

Here's are some issues that could potentially be caused by "double release" or "use after release" bugs: #22035 #21892

Other example of such potential issue: #21421/#21933. It is possible that these issues are fixed by https://github.com/apache/bookkeeper/pull/4196. The other root cause could be a "double release" or "use after release" bug which is corrupting the buffer and causing checksum calculation to fail.

Outside of Java, there's a known bug pattern called "double-free" or "Doubly freeing memory" (with malloc). The "double release" bug pattern is a bit similar, but happens with the recycled object pattern using Netty's Recycler that Pulsar uses because of performance reasons.

There is also a "use after free" bug pattern. Something similar could be happen with Netty recycled objects that the object instance gets used after releasing.

The solution in this PR attempts to help detect "double release" and "use after release" bugs.

Modifications

  • get rid of any .setRefCnt(1) calls in production code since that could hide real issues
  • add additional detection feature that can be disabled by setting the system property -Dpulsar.refcount.check.on_access=false.
    • the refcount check will add a volatile read. This isn't a performance problem in most use cases. For production usage we could add -Dpulsar.refcount.check.on_access=false into the bin/pulsar script by default if we are afraid of the performance overhead. For all tests, we should be running with checks enabled so that we could find the source of problems.

Documentation

  • [ ] doc
  • [ ] doc-required
  • [x] doc-not-needed
  • [ ] doc-complete

lhotari avatar Feb 23 '24 16:02 lhotari

After making the changes, there are a lot of unit test failures. I haven't had a chance to look into the details. This PR is still a very early proposal about how to start detecting "double release" and "use after release" bugs.

lhotari avatar Feb 23 '24 20:02 lhotari

The recycled objects are widely used in Pulsar, not only for EntryImpl. The 1st concern is that should we apply checks to all these places? For example, the client side could also use recycled objects.

@BewareMyPower Yes that's a good point. It looks like the AbstractCASReferenceCounted/AbstractReferenceCounted base class prevents a lot of the problems since there would be exceptions at some point if there would be execution paths where "double release" or "use after release" bugs existed.

The 2nd concern is, I'm afraid currently Pulsar allows a recycled object is accessed with a "null check". We need to investigate such cases.

Yes, those are cases where the problem is getting hidden.

It would be great to find a better way to track down the issues. That's why I started this PR which is more like an experiment to find some way that would work for detecting "double release" and "use after release" bugs and also raise awareness of such bug patterns with the recycled objects. These are bug patterns that most Java developers have never had to deal with because of Java's garbage collection. With recycled objects and Netty ByteBufs, that all changes.

lhotari avatar Feb 26 '24 05:02 lhotari

I guess there's also the possibility of "double release" and "use after release" bugs with Netty ByteBufs.

In Netty, there's the leak detector for detecting when you don't release buffers, but there seems to be nothing for detecting the "use after release" bugs. I guess "double release" would be detected with the io.netty.buffer.AbstractReferenceCountedByteBuf base class which will throw an exception on double release. It leaves the "use after release" bugs undetected.

It seems that the solution might be a Java Agent written with Byte Buddy etc. which would add additional checks with byte code instrumentation when the agent is activated. Thinking something like https://github.com/reactor/BlockHound but for a completely different purpose, to help detect "use after release" bugs.

lhotari avatar Feb 26 '24 05:02 lhotari

The 2nd concern is, I'm afraid currently Pulsar allows a recycled object is accessed with a "null check". We need to investigate such cases.

Getting back to this one more time. Netty protects against most "use after release" ByteBuf bugs by setting the fields to null and the NPEs would be popping up as a sign of issues. Therefore it's extremely important that NPEs aren't suppressed with null checks.

lhotari avatar Feb 26 '24 06:02 lhotari

Pattern used in Netty to detect "use after release" bugs:

https://github.com/netty/netty/blob/c07dec11a90b82674db1529d0402af8e62d6589f/buffer/src/main/java/io/netty/buffer/AbstractByteBuf.java#L1431-L1456

https://github.com/netty/netty/blob/94cfa608a8071e8a9005e0d52e0199b3584f4ae7/buffer/src/main/java/io/netty/buffer/ByteBuf.java#L2485-L2491

lhotari avatar May 17 '24 09:05 lhotari

There are 2 ByteBuf reference count handling issues in Bookkeeper client that have been recently fixed: https://github.com/apache/bookkeeper/pull/4289 and https://github.com/apache/bookkeeper/pull/4293 . These fixes are expected to be delivered in Bookkeeper 4.16.6 and 4.17.1 so that we could deliver the fixes to Pulsar releases 3.0.6, 3.2.4 and 3.3.1.

lhotari avatar May 24 '24 09:05 lhotari

I recently discovered another bug pattern with ByteBufs which is due to an incorrect assumption of how Netty ByteBuf reference counting works for derived buffers.

Writing some pseudo code to explain:

ByteBuf buf = PulsarByteBufAllocator.DEFAULT.buffer(16);
...
ByteBuf cachedBuffer = buf.duplicate().retain();
...
buf.release();
...

The assumption could be that the above code is correct. Why is it wrong?

To understand how a duplicate buffer works, one could take a look at the source, in AbstractPooledDerivedByteBuf. https://github.com/netty/netty/blob/5085cef149134951c94a02a743ed70025c8cdad4/buffer/src/main/java/io/netty/buffer/AbstractPooledDerivedByteBuf.java#L32-L37

The reference count of the duplicated buffer (extends AbstractPooledDerivedByteBuf in the case of pooled buffers) is independent of the parent buffer. It will call .release() on the parent buffer when the reference count of the duplicated buffer reaches 0. Understanding this will help use .retain() and .release() correctly when .duplicate() or .slice() is used.

lhotari avatar May 24 '24 09:05 lhotari

One potential source of double release bugs is the race conditions in RangeCache implementation used for the broker cache. There are 2 PRs to address the race conditions: https://github.com/apache/pulsar/pull/22789 and https://github.com/apache/pulsar/pull/22814

lhotari avatar May 31 '24 10:05 lhotari

The Netty 4.1.111.Final upgrade will prevent some problems in this area, more details in #22892

lhotari avatar Jun 19 '24 14:06 lhotari