grpc-java
ERROR i.g.n.s.i.n.u.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected.
What version of gRPC-Java are you using?
1.34.1
What is your environment?
Linux alpine image on a pod. JDK 11
What did you expect to see?
no errors
What did you see instead?
error message
Steps to reproduce the bug
I don't know exactly how to reproduce it, but for me it is persistent behavior when I asynchronously pull a huge number of messages (2-3 million, each 2-3 KB) from a Pub/Sub subscription.
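For context, the pull pattern looks roughly like the following (a minimal sketch assuming the google-cloud-pubsub streaming-pull Subscriber; the project and subscription names are placeholders, and the exact client version used in this report is unknown):

import com.google.cloud.pubsub.v1.AckReplyConsumer;
import com.google.cloud.pubsub.v1.MessageReceiver;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;
import com.google.pubsub.v1.PubsubMessage;

public class AsyncPullRepro {
  public static void main(String[] args) {
    ProjectSubscriptionName subscription =
        ProjectSubscriptionName.of("my-project", "my-subscription");
    // Each received message is 2-3 KB; millions of them flow through the
    // shaded Netty transport underneath the client library.
    MessageReceiver receiver = (PubsubMessage message, AckReplyConsumer consumer) -> {
      // ... process message.getData() ...
      consumer.ack();
    };
    Subscriber subscriber = Subscriber.newBuilder(subscription, receiver).build();
    subscriber.startAsync().awaitRunning();
    // Keep pulling; the leak warnings appear on the grpc-default-worker threads.
    // subscriber.stopAsync().awaitTerminated();  // when done
  }
}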
15:01:31.997 [grpc-default-worker-ELG-1-2] ERROR i.g.n.s.i.n.u.ResourceLeakDetector - LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
| Recent access records:
| Created at:
| io.grpc.netty.shaded.io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:363)
| io.grpc.netty.shaded.io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
| io.grpc.netty.shaded.io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:178)
| io.grpc.netty.shaded.io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:115)
| io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.allocate(SslHandler.java:2198)
| io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1375)
| io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1275)
| io.grpc.netty.shaded.io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1322)
| io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:501)
| io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:440)
| io.grpc.netty.shaded.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
| io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
| io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
| io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
| io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
| io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
| io.grpc.netty.shaded.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
| io.grpc.netty.shaded.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
| io.grpc.netty.shaded.io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:792)
| io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:475)
| io.grpc.netty.shaded.io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
| io.grpc.netty.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
| io.grpc.netty.shaded.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
| io.grpc.netty.shaded.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
| java.base/java.lang.Thread.run(Thread.java:834)
There's not much to go on here. We call touch() on the outbound flows, but it seems we don't in the inbound flow. Seems we should enhance the server and client code to call touch() when we receive the bytes, and maybe call touch() again when passing the bytes to the application thread.
https://github.com/grpc/grpc-java/blob/0287d83182493a5c50f09edfa3486c8e514b1c95/netty/src/main/java/io/grpc/netty/NettyServerStream.java#L226
https://github.com/grpc/grpc-java/blob/0287d83182493a5c50f09edfa3486c8e514b1c95/netty/src/main/java/io/grpc/netty/NettyClientStream.java#L341
https://github.com/grpc/grpc-java/blob/0287d83182493a5c50f09edfa3486c8e514b1c95/core/src/main/java/io/grpc/internal/MessageDeframer.java#L412
Although generally we'd expect both of those touches to occur, and instead some interceptor loses the reference, or a Runnable is rejected, which leaks the bytes.
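For illustration, the kind of change being suggested would look roughly like this (a hypothetical sketch; the hint strings and call sites are assumptions, and touch() only adds entries to the leak report when Netty's leak detection level records access hints):

import io.grpc.netty.shaded.io.netty.buffer.ByteBuf;

final class InboundTouchSketch {
  // Called when the transport receives a DATA frame's bytes from Netty.
  void onDataRead(ByteBuf data) {
    // Record that gRPC saw the buffer on the inbound path; this would show up
    // in the leak detector's "Recent access records" if the buffer later leaks.
    data.touch("inbound data received by gRPC transport");
  }

  // Called when handing the bytes over toward the application thread.
  void deliverToApplication(ByteBuf data) {
    data.touch("inbound data handed to MessageDeframer");
  }
}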
We're seeing this sporadically on outbound messages, rather than inbound. The gRPC part of the stack trace looks like this:
2022-03-31 02:57:43,525 ERROR [grpc-default-worker-ELG-1-5] io.netty.util.ResourceLeakDetector LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records:
Created at:
io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:402)
io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:188)
io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:124)
io.grpc.netty.NettyWritableBufferAllocator.allocate(NettyWritableBufferAllocator.java:51)
io.grpc.internal.MessageFramer.writeRaw(MessageFramer.java:285)
io.grpc.internal.MessageFramer.access$400(MessageFramer.java:43)
io.grpc.internal.MessageFramer$OutputStreamAdapter.write(MessageFramer.java:375)
com.google.protobuf.CodedOutputStream$OutputStreamEncoder.doFlush(CodedOutputStream.java:3062)
com.google.protobuf.CodedOutputStream$OutputStreamEncoder.flushIfNotAvailable(CodedOutputStream.java:3057)
com.google.protobuf.CodedOutputStream$OutputStreamEncoder.writeUInt32NoTag(CodedOutputStream.java:2885)
com.google.protobuf.CodedOutputStream$OutputStreamEncoder.writeTag(CodedOutputStream.java:2721)
com.google.protobuf.CodedOutputStream$OutputStreamEncoder.writeBytes(CodedOutputStream.java:2774)
com.atlassian.bitbucket.mesh.rpc.v1.RpcGetCommitsRequest.writeTo(RpcGetCommitsRequest.java:736)
com.google.protobuf.AbstractMessageLite.writeTo(AbstractMessageLite.java:83)
io.grpc.protobuf.lite.ProtoInputStream.drainTo(ProtoInputStream.java:52)
io.grpc.internal.MessageFramer.writeToOutputStream(MessageFramer.java:267)
io.grpc.internal.MessageFramer.writeKnownLengthUncompressed(MessageFramer.java:229)
io.grpc.internal.MessageFramer.writeUncompressed(MessageFramer.java:168)
io.grpc.internal.MessageFramer.writePayload(MessageFramer.java:141)
io.grpc.internal.AbstractStream.writeMessage(AbstractStream.java:65)
io.grpc.internal.ForwardingClientStream.writeMessage(ForwardingClientStream.java:37)
io.grpc.internal.RetriableStream$1SendMessageEntry.runWith(RetriableStream.java:549)
io.grpc.internal.RetriableStream.delayOrExecute(RetriableStream.java:526)
io.grpc.internal.RetriableStream.sendMessage(RetriableStream.java:553)
io.grpc.internal.ClientCallImpl.sendMessageInternal(ClientCallImpl.java:520)
io.grpc.internal.ClientCallImpl.sendMessage(ClientCallImpl.java:506)
io.grpc.ForwardingClientCall.sendMessage(ForwardingClientCall.java:37)
io.grpc.ForwardingClientCall.sendMessage(ForwardingClientCall.java:37)
io.grpc.ForwardingClientCall.sendMessage(ForwardingClientCall.java:37)
io.grpc.ForwardingClientCall.sendMessage(ForwardingClientCall.java:37)
io.grpc.ForwardingClientCall.sendMessage(ForwardingClientCall.java:37)
io.grpc.ForwardingClientCall.sendMessage(ForwardingClientCall.java:37)
io.grpc.stub.ClientCalls.asyncUnaryRequestCall(ClientCalls.java:308)
io.grpc.stub.ClientCalls.asyncUnaryRequestCall(ClientCalls.java:294)
io.grpc.stub.ClientCalls.asyncServerStreamingCall(ClientCalls.java:81)
Could it be that the readableBytes() == 0 check in MessageFramer.close is causing release() to not be called when it should be? (It's obvious that it would prevent release() from being called, but there may be a reason why that's correct that I simply don't know/understand.) There's a comment above it about leaks, but I'm not sure what it's trying to say. (If that code really doesn't expect a value greater than 0, it seems like it ought to log something if that invariant doesn't hold; that would make this easier to diagnose if that really is the cause.) It also seems a bit of a non sequitur that MessageFramer.close would first release the buffer and then call commitToSink, which from what I can tell writes from the buffer that was potentially just released.
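For reference, the shape being questioned is roughly the following (a paraphrase of how the discussion above describes MessageFramer.close, not a copy of the actual grpc-java source):

// Paraphrased shape; field and method names are taken from the comments above.
public void close() {
  if (buffer != null && buffer.readableBytes() == 0) {
    releaseBuffer();                 // release the pending buffer first...
  }
  commitToSink(true, true);          // ...then commit what is buffered to the sink
}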
@bturner, could you file a new issue? It's highly, highly doubtful that inbound and outbound are related.
The check in MessageFramer.close() looks suspicious in general, but I don't think it is related. It looks like it is preventing zero-byte frames from going out to the transport. releaseBuffer() nulls out the buffer reference, so the just-released buffer won't be used. I vaguely remember we used to treat zero-length buffers specially for a while; I'm not sure if that check is still necessary, though.
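In other words, the release-then-commit ordering is safe for roughly this reason (a sketch of the behavior described, not a copy of the source):

private void releaseBuffer() {
  if (buffer != null) {
    buffer.release();
    buffer = null;  // commitToSink() then sees no pending buffer to write from
  }
}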
In #10542, I discovered that we essentially already had touches when we do retain(). Since there wasn't any such retain in the recent access records, that means gRPC didn't take ownership of the buffers. There may have been a bug in Netty, or an exception might have caused a rare code flow.
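For illustration, that ownership transfer would leave a trace roughly like the following (the hint string is hypothetical; touch() piggybacks on the retain so the leak detector's access records show gRPC taking the buffer):

// If gRPC had taken ownership, a record like this would appear among the
// "Recent access records"; its absence suggests the buffer never reached gRPC.
ByteBuf owned = data.retain().touch("NettyReadableBuffer");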
There haven't been any other reports of this, and in #10542 I am improving the touch() calls like I mentioned in an earlier comment. I think this is resolved, as there's not much more we can do and it may not have been our bug. We'd be interested in any new reports, though.