grpc-java
Occasional NullPointerException in io.grpc.internal.RetriableStream.drain(RetriableStream.java:279) when using hedgingPolicy
What version of gRPC-Java are you using?
Version 1.42.2
What is your environment?
Linux
What did you expect to see?
I'm using the hedging retry policy via configuration and occasionally see a NullPointerException pop up.
Here's a snippet of the Kotlin code that configures the hedging policy with an 85ms hedging delay:
.defaultServiceConfig(
    mapOf(
        "loadBalancingPolicy" to "round_robin",
        "methodConfig" to listOf(
            mapOf(
                "name" to listOf(
                    mapOf(
                        "service" to "my.org.Service",
                        "method" to "MyMethod"
                    )
                ),
                "waitForReady" to true,
                "hedgingPolicy" to mapOf(
                    "maxAttempts" to 2.0,
                    "hedgingDelay" to "0.085s",
                    "nonFatalStatusCodes" to listOf(Status.UNAVAILABLE.code.name)
                )
            )
        )
    )
)
What did you see instead?
java.lang.NullPointerException: null
at io.grpc.internal.RetriableStream.drain(RetriableStream.java:279)
at io.grpc.internal.RetriableStream.access$1100(RetriableStream.java:55)
at io.grpc.internal.RetriableStream$HedgingRunnable$1.run(RetriableStream.java:476)
at io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:79)
at io.micrometer.core.instrument.Timer.lambda$wrap$0(Timer.java:160)
at io.micronaut.scheduling.instrument.InvocationInstrumenterWrappedRunnable.run(InvocationInstrumenterWrappedRunnable.java:47)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:834)
Steps to reproduce the bug
Enable the hedging policy shown above.
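For completeness, here is a minimal channel-setup sketch. The target is a placeholder and the explicit enableRetry() call is an assumption for clarity; the service/method names mirror the config above.

import io.grpc.ManagedChannel
import io.grpc.ManagedChannelBuilder
import io.grpc.Status

// Builds a channel with the hedging service config from the report.
fun buildHedgingChannel(target: String): ManagedChannel {
    val serviceConfig = mapOf(
        "loadBalancingPolicy" to "round_robin",
        "methodConfig" to listOf(
            mapOf(
                "name" to listOf(
                    mapOf("service" to "my.org.Service", "method" to "MyMethod")
                ),
                "waitForReady" to true,
                "hedgingPolicy" to mapOf(
                    "maxAttempts" to 2.0, // service-config numbers are passed as Doubles
                    "hedgingDelay" to "0.085s",
                    "nonFatalStatusCodes" to listOf(Status.UNAVAILABLE.code.name)
                )
            )
        )
    )
    return ManagedChannelBuilder.forTarget(target)
        .defaultServiceConfig(serviceConfig)
        .enableRetry() // retries/hedging are on by default in recent releases; explicit here
        .build()
}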
Interesting. I think this is a regression from https://github.com/grpc/grpc-java/commit/cbda32a3c1c8e253d8d78afef4b5f934f0db1826 . It seems state.buffer == null, which means passthrough == true, which means some other stream was committed and this stream should be cancelled. But it hadn't processed enough of the buffer yet to start the stream, which means it has trouble cancelling it.
This is a race, and it seems it could only be avoided by avoiding hedging. Retries would still be safe.
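Roughly, the shape of the race looks like the sketch below. This is illustrative only, not the actual RetriableStream code; the names are simplified stand-ins, and the latch just forces the losing interleaving deterministically.

import java.util.concurrent.CountDownLatch
import kotlin.concurrent.thread

class State {
    // Shared buffer of deferred stream operations; cleared when some attempt commits.
    @Volatile var buffer: MutableList<Runnable>? =
        mutableListOf(Runnable { println("start hedged attempt") })
}

fun main() {
    val state = State()
    val committed = CountDownLatch(1)

    // Hedging thread: after the hedging delay it drains the buffer to start the next attempt.
    val drainer = thread {
        committed.await()
        // If another attempt was committed first, the buffer has already been cleared
        // ("passthrough" mode) and this dereference throws an NPE -- and since the
        // attempt never started, it also cannot be cancelled cleanly.
        for (entry in state.buffer!!) {
            entry.run()
        }
    }

    // Another attempt gets committed concurrently, which clears the shared buffer.
    state.buffer = null
    committed.countDown()
    drainer.join()
}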
I see. Thanks @ejona86 for triaging this so quickly.
Hedging is working really well for us to reduce tail latency. I'm hoping we can continue using it and help resolve the race condition.
Actually... what was the impact to the RPC in this case? It looks like this might just be logged but the RPC could still complete successfully. Which means things may mostly work if you just ignore the log.
There is a risk of a memory leak developing, but since it didn't end up starting the stream, I think gRPC wouldn't leak any memory when using the Netty transport and a connection is available. If there is no connection available, it might leak.
This is a race, and it seems it could only be avoided by avoiding hedging
This was in the context of a workaround for the problem.
Actually... what was the impact to the RPC in this case? It looks like this might just be logged but the RPC could still complete successfully. Which means things may mostly work if you just ignore the log.
From what I can tell, all requests are succeeding. This is just logged on the side.
That said, I did see a single instance of this message:
LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records:
Created at:
io.grpc.netty.shaded.io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:402)
io.grpc.netty.shaded.io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
io.grpc.netty.shaded.io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:123)
io.grpc.netty.shaded.io.grpc.netty.NettyWritableBufferAllocator.allocate(NettyWritableBufferAllocator.java:51)
io.grpc.internal.MessageFramer.writeKnownLengthUncompressed(MessageFramer.java:226)
io.grpc.internal.MessageFramer.writeUncompressed(MessageFramer.java:168)
io.grpc.internal.MessageFramer.writePayload(MessageFramer.java:141)
io.grpc.internal.AbstractStream.writeMessage(AbstractStream.java:65)
io.grpc.internal.ForwardingClientStream.writeMessage(ForwardingClientStream.java:37)
io.grpc.internal.RetriableStream$1SendMessageEntry.runWith(RetriableStream.java:549)
io.grpc.internal.RetriableStream.drain(RetriableStream.java:310)
io.grpc.internal.RetriableStream.access$1100(RetriableStream.java:55)
io.grpc.internal.RetriableStream$HedgingRunnable$1.run(RetriableStream.java:476)
io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:79)
io.micrometer.core.instrument.Timer.lambda$wrap$0(Timer.java:160)
io.micronaut.scheduling.instrument.InvocationInstrumenterWrappedRunnable.run(InvocationInstrumenterWrappedRunnable.java:47)
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
java.base/java.lang.Thread.run(Thread.java:834)
Since writing a message has to be done after start(), it feels like that buffer leak should be unrelated. But that stack trace is in the same hedging draining code, which does make it suspiciously related.
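For reference, the normal ClientCall ordering is start() first, then sendMessage(). A minimal sketch with placeholder method, request, and listener parameters:

import io.grpc.CallOptions
import io.grpc.ClientCall
import io.grpc.ManagedChannel
import io.grpc.Metadata
import io.grpc.MethodDescriptor

// Generic helper; the point is only the ordering: a message can only be
// written to a stream that has already been started.
fun <ReqT, RespT> sendUnary(
    channel: ManagedChannel,
    method: MethodDescriptor<ReqT, RespT>,
    request: ReqT,
    listener: ClientCall.Listener<RespT>
) {
    val call = channel.newCall(method, CallOptions.DEFAULT.withWaitForReady())
    call.start(listener, Metadata()) // 1. start the stream
    call.request(1)                  // 2. ask for one response
    call.sendMessage(request)        // 3. write the message (only valid after start)
    call.halfClose()                 // 4. no more messages
}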
which does make it suspiciously related.
Yeah!
I do see several TODOs in the class ("TODO(b/145386688)" - https://github.com/grpc/grpc-java/commit/d3c77f2d870baf8c8340890eb5aed590a5f3940c ) that may be related to this race condition 🤷?
Example: https://github.com/grpc/grpc-java/blob/5bb721e21785e51a93f7532355d20e8f3f9b4cce/core/src/main/java/io/grpc/internal/RetriableStream.java#L451-L453
https://github.com/grpc/grpc-java/blob/5bb721e21785e51a93f7532355d20e8f3f9b4cce/core/src/main/java/io/grpc/internal/RetriableStream.java#L168-L169
https://github.com/grpc/grpc-java/blob/5bb721e21785e51a93f7532355d20e8f3f9b4cce/core/src/main/java/io/grpc/internal/RetriableStream.java#L178-L179
https://github.com/grpc/grpc-java/blob/5bb721e21785e51a93f7532355d20e8f3f9b4cce/core/src/main/java/io/grpc/internal/RetriableStream.java#L412-L413
I don't think those "guarded by" warnings are that important. All those locks are the same instance, just aliases. So it is more that the code is written in a way that can't be verified by tooling.
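A simplified sketch of that aliasing pattern (not the actual RetriableStream fields; it assumes a JSR-305-style @GuardedBy annotation is on the classpath):

import javax.annotation.concurrent.GuardedBy

// Both references point at the same monitor, so the locking is correct at
// runtime, but a static checker such as Error Prone's GuardedByChecker only
// sees two distinct names and cannot prove the alias.
class AliasedLock {
    private val lock = Any()
    private val lockAlias: Any = lock // same object, different name

    @GuardedBy("lock")
    private var counter = 0

    fun increment() {
        synchronized(lockAlias) { // actually holds `lock`, but tooling can't verify it
            counter++
        }
    }
}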
I believe this issue became more serious in 1.53. There was a change in 1.53 that fixed some lifecycle issues, but when this is triggered I think it would cause the RPC to hang and never receive its final status. So if you are seeing this issue, you may need to avoid 1.53 and 1.54 until a patch is out.
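If you do need to pin, something like this Gradle Kotlin DSL sketch would work; 1.52.1 is just an assumed example of a pre-1.53 release.

// build.gradle.kts: pin gRPC artifacts to a pre-1.53 release until a patched version ships.
dependencies {
    implementation(platform("io.grpc:grpc-bom:1.52.1"))
    implementation("io.grpc:grpc-netty-shaded")
    implementation("io.grpc:grpc-protobuf")
    implementation("io.grpc:grpc-stub")
}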