
Occasional NullPointerException in io.grpc.internal.RetriableStream.drain(RetriableStream.java:279) when using hedgingPolicy

Open srmrz opened this issue 3 years ago • 9 comments

What version of gRPC-Java are you using?

Version 1.42.2

What is your environment?

Linux

What did you expect to see?

I'm using the hedging retry policy via configuration and occasionally see a NullPointerException pop up.

Here's a snippet of the Kotlin code that configures the hedging policy with an 85ms hedging delay:

.defaultServiceConfig(
    mapOf(
        "loadBalancingPolicy" to "round_robin",
        "methodConfig" to listOf(
            mapOf(
                "name" to listOf(
                    mapOf(
                        "service" to "my.org.Service",
                        "method" to "MyMethod"
                    )
                ),
                "waitForReady" to true,
                "hedgingPolicy" to mapOf(
                    "maxAttempts" to 2.
                    "hedgingDelay" to "0.085s",
                    "nonFatalStatusCodes" to listOf(Status.UNAVAILABLE.code.name)
                )
            )
        )
    )
)
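
For completeness, a sketch of the channel setup this snippet assumes (the target address is hypothetical; numeric values are given as Double since the service-config map is JSON-like):

import io.grpc.ManagedChannelBuilder
import io.grpc.Status

// Hypothetical channel target; service/method names mirror the config above.
val serviceConfig = mapOf(
    "loadBalancingPolicy" to "round_robin",
    "methodConfig" to listOf(
        mapOf(
            "name" to listOf(
                mapOf("service" to "my.org.Service", "method" to "MyMethod")
            ),
            "waitForReady" to true,
            "hedgingPolicy" to mapOf(
                "maxAttempts" to 2.0,          // at most one extra (hedged) attempt
                "hedgingDelay" to "0.085s",    // send the hedge if no response within 85ms
                "nonFatalStatusCodes" to listOf(Status.UNAVAILABLE.code.name)
            )
        )
    )
)

val channel = ManagedChannelBuilder.forTarget("dns:///my.org.example:443")
    .defaultServiceConfig(serviceConfig)
    .build()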

What did you see instead?

java.lang.NullPointerException: null
    at io.grpc.internal.RetriableStream.drain(RetriableStream.java:279)
    at io.grpc.internal.RetriableStream.access$1100(RetriableStream.java:55)
    at io.grpc.internal.RetriableStream$HedgingRunnable$1.run(RetriableStream.java:476)
    at io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:79)
    at io.micrometer.core.instrument.Timer.lambda$wrap$0(Timer.java:160)
    at io.micronaut.scheduling.instrument.InvocationInstrumenterWrappedRunnable.run(InvocationInstrumenterWrappedRunnable.java:47)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.lang.Thread.run(Thread.java:834)

Steps to reproduce the bug

Enable the hedging policy configuration above.

srmrz avatar May 18 '22 19:05 srmrz

Interesting. I think this is a regression from https://github.com/grpc/grpc-java/commit/cbda32a3c1c8e253d8d78afef4b5f934f0db1826 . It seems state.buffer == null, which means passthrough == true, which means some other stream was committed and this stream should be cancelled. But it hadn't yet processed enough of the buffer to start the stream, which means it has trouble cancelling it.
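
To make that interleaving concrete, here is a minimal, hypothetical Kotlin model of the race; it is not the actual RetriableStream code, and all names are invented:

// Hypothetical model of the race described above, not the real RetriableStream:
// the winning attempt commits and drops the shared replay buffer while a hedged
// attempt that has not started yet is still scheduled to drain it.
class HedgeState {
    @Volatile var buffer: MutableList<String>? =
        mutableListOf("start", "sendMessage", "halfClose")
}

// Winning attempt commits: switch to passthrough and discard the replay buffer.
fun commit(state: HedgeState) {
    state.buffer = null
}

// The scheduled hedge drains the buffer to replay the call on a new substream.
// If commit() ran first, the snapshot is null and iterating it throws an NPE,
// mirroring the failure at RetriableStream.drain(RetriableStream.java:279).
fun drain(state: HedgeState) {
    val snapshot = state.buffer
    for (entry in snapshot!!) {
        println("replaying $entry")
    }
}

fun main() {
    val state = HedgeState()
    commit(state)  // original attempt wins the race first...
    drain(state)   // ...then the hedge's drain runs and hits the NPE
}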

ejona86 avatar May 18 '22 20:05 ejona86

This is a race, and it seems it could only be avoided by avoiding hedging. Retries would still be safe.
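
For anyone needing a stopgap, here is a sketch of a retry-only methodConfig entry in the same style as the report above; the backoff values are illustrative assumptions, not recommendations:

import io.grpc.Status

// Drop-in replacement for the "hedgingPolicy" entry above: plain retries avoid
// the race. Backoff values are placeholders to tune for your own latency budget.
val retryOnlyMethodConfig = mapOf(
    "name" to listOf(mapOf("service" to "my.org.Service", "method" to "MyMethod")),
    "waitForReady" to true,
    "retryPolicy" to mapOf(
        "maxAttempts" to 2.0,
        "initialBackoff" to "0.1s",
        "maxBackoff" to "1s",
        "backoffMultiplier" to 2.0,
        "retryableStatusCodes" to listOf(Status.UNAVAILABLE.code.name)
    )
)

Note the trade-off: unlike hedging, a retry is only sent after the first attempt has actually failed, so it will not cut tail latency in the same way.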

ejona86 avatar May 18 '22 20:05 ejona86

I see. Thanks @ejona86 for triaging this so quickly.

Hedging is working really well for us to reduce tail latency. I'm hoping we can continue using it and help resolve the race condition.

srmrz avatar May 18 '22 20:05 srmrz

Actually... what was the impact to the RPC in this case? It looks like this might just be logged but the RPC could still complete successfully. Which means things may mostly work if you just ignore the log.

There is a risk of a memory leak developing, but since it didn't end up starting the stream, I think gRPC wouldn't leak any memory when using the Netty transport and a connection is available. If there is no connection available, it might leak.

ejona86 avatar May 18 '22 21:05 ejona86

This is a race, and it seems it could only be avoided by avoiding hedging

This was in the context of a workaround for the problem.

ejona86 avatar May 18 '22 21:05 ejona86

Actually... what was the impact to the RPC in this case? It looks like this might just be logged but the RPC could still complete successfully. Which means things may mostly work if you just ignore the log.

From what I can tell, all requests are succeeding. This is just logged on the side.

That said, I did see a single instance of this message:

LEAK: ByteBuf.release() was not called before it's garbage-collected. See https://netty.io/wiki/reference-counted-objects.html for more information.
Recent access records: 
Created at:
	io.grpc.netty.shaded.io.netty.buffer.PooledByteBufAllocator.newDirectBuffer(PooledByteBufAllocator.java:402)
	io.grpc.netty.shaded.io.netty.buffer.AbstractByteBufAllocator.directBuffer(AbstractByteBufAllocator.java:187)
	io.grpc.netty.shaded.io.netty.buffer.AbstractByteBufAllocator.buffer(AbstractByteBufAllocator.java:123)
	io.grpc.netty.shaded.io.grpc.netty.NettyWritableBufferAllocator.allocate(NettyWritableBufferAllocator.java:51)
	io.grpc.internal.MessageFramer.writeKnownLengthUncompressed(MessageFramer.java:226)
	io.grpc.internal.MessageFramer.writeUncompressed(MessageFramer.java:168)
	io.grpc.internal.MessageFramer.writePayload(MessageFramer.java:141)
	io.grpc.internal.AbstractStream.writeMessage(AbstractStream.java:65)
	io.grpc.internal.ForwardingClientStream.writeMessage(ForwardingClientStream.java:37)
	io.grpc.internal.RetriableStream$1SendMessageEntry.runWith(RetriableStream.java:549)
	io.grpc.internal.RetriableStream.drain(RetriableStream.java:310)
	io.grpc.internal.RetriableStream.access$1100(RetriableStream.java:55)
	io.grpc.internal.RetriableStream$HedgingRunnable$1.run(RetriableStream.java:476)
	io.micrometer.core.instrument.composite.CompositeTimer.record(CompositeTimer.java:79)
	io.micrometer.core.instrument.Timer.lambda$wrap$0(Timer.java:160)
	io.micronaut.scheduling.instrument.InvocationInstrumenterWrappedRunnable.run(InvocationInstrumenterWrappedRunnable.java:47)
	java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	java.base/java.lang.Thread.run(Thread.java:834)

srmrz avatar May 18 '22 21:05 srmrz

Since writing a message has to be done after start(), it feels like that buffer leak should be unrelated. But that stack trace is in the same hedging draining code, which does make it suspiciously related.

ejona86 avatar May 18 '22 21:05 ejona86

which does make it suspiciously related.

Yeah!

I do see several TODOs in the class ("TODO(b/145386688)" - https://github.com/grpc/grpc-java/commit/d3c77f2d870baf8c8340890eb5aed590a5f3940c ) that may be related to this race condition 🤷?

Example: https://github.com/grpc/grpc-java/blob/5bb721e21785e51a93f7532355d20e8f3f9b4cce/core/src/main/java/io/grpc/internal/RetriableStream.java#L451-L453

https://github.com/grpc/grpc-java/blob/5bb721e21785e51a93f7532355d20e8f3f9b4cce/core/src/main/java/io/grpc/internal/RetriableStream.java#L168-L169

https://github.com/grpc/grpc-java/blob/5bb721e21785e51a93f7532355d20e8f3f9b4cce/core/src/main/java/io/grpc/internal/RetriableStream.java#L178-L179

https://github.com/grpc/grpc-java/blob/5bb721e21785e51a93f7532355d20e8f3f9b4cce/core/src/main/java/io/grpc/internal/RetriableStream.java#L412-L413

srmrz avatar May 19 '22 17:05 srmrz

I don't think those @GuardedBy TODOs are that important. All of those locks are the same instance, just aliases. So it is more that the code is written in a way that isn't able to be verified by tooling.
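
As a hypothetical (non-grpc-java) illustration of the aliasing point: both fields below refer to the same lock object, which is safe at runtime, but a checker that matches on field names rather than object identity cannot verify it:

// Hypothetical example, not grpc-java code. `lock` and `lockAlias` are the same
// object, so synchronizing on either protects `count` at runtime, but a
// @GuardedBy-style checker keyed on the field name "lock" cannot prove that.
class Counter {
    private val lock = Any()
    private val lockAlias = lock // alias of the same lock instance

    private var count = 0 // conceptually guarded by `lock`

    fun increment() {
        synchronized(lockAlias) { // correct, but unverifiable by name-based tooling
            count++
        }
    }
}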

ejona86 avatar May 19 '22 18:05 ejona86

I believe this issue became more serious in 1.53. There was a change in 1.53 that fixed some lifecycle issues, but when this is triggered I think it would cause the RPC to hang and never receive its final status. So if you are seeing this issue, you may need to avoid 1.53 and 1.54 until a patch is out.
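
If it helps anyone pinning versions in the meantime, a minimal build.gradle.kts sketch; the module and version are just an example of a pre-1.53 release, adjust to your own dependency list:

// build.gradle.kts (hypothetical, abbreviated to one module): stay on a
// pre-1.53 gRPC release until a patched version is available.
dependencies {
    implementation("io.grpc:grpc-netty-shaded:1.52.1")
}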

ejona86 avatar Apr 03 '23 22:04 ejona86