
Race condition in idleTimer & ManagedChannel#enterIdle - refactor needed

Open Spikhalskiy opened this issue 3 years ago • 11 comments

Edit (2021/12/15): The bug below was fixed by hot patch #8746. However, the logic of idleTimer in ManagedChannel is complex, delicate, and bug-prone. Refactoring is needed to make it more robust and easier to understand.

What version of gRPC-Java are you using?

1.41.0, current master: 8382bd8e04347478bcf483319fb0b376f726023d

What is your environment?

Java 11/16, MacOS. Original user report is probably from a Linux server.

What did you expect to see?

We trigger the ManagedChannel#enterIdle API periodically while also keeping ManagedChannelImpl#idleTimer at its default settings. We expect ManagedChannelImpl to work properly and not end up in a panic.
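
For context, a minimal sketch of this usage pattern (the target, period, and scheduler below are illustrative, not taken from the original report):

    import io.grpc.ManagedChannel;
    import io.grpc.ManagedChannelBuilder;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    public final class PeriodicEnterIdle {
      public static void main(String[] args) {
        // Channel built with the default idle timeout, so the internal idleTimer stays active.
        ManagedChannel channel = ManagedChannelBuilder
            .forTarget("localhost:8080") // hypothetical target
            .usePlaintext()
            .build();

        // Periodically force the channel into idle from the outside; this is what
        // ends up racing with the channel's own idleTimer.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(channel::enterIdle, 1, 1, TimeUnit.MINUTES);
      }
    }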

What did you see instead?

    Caused by: io.grpc.StatusRuntimeException: INTERNAL: Panic! This is a bug!
        at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
        at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
        at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
        ...

    Caused by: java.lang.IllegalStateException: nameResolver is not started
        at com.google.common.base.Preconditions.checkState(Preconditions.java:502)
        at io.grpc.internal.ManagedChannelImpl.shutdownNameResolverAndLoadBalancer(ManagedChannelImpl.java:360)
        at io.grpc.internal.ManagedChannelImpl.enterIdleMode(ManagedChannelImpl.java:422)
        at io.grpc.internal.ManagedChannelImpl.access$900(ManagedChannelImpl.java:118)
        at io.grpc.internal.ManagedChannelImpl$IdleModeTimer.run(ManagedChannelImpl.java:352)
        at io.grpc.internal.Rescheduler$ChannelFutureRunnable.run(Rescheduler.java:103)
        at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95)
        at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127)
        at io.grpc.internal.Rescheduler$FutureRunnable.run(Rescheduler.java:80)
        at io.grpc.netty.shaded.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
        at io.grpc.netty.shaded.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)

The channel is broken after this.

Steps to reproduce the bug

  1. Apply the change to io.grpc.testing.integration.XdsTestClient: https://github.com/Spikhalskiy/grpc-java/commit/86a000095bd0632a3b9c54d82097f9d85d2c350a
  2. Start io.grpc.testing.integration.XdsTestServer#main
  3. Start io.grpc.testing.integration.XdsTestClient#main with --num_channels=1000
  4. Find the reported exception in XdsTestClient logs

Relevant report

The original user report in Temporal JavaSDK: https://github.com/temporalio/sdk-java/issues/863. According to the report, the issue happened after a long period of channel inactivity (> idleTimer period). My best guess is that our periodic #enterIdle shuts down the nameResolver while the still-scheduled idleTimer (cancelled with cancel(permanent=false)) sees a stale enabled=true and performs the action. But I don't see how this can happen.

Spikhalskiy · Nov 18 '21 22:11

@ejona86 This seems to be a bug. Looking at the code, I don't really understand the comment at https://github.com/grpc/grpc-java/blob/v1.42.1/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java#L418-L422

It seems not true if enterIdle() is called even before the resolver gets started.

dapengzhang0 · Nov 19 '21 19:11

Also, somewhat off-topic, but to work around this I have to temporarily disable the idleTimer, and I found this code in ManagedChannelImplBuilder#idleTimeout:

    if (unit.toDays(value) >= IDLE_MODE_MAX_TIMEOUT_DAYS) {
      // This disables idle mode
      this.idleTimeoutMillis = ManagedChannelImpl.IDLE_TIMEOUT_MILLIS_DISABLE;

So to disable the idleTimer, I need to pass a value of 30 days or more. Passing ManagedChannelImpl.IDLE_TIMEOUT_MILLIS_DISABLE (which is -1) directly doesn't work and leads to an exception. It would be great if ManagedChannelBuilder provided a better interface for this. I would personally prefer having a public constant equal to -1 and mentioning it in the ManagedChannelBuilder#idleTimeout javadoc.
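
For reference, a sketch of this workaround (a fragment; assumes the usual io.grpc and java.util.concurrent imports, and a hypothetical target). Per the snippet above, any timeout whose day equivalent reaches IDLE_MODE_MAX_TIMEOUT_DAYS is stored as IDLE_TIMEOUT_MILLIS_DISABLE, so an unreasonably large value effectively turns the idleTimer off:

    // Effectively disables the idle timer: 31 days is >= IDLE_MODE_MAX_TIMEOUT_DAYS,
    // so the builder maps it to IDLE_TIMEOUT_MILLIS_DISABLE internally.
    ManagedChannel channel = ManagedChannelBuilder
        .forTarget("localhost:8080") // hypothetical target
        .idleTimeout(31, TimeUnit.DAYS)
        .usePlaintext()
        .build();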

Spikhalskiy · Nov 19 '21 20:11

It seems not true if enterIdle() is called even before the resolver gets started.

Why do you say that? NameResolver is initialized in the constructor. https://github.com/grpc/grpc-java/blob/408ffe86b0414b8dede350b86be9d53a7785a449/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java#L662

It looks like this is a timer bug. The timer firing is racing with the channel being forced into idle (which cancels the timer) by the outside caller.

enterIdle() has a check to avoid re-entering idle mode (lbHelper == null): https://github.com/grpc/grpc-java/blob/408ffe86b0414b8dede350b86be9d53a7785a449/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java#L1353-L1366

But the idle timer doesn't, nor does Rescheduler, as it just treats the SynchronizationContext as an Executor. https://github.com/grpc/grpc-java/blob/408ffe86b0414b8dede350b86be9d53a7785a449/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java#L348-L354

We need to either make the timer cancellation non-racy, or we need to add double-check behavior in IdleModeTimer. I generally prefer making timer cancellation non-racy, but Rescheduler will probably make that more annoying.
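
For illustration, a sketch (not the actual grpc-java source or patch) of the double-check option, mirroring the lbHelper == null guard that enterIdle() already has:

    // Inside ManagedChannelImpl: the timer re-verifies, within the SynchronizationContext,
    // that the channel is not already idle before entering idle mode, so a stale firing
    // becomes a no-op instead of a panic.
    class IdleModeTimer implements Runnable {
      @Override
      public void run() {
        if (lbHelper == null) {
          // Channel is already idle (e.g. forced by enterIdle()); the nameResolver has been
          // shut down, so calling enterIdleMode() again would trip the
          // "nameResolver is not started" checkState.
          return;
        }
        enterIdleMode();
      }
    }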


For disabling idleTimeout, the expectation is that you'd pass Long.MAX_VALUE or the like. It seems we are missing language like we have on keepAliveTime; we should add something similar to idleTimeout:

Long.MAX_VALUE nano seconds or an unreasonably large value will disable keepalive

ejona86 · Nov 20 '21 00:11

It seems not true if enterIdle() is called even before the resolver gets started.

Why do you say that? NameResolver is initialized in the constructor.

Sorry, I mixed up NameResolver being null and NameResolver not being started. The comment in the code is correct; the NameResolver is not null. My comment should have been:

The following checkState seems not true if enterIdle() is called even before the resolver gets started:

https://github.com/grpc/grpc-java/blob/408ffe86b0414b8dede350b86be9d53a7785a449/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java#L360

dapengzhang0 · Nov 22 '21 22:11

The following checkState seems not true if enterIdle() is called even before the resolver gets started:

If the resolver is not started, then the channel is IDLE, so there's no point in calling enterIdle(). That's why the current semantics are that enterIdle() should only be called when non-IDLE. If it is being called when already IDLE (which is a bug in the caller), there are two possible approaches: 1) stop doing that; fix the code that is calling it inappropriately, or 2) change its semantics to allow calling it even when IDLE.

ejona86 · Nov 22 '21 22:11

@ejona86 Approach 2 is already implemented. Calling enterIdle is already fine if the channel is IDLE. There is a check inside enterIdle for lbHelper==null that makes sure enterIdle does nothing if already IDLE.

Thoughts on why Approach 1 ("stop doing that; fix the code that is calling it inappropriately") does not solve anything:

  1. Following the exception stack trace, enterIdle is most likely called before the idleTimer fires, and the race happens after enterIdle but before the idleTimer runs. It's unclear how checking for IDLE before calling enterIdle from the caller would help here (the exception about the unexpected state comes from the idleTimer, not from enterIdle).
  2. It will not help either, because there is effectively a check for the IDLE state inside enterIdle that runs under the SynchronizationContext (https://github.com/Spikhalskiy/grpc-java/blob/6d260b80318e6d75cdaef8ecacb43b291d885ff9/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java#L1360). An additional check outside can't help here, so it's not a solution. The problem is the race, not a double call to enterIdle.
  3. The ManagedChannel implementation is supposed to be thread-safe. The Approach 1 requirement to check for IDLE before calling enterIdle would require external synchronization (because the idleTimer can fire in between; see the sketch after this list), which defeats that thread-safety.
  4. And what is a caller even supposed to use for the synchronization described in (3) to make Approach 1 work?
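
To make (3) concrete, a hedged sketch of what such a caller-side guard would look like (assuming channel is a plain ManagedChannel and the usual io.grpc imports); even ignoring (1) and (2), the channel's own idleTimer can fire between the two calls:

    // Hypothetical Approach-1 guard on the caller side. The check and the call are not
    // atomic with respect to the channel's internal idleTimer, so the caller cannot rule
    // out the race without synchronization that the public API does not expose.
    if (channel.getState(/* requestConnection= */ false) != ConnectivityState.IDLE) {
      // <-- the idleTimer may fire here and move the channel to IDLE
      channel.enterIdle();
    }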

Disabling the idleTimer in the provided reproduction fixes the test. So the problem is clearly not the enterIdle contract; the issue is a race condition. The problem is probably that there is no check inside the idleTimer, and the cancellation of the idleTimer inside enterIdle races with its scheduling/execution. The absence of a check during execution, combined with the race between cancelling the scheduled task and the scheduled execution itself, can lead to the panic.

Spikhalskiy · Nov 23 '21 00:11

This change in IdleTimer fixes the race and preserves all the current contracts: https://github.com/Spikhalskiy/grpc-java/commit/6d260b80318e6d75cdaef8ecacb43b291d885ff9

But I'm not sure if it's the right fix, or whether the race should instead be addressed somewhere inside io.grpc.internal.Rescheduler or io.grpc.SynchronizationContext, because I would expect that

  1. Rescheduler#cancel(permanent = false) call that sets enabled=false for idleTimer inside enterIdle and
  2. idleTimer's check of the enabled flag inside ChannelFutureRunnable#run

don't race with each other, because they both run under SynchronizationContext#execute. I would expect SynchronizationContext to guarantee happens-before between executions of Runnables, which it appears to attempt by using a queue to get happens-before between add and poll, and by using CAS to get happens-before between drains that could happen on different threads.
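
As an illustration of that expectation, a simplified model (not the grpc-java source) of the queue + CAS drain pattern described above:

    import java.util.Queue;
    import java.util.concurrent.ConcurrentLinkedQueue;
    import java.util.concurrent.atomic.AtomicReference;

    // Tasks are enqueued; whichever thread wins the CAS drains the queue. Each task's
    // add happens-before its poll, and a completed drain happens-before the next drain
    // that wins the CAS, which is the ordering the comment above expects.
    final class QueueDrainModel {
      private final Queue<Runnable> queue = new ConcurrentLinkedQueue<>();
      private final AtomicReference<Thread> drainingThread = new AtomicReference<>();

      void executeLater(Runnable task) {
        queue.add(task);
      }

      void execute(Runnable task) {
        executeLater(task);
        drain();
      }

      void drain() {
        while (drainingThread.compareAndSet(null, Thread.currentThread())) {
          try {
            Runnable task;
            while ((task = queue.poll()) != null) {
              task.run();
            }
          } finally {
            drainingThread.set(null);
          }
          // A task may have been added after our last poll but before the release above;
          // if so, try to drain again rather than leaving it stranded.
          if (queue.isEmpty()) {
            break;
          }
        }
      }
    }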

But since https://github.com/Spikhalskiy/grpc-java/commit/6d260b80318e6d75cdaef8ecacb43b291d885ff9 fixes the issue, something is obviously missing. We probably read a stale value of enabled, and that's why the additional lbHelper==null check in the idleTimer helps. That's also why I don't propose https://github.com/Spikhalskiy/grpc-java/commit/6d260b80318e6d75cdaef8ecacb43b291d885ff9 as a fix: it looks like something may be broken in SynchronizationContext, and this change may be treating a symptom instead of the bigger issue.

Spikhalskiy · Nov 23 '21 00:11

The channel is initially IDLE, and if the idleTimer expires too soon, it will try to enterIdleMode() while still in that initial IDLE state.

I think Spikhalskiy@6d260b8 is the right fix.

@ejona86 what do you think?

dapengzhang0 · Dec 06 '21 23:12

@dapengzhang0, I think checking for lbHelper == null would fix this bug. It does make some parts of the code unclear (e.g., why wouldn't enterIdleMode() just do the check?) and can hide other bugs. But it is fast and easy and seems a fair initial fix.

ejona86 · Dec 07 '21 16:12

But I'm not sure if it's the right fix, or whether the race should instead be addressed somewhere inside io.grpc.internal.Rescheduler or io.grpc.SynchronizationContext, because I would expect that

  1. Rescheduler#cancel(permanent = false) call that sets enabled=false for idleTimer inside enterIdle and
  2. idleTimer's check of the enabled flag inside ChannelFutureRunnable#run

don't race with each other, because they both run under SynchronizationContext#execute.

I think the cause of enabled spuriously becoming true might be:

When enterIdleMode() is called, the ManagedChannelImpl.inUseStateAggregator is not checking

https://github.com/grpc/grpc-java/blob/df4ac5973c6d382175feea609d684c26f467bc2f/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java#L429

the in-use status of the InternalSubchannel, unlike in exitIdleMode(). So even though Rescheduler#cancel(permanent = false) sets enabled=false, when the InternalSubchannel becomes not in-use, rescheduleIdleTimer() (that is, Rescheduler.enabled=true) can still be called by

https://github.com/grpc/grpc-java/blob/df4ac5973c6d382175feea609d684c26f467bc2f/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java#L2199-L2205
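
Putting this together, a hypothetical reconstruction of the interleaving (based only on this thread; all steps run in the SynchronizationContext, so this is an ordering problem rather than a data race):

    // 1. enterIdle() -> enterIdleMode(): the nameResolver and LB are shut down, and the
    //    idle timer is cancelled non-permanently (Rescheduler#cancel(permanent = false)),
    //    setting Rescheduler.enabled = false.
    // 2. The InternalSubchannel later becomes not in-use; since enterIdleMode() did not
    //    adjust inUseStateAggregator for the subchannel's in-use status (unlike
    //    exitIdleMode()), handleNotInUse() calls rescheduleIdleTimer(), setting
    //    Rescheduler.enabled = true again.
    // 3. After an idle-timeout period of inactivity, the timer fires, sees enabled == true,
    //    and runs IdleModeTimer.
    // 4. IdleModeTimer calls enterIdleMode() on the already-idle channel and hits the
    //    checkState("nameResolver is not started"), i.e., the reported panic.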

dapengzhang0 · Dec 07 '21 20:12

We believe this is a regression introduced by #8425, which was part of 1.41.

ejona86 · Dec 14 '21 19:12