grpc-java
Race condition in idleTimer & ManagedChannel#enterIdle - refactor needed
Edit (2021/12/15): The following bug is fixed with a hot patch #8746. However, the logic of idleTimer in ManagedChannel is complex, delicate, and bug-prone. Refactoring is needed to make it more robust and easy to understand.
What version of gRPC-Java are you using?
1.41.0, current master: 8382bd8e04347478bcf483319fb0b376f726023d
What is your environment?
Java 11/16, MacOS. Original user report is probably from a Linux server.
What did you expect to see?
We trigger ManagedChannel#enterIdle API periodically while also keeping ManagedChannelImpl#idleTimer with default settings. We expect ManagedChannelImpl to work properly and not to end up in panic.
What did you see instead?
```
Caused by: io.grpc.StatusRuntimeException: INTERNAL: Panic! This is a bug!
    at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:262)
    at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:243)
    at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:156)
    ...
Caused by: java.lang.IllegalStateException: nameResolver is not started
    at com.google.common.base.Preconditions.checkState(Preconditions.java:502)
    at io.grpc.internal.ManagedChannelImpl.shutdownNameResolverAndLoadBalancer(ManagedChannelImpl.java:360)
    at io.grpc.internal.ManagedChannelImpl.enterIdleMode(ManagedChannelImpl.java:422)
    at io.grpc.internal.ManagedChannelImpl.access$900(ManagedChannelImpl.java:118)
    at io.grpc.internal.ManagedChannelImpl$IdleModeTimer.run(ManagedChannelImpl.java:352)
    at io.grpc.internal.Rescheduler$ChannelFutureRunnable.run(Rescheduler.java:103)
    at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:95)
    at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:127)
    at io.grpc.internal.Rescheduler$FutureRunnable.run(Rescheduler.java:80)
    at io.grpc.netty.shaded.io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
    at io.grpc.netty.shaded.io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
```
The channel is broken after this.
Steps to reproduce the bug
- Apply the change to `io.grpc.testing.integration.XdsTestClient`: https://github.com/Spikhalskiy/grpc-java/commit/86a000095bd0632a3b9c54d82097f9d85d2c350a
- Start `io.grpc.testing.integration.XdsTestServer#main`
- Start `io.grpc.testing.integration.XdsTestClient#main` with `--num_channels=1000`
- Find the reported exception in the `XdsTestClient` logs
Relevant report
The original user report in Temporal JavaSDK: https://github.com/temporalio/sdk-java/issues/863
According to the report, the issue happened after a long period of channel inactivity (> idleTimer period).
My best guess is that our periodic #enterIdle shuts down the nameResolver while the still-scheduled idleTimer (cancelled with `cancel(permanent=false)`) sees a stale `enabled=true` and performs the action anyway. But I don't see how this can happen.
@ejona86 This seems like a bug. Looking at the code, I don't really understand the comment https://github.com/grpc/grpc-java/blob/v1.42.1/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java#L418-L422
It seems not true if enterIdle() is called even before the resolver gets started.
Also, somewhat unrelated: to work around this I had to temporarily disable the idleTimer, and I found this code in `ManagedChannelImplBuilder#idleTimeout`:
```java
if (unit.toDays(value) >= IDLE_MODE_MAX_TIMEOUT_DAYS) {
  // This disables idle mode
  this.idleTimeoutMillis = ManagedChannelImpl.IDLE_TIMEOUT_MILLIS_DISABLE;
}
```
So to disable the idleTimer, I need to pass a value larger than 30 days. Passing `ManagedChannelImpl.IDLE_TIMEOUT_MILLIS_DISABLE` (which is -1) directly doesn't work and leads to an exception.
It would be great if ManagedChannelBuilder provided a better interface for this. I would personally prefer having a public constant equal to -1 and mentioning it in the `ManagedChannelBuilder#idleTimeout` javadoc.
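To illustrate the threshold behavior, here is a standalone model of that logic (the constants are copied from the snippet above; the method name and class are my own, this is a sketch and not the actual gRPC source):

```java
import java.util.concurrent.TimeUnit;

// Standalone model of the builder's idle-timeout handling: any value of 30
// days or more maps to the internal "disable" sentinel (-1), while -1 itself
// is rejected as a non-positive argument.
public class IdleTimeoutSketch {
    static final long IDLE_TIMEOUT_MILLIS_DISABLE = -1;
    static final int IDLE_MODE_MAX_TIMEOUT_DAYS = 30;

    static long idleTimeoutMillis(long value, TimeUnit unit) {
        if (value <= 0) {
            // This is why passing the -1 sentinel directly throws.
            throw new IllegalArgumentException("idle timeout is " + value + ", but must be positive");
        }
        if (unit.toDays(value) >= IDLE_MODE_MAX_TIMEOUT_DAYS) {
            // This disables idle mode.
            return IDLE_TIMEOUT_MILLIS_DISABLE;
        }
        return unit.toMillis(value);
    }

    public static void main(String[] args) {
        // Long.MAX_VALUE nanoseconds is far beyond 30 days, so idle mode is disabled.
        System.out.println(idleTimeoutMillis(Long.MAX_VALUE, TimeUnit.NANOSECONDS)); // prints -1
        System.out.println(idleTimeoutMillis(5, TimeUnit.MINUTES)); // prints 300000
    }
}
```

This is why `idleTimeout(Long.MAX_VALUE, TimeUnit.NANOSECONDS)` works as a "disable" spelling while `idleTimeout(-1, ...)` does not.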
> It seems not true if enterIdle() is called even before the resolver gets started.
Why do you say that? NameResolver is initialized in the constructor. https://github.com/grpc/grpc-java/blob/408ffe86b0414b8dede350b86be9d53a7785a449/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java#L662
It looks like this is a timer bug. The timer firing is racing with the channel being forced into idle (which cancels the timer) by the outside caller.
enterIdle() has a check to avoid re-entering idle mode (lbHelper == null):
https://github.com/grpc/grpc-java/blob/408ffe86b0414b8dede350b86be9d53a7785a449/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java#L1353-L1366
But the idle timer doesn't, nor does Rescheduler, as it just treats the SynchronizationContext as an Executor. https://github.com/grpc/grpc-java/blob/408ffe86b0414b8dede350b86be9d53a7785a449/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java#L348-L354
We need to either make the timer cancellation non-racy, or we need to add double-check behavior in IdleModeTimer. I generally prefer making timer cancellation non-racy, but Rescheduler will probably make that more annoying.
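As a sketch of the "double-check in IdleModeTimer" option: the timer callback re-verifies that the channel is still out of idle before acting. The model below is standalone; `lbHelper` and `enterIdleMode()` are simplified stand-ins for the real `ManagedChannelImpl` members, not the actual implementation.

```java
// Standalone model of adding a double-check to the idle timer: before doing
// any work, the timer verifies the channel has not already gone idle.
// "lbHelper != null" stands in for "channel is out of idle".
public class IdleTimerDoubleCheck {
    Object lbHelper = new Object(); // non-null: channel is active (not IDLE)
    int enterIdleModeCalls = 0;

    final Runnable idleModeTimer = () -> {
        if (lbHelper == null) {
            // Channel already went idle (e.g. an external enterIdle() won the
            // race); firing now would trip "nameResolver is not started".
            return;
        }
        enterIdleMode();
    };

    void enterIdleMode() {
        lbHelper = null;
        enterIdleModeCalls++;
    }

    // Simulates an external enterIdle() cancelling too late: the timer task
    // was already queued, but the double-check makes the late firing a no-op.
    public static void main(String[] args) {
        IdleTimerDoubleCheck ch = new IdleTimerDoubleCheck();
        ch.enterIdleMode();      // external caller forces idle first
        ch.idleModeTimer.run();  // stale timer still fires...
        System.out.println(ch.enterIdleModeCalls); // prints 1 (no second entry)
    }
}
```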
For disabling idleTimeout, the expectation is you'd pass Long.MAX_VALUE or the like. Seems we are missing language like we have on keepAliveTime. We should add something similar to idleTimeout:
> Long.MAX_VALUE nano seconds or an unreasonably large value will disable keepalive
> It seems not true if enterIdle() is called even before the resolver gets started.
>
> Why do you say that? NameResolver is initialized in the constructor.
Sorry, I mixed up NameResolver being null, and NameResolver not started. The comment in the code is correct, the NameResolver is not null. My comment should have been:
The following checkState seems not true if enterIdle() is called even before the resolver gets started:
https://github.com/grpc/grpc-java/blob/408ffe86b0414b8dede350b86be9d53a7785a449/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java#L360
> The following checkState seems not true if enterIdle() is called even before the resolver gets started:
If the resolver is not started then the channel is IDLE, thus there's no point in calling enterIdle(). That's why the current semantics are enterIdle() should only be called if non-IDLE. If it is being called when already IDLE (which is a bug in the caller), there are two possible approaches: 1) stop doing that; fix the code that is calling it inappropriately or 2) change its semantics to allow calling even when IDLE.
@ejona86
Approach 2 is already implemented.
Calling enterIdle is already fine if the channel is IDLE. There is a check inside enterIdle for lbHelper==null that makes sure enterIdle does nothing if already IDLE.
Thoughts about why Approach 1 ("stop doing that; fix the code that is calling it inappropriately") is not solving anything:
1. Following the exception stacktrace, most likely `enterIdle` is called before `idleTimer` fires, and the race happens after `enterIdle` but before `idleTimer`. It's a bit unclear how checking for IDLE before calling `enterIdle` from the caller would help here (the exception about the unexpected state is coming from the `idleTimer`, not `enterIdle`).
2. It will not help either because there is effectively a check inside `enterIdle` for the IDLE state that is running under the SynchronizationContext (https://github.com/grpc/grpc-java/blob/6d260b80318e6d75cdaef8ecacb43b291d885ff9/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java#L1360). An additional check outside can't help here. So it's not a solution. The problem is the race, not a double call to `enterIdle`.
3. `ManagedChannel`'s implementation is supposed to be thread-safe. The Approach 1 requirement to perform a check for IDLE before calling `enterIdle` would require synchronization (because `idleTimer` can fire in between), which makes it non-thread-safe.
4. And what is our caller even supposed to use for synchronization described in (3) to make solution 1 happen?
Disabling idleTimer in the provided reproduction fixes the test. So, the problem is obviously not in enterIdle contract, the issue is in race condition.
The problem is probably that there is no check inside idleTimer, and the cancellation of idleTimer inside enterIdle races with its scheduling/execution. So the absence of a check during execution, plus the race between the cancellation and the already-scheduled execution itself, may lead to the race.
This change in IdleTimer fixes the race and saves all the current contracts:
https://github.com/Spikhalskiy/grpc-java/commit/6d260b80318e6d75cdaef8ecacb43b291d885ff9
But I'm not sure if it's the right fix or the race should be addressed somewhere inside the io.grpc.internal.Rescheduler or io.grpc.SynchronizationContext.
Because I would expect that the `Rescheduler#cancel(permanent = false)` call that sets `enabled=false` for `idleTimer` inside `enterIdle`, and `idleTimer`'s check of the `enabled` flag inside `ChannelFutureRunnable#run`, don't race with each other, because they both run under `SynchronizationContext#execute`. And I would expect that SynchronizationContext guarantees happens-before between executions of Runnables, which it looks like it attempts to provide by using a queue to get happens-before between add and poll, and by using CAS to get happens-before between the processing of polls that could happen in different threads.
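The queue-plus-CAS scheme described above can be modeled roughly like this (a simplified standalone sketch of the idea; the real `io.grpc.SynchronizationContext` is more careful, e.g. it re-checks the queue around releasing the drain ownership):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicReference;

// Simplified model of a serializing executor built from a concurrent queue
// plus a CAS-guarded drain loop: the queue gives happens-before between add
// and poll, and the CAS on "drainer" serializes drain loops across threads.
final class SyncContextSketch {
    private final Queue<Runnable> queue = new ConcurrentLinkedQueue<>();
    private final AtomicReference<Thread> drainer = new AtomicReference<>();

    void execute(Runnable task) {
        queue.add(task);
        drain();
    }

    private void drain() {
        // Only one thread drains at a time; a thread that loses the CAS
        // relies on the current drainer to pick up its just-enqueued task.
        if (!drainer.compareAndSet(null, Thread.currentThread())) {
            return;
        }
        try {
            Runnable r;
            while ((r = queue.poll()) != null) {
                r.run();
            }
        } finally {
            drainer.set(null);
        }
    }

    public static void main(String[] args) {
        SyncContextSketch ctx = new SyncContextSketch();
        List<Integer> order = new ArrayList<>();
        ctx.execute(() -> order.add(1));
        ctx.execute(() -> order.add(2));
        System.out.println(order); // prints [1, 2]
    }
}
```

Under this model, two tasks submitted through `execute` never run concurrently, which is why a stale `enabled=true` read would be surprising if both the cancel and the check really go through the context.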
But obviously, according to the fact that https://github.com/Spikhalskiy/grpc-java/commit/6d260b80318e6d75cdaef8ecacb43b291d885ff9 fixes the issue, something is missing. We probably read a stale version of enabled and that's why this additional lbHelper==null check in idleTimer helps.
So, that's why I don't propose https://github.com/Spikhalskiy/grpc-java/commit/6d260b80318e6d75cdaef8ecacb43b291d885ff9 as a fix because this looks like something may be broken with SynchronizationContext and this change may be treating a symptom instead of a bigger issue.
The channel is initially IDLE, and if the idleTimer expires too soon, it will try to enterIdleMode() while still in the initial IDLE state.
I think Spikhalskiy@6d260b8 is the right fix.
@ejona86 what do you think?
@dapengzhang0, I think checking for lbHelper == null would fix this bug. It does make some parts of the code unclear (e.g., why wouldn't enterIdleMode() just do the check?) and can hide other bugs. But it is fast and easy and seems a fair initial fix.
> But I'm not sure if it's the right fix or the race should be addressed somewhere inside the `io.grpc.internal.Rescheduler` or `io.grpc.SynchronizationContext`. Because I would expect that the `Rescheduler#cancel(permanent = false)` call that sets `enabled=false` for `idleTimer` inside `enterIdle`, and `idleTimer`'s check of the `enabled` flag inside `ChannelFutureRunnable#run`, don't race with each other, because they both run under `SynchronizationContext#execute`.
I think the cause of enabled spuriously becoming true might be:
In enterIdleMode(), ManagedChannelImpl.inUseStateAggregator was not checking the in-use status of the InternalSubchannel, unlike exitIdleMode():
https://github.com/grpc/grpc-java/blob/df4ac5973c6d382175feea609d684c26f467bc2f/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java#L429
So even though Rescheduler#cancel(permanent = false) sets enabled=false, when the InternalSubchannel later becomes not in-use, rescheduleIdleTimer(), namely Rescheduler.enabled=true, can be called by
https://github.com/grpc/grpc-java/blob/df4ac5973c6d382175feea609d684c26f467bc2f/core/src/main/java/io/grpc/internal/ManagedChannelImpl.java#L2199-L2205
We believe this is a regression introduced by #8425, which was part of 1.41.
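The suspected re-arming sequence can be condensed into a standalone state model. The names mirror Rescheduler/ManagedChannelImpl, but this is only an illustrative sketch of the sequence described above, not gRPC code:

```java
// Standalone model of the suspected regression: enterIdle() cancels the idle
// timer (enabled=false), but a later not-in-use notification from a subchannel
// re-arms it (enabled=true) while the channel is already idle, so the timer
// later fires against an idle channel and trips the
// "nameResolver is not started" checkState.
public class IdleRearmModel {
    boolean enabled = true;   // Rescheduler's flag for the idle timer
    boolean channelIdle = false;

    void enterIdle() {
        channelIdle = true;
        enabled = false;      // Rescheduler#cancel(permanent = false)
    }

    // Models InUseStateAggregator noticing the subchannel go not-in-use
    // after the channel already entered idle, and re-arming the timer.
    void handleSubchannelNotInUse() {
        enabled = true;       // rescheduleIdleTimer()
    }

    // What the timer consults when it finally fires: only "enabled",
    // with no check that the channel is already idle -- the bug.
    boolean timerWillEnterIdleMode() {
        return enabled;
    }

    public static void main(String[] args) {
        IdleRearmModel ch = new IdleRearmModel();
        ch.enterIdle();                 // external caller forces idle
        ch.handleSubchannelNotInUse();  // stale in-use state drains afterwards
        // Timer fires on an idle channel -> enterIdleMode() -> checkState fails.
        System.out.println(ch.timerWillEnterIdleMode()); // prints true
    }
}
```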