[Orleans 7] DeactivateOnIdle may cause deadlock when grain has timer
Grain code:
public interface ITestGrain : IGrainWithGuidKey
{
Task TimerTick();
Task Connect();
Task Disconnect();
}
public class TestGrain : Grain, ITestGrain
{
private readonly ILogger _logger;
private IDisposable? _timer;
public TestGrain(ILogger<TestGrain> logger)
{
_logger = logger;
}
public Task Connect()
{
_timer = RegisterTimer(
_ => this.AsReference<ITestGrain>().TimerTick(),
null,
TimeSpan.FromSeconds(1),
TimeSpan.FromSeconds(1)
);
return Task.CompletedTask;
}
public async Task Disconnect()
{
await Task.Delay(1000);
DeactivateOnIdle();
}
public async Task TimerTick()
{
await Task.Delay(100);
_logger.LogInformation("Timer ticked!");
}
}
Client code:
[HttpGet]
public async Task Test()
{
var grain = _clusterClient.GetGrain<ITestGrain>(Guid.Empty);
await grain.Connect();
await grain.Disconnect();
}
When i executed client code a few times, It most likely appear deadlock, Then TimeoutException thrown in DeactivateOnIdle method:
Exception when trying to deactivate [Activation: S127.0.0.1:11111:35964260/test/00000000000000000000000000000000@ae30efc925a74d97937d9532889c1d7a#GrainType=OrleansTimerTest.Grains.TestGrain,OrleansTimerTest Placement=RandomPlacement State=Deactivating] System.Threading.Tasks.TaskCanceledException: A task was canceled. at Orleans.Internal.OrleansTaskExtentions.MakeCancellable(Task task, CancellationToken cancellationToken) in /_/src/Orleans.Core/Async/TaskExtensions.cs:line 188 at Orleans.Runtime.ActivationData.FinishDeactivating(CancellationToken cancellationToken) in /_/src/Orleans.Runtime/Catalog/ActivationData.cs:line 1380
After DeactivateOnIdle failed, this grain cannot be activated again, every request to this grain would return error:
OrleansMessageRejectionException: Forwarding failed: tried to forward message Request [S127.0.0.1:11111:35964260 sys.client/hosted-127.0.0.1:11111@35964260]->[S127.0.0.1:11111:35964260 test/00000000000000000000000000000000] OrleansTimerTest.Grains.ITestGrainOrleansTimerTest.Grains.ITestGrain.Connect() #65[ForwardCount=2] for 2 times after "Failed to register activation in grain directory." to invalid activation. Rejecting now. Orleans.Serialization.Invocation.ResponseCompletionSource.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(short token) in ResponseCompletionSource.cs, line 107
And i look up the source code of DeactivateOnIdle, I found it contains some code that is waiting for timer's asyncCallback task completion, I think it is root cause of this problem, The code is at below:
private Task WaitForAllTimersToFinish(CancellationToken cancellationToken)
{
lock (this)
{
if (Timers is null)
{
return Task.CompletedTask;
}
var tasks = new List<Task>();
var timerCopy = Timers.ToList(); // need to copy since OnTimerDisposed will change the timers set.
foreach (var timer in timerCopy)
{
// first call dispose, then wait to finish.
Utils.SafeExecute(timer.Dispose, _shared.Logger, "timer.Dispose has thrown");
tasks.Add(timer.GetCurrentlyExecutingTickTask());
}
return Task.WhenAll(tasks).WithCancellation(cancellationToken);
}
}
I came across a similar issue, but it was related to an exception happening in the timer code, but there is no exception handling around WaitForAllTimersToFinish. This lack of exception handling leaves the grain a in inconsistent state where it can never be activated again.
Fixed by #8951