Exceptions during graceful shutdown
I am getting exceptions after trying to gracefully shut down one of two silos. Sometimes the silo cannot shut down at all and just hangs.
On the silo that is being shut down:
warn Orleans.Runtime.GrainDirectory.LocalGrainDirectory
RegisterAsync - It seems we are not the owner of activation S127.0.0.1:11111:304870891*grn/PublicationContent/000055ed@f9fbaa9b, trying to forward it to S127.0.0.1:11111:304870891 (hopCount=1)
warn Orleans.Runtime.GrainDirectory.LocalGrainDirectory
RegisterAsync - It seems we are not the owner of activation S127.0.0.1:11111:304870891*grn/LinkToPublication/00000000+http://***/Center/news?id=1027267@d565b0a1, trying to forward it to S127.0.0.1:11111:304870891 (hopCount=1)
warn Orleans.Runtime.GrainDirectory.LocalGrainDirectory
RegisterAsync - It seems we are not the owner of activation S127.0.0.1:11111:304870891*grn/PublicationContent/00005253@48188ba0, trying to forward it to S127.0.0.1:11111:304870891 (hopCount=1)
fail Orleans.Runtime.Dispatcher
SelectTarget failed with Current directory at S127.0.0.1:11111:304870891 is not stable to perform the lookup for grainId *grn/SmiGrain/000002c5 (it maps to S127.0.0.1:11112:304870915, which is not a valid silo). Retry later.
ExceptionType Orleans.Runtime.OrleansException
ExceptionMessage Current directory at S127.0.0.1:11111:304870891 is not stable to perform the lookup for grainId *grn/721BE62B/000002c5 (it maps to S127.0.0.1:11112:304870915, which is not a valid silo). Retry later.
fail Orleans.Runtime.Catalog
Failed to RegisterActivationInGrainDirectory for [Activation: S127.0.0.1:11112:304870915*grn/PublicationContent/00005254@6f2869a6 #GrainType=OrleansTesting.Grains.Publications.PublicationContent Placement=RandomPlacement State=Invalid].
ExceptionType System.ArgumentNullException
ExceptionMessage Value cannot be null.
Parameter name: existingActivationAddress
ExceptionSource Orleans.Runtime
ExceptionStackTrace at Orleans.Runtime.Catalog.RegisterActivationInGrainDirectoryAndValidate(ActivationData activation) in D:\build\agent\_work\12\s\src\Orleans.Runtime\Catalog\Catalog.cs:line 0
at Orleans.Runtime.Catalog.InitActivation(ActivationData activation, String grainType, String genericArguments, Dictionary`2 requestContextData) in D:\build\agent\_work\12\s\src\Orleans.Runtime\Catalog\Catalog.cs:line 546
fail Orleans.Runtime.HostedClient
RunClientMessagePump has thrown exception
ExceptionType System.OperationCanceledException
ExceptionMessage The operation was canceled.
ExceptionSource System.Collections.Concurrent
ExceptionStackTrace at System.Collections.Concurrent.BlockingCollection`1.TryTakeWithNoTimeValidation(T& item, Int32 millisecondsTimeout, CancellationToken cancellationToken, CancellationTokenSource combinedTokenSource)
at System.Collections.Concurrent.BlockingCollection`1.TryTake(T& item, Int32 millisecondsTimeout, CancellationToken cancellationToken)
at System.Collections.Concurrent.BlockingCollection`1.Take(CancellationToken cancellationToken)
at Orleans.Runtime.HostedClient.RunClientMessagePump() in D:\build\agent\_work\12\s\src\Orleans.Runtime\Core\HostedClient.cs:line 0
fail Orleans.Runtime.Catalog
UnregisterManyAsync 84 failed.
ExceptionType System.InvalidOperationException
ExceptionMessage Grain directory is stopping
ExceptionSource Orleans.Runtime
ExceptionStackTrace at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.CheckIfShouldForward(GrainId grainId, Int32 hopCount, String operationDescription) in D:\build\agent\_work\12\s\src\Orleans.Runtime\GrainDirectory\LocalGrainDirectory.cs:line 563
at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.UnregisterOrPutInForwardList(IEnumerable`1 addresses, UnregistrationCause cause, Int32 hopCount, Dictionary`2& forward, List`1 tasks, String context) in D:\build\agent\_work\12\s\src\Orleans.Runtime\GrainDirectory\LocalGrainDirectory.cs:line 726
at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.UnregisterManyAsync(List`1 addresses, UnregistrationCause cause, Int32 hopCount) in D:\build\agent\_work\12\s\src\Orleans.Runtime\GrainDirectory\LocalGrainDirectory.cs:line 773
at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute() in D:\build\agent\_work\12\s\src\Orleans.Runtime\Scheduler\ClosureWorkItem.cs:line 63
at Orleans.Runtime.Catalog.FinishDestroyActivations(List`1 list, Int32 number, MultiTaskCompletionSource tcs) in D:\build\agent\_work\12\s\src\Orleans.Runtime\Catalog\Catalog.cs:line 995
ExceptionEntryAssembly OrleansTesting.Silo, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null
On the client / live silo:
fail
ExceptionType Orleans.Runtime.OrleansException
ExceptionMessage Current directory at S127.0.0.1:11111:304870891 is not stable to perform the lookup for grainId *grn/CBBF4FF4/00000000+baltija.eu (it maps to S127.0.0.1:11112:304870915, which is not a valid silo). Retry later.
ExceptionSource Orleans.Runtime
fail
ExceptionType Orleans.Runtime.OrleansException
ExceptionMessage Current directory at S127.0.0.1:11111:304870891 is not stable to perform the lookup for grainId *grn/7B5BF3AD/00005223 (it maps to S127.0.0.1:11112:304870915, which is not a valid silo). Retry later.
ExceptionSource Orleans.Runtime
fail
ExceptionType Orleans.Runtime.OrleansException
ExceptionMessage Current directory at S127.0.0.1:11111:304870891 is not stable to perform the lookup for grainId *grn/721BE62B/000002c5 (it maps to S127.0.0.1:11112:304870915, which is not a valid silo). Retry later.
ExceptionSource Orleans.Runtime
fail Orleans.Runtime.Dispatcher
SelectTarget failed with Current directory at S127.0.0.1:11111:304875582 is not stable to perform the lookup for grainId *grn/PublicationContent/00005be7 (it maps to S127.0.0.1:11112:304875663, which is not a valid silo). Retry later.
ExceptionType Orleans.Runtime.OrleansException
ExceptionMessage Current directory at S127.0.0.1:11111:304875582 is not stable to perform the lookup for grainId *grn/E1CB458F/00005be7 (it maps to S127.0.0.1:11112:304875663, which is not a valid silo). Retry later.
ExceptionSource Orleans.Runtime
ExceptionStackTrace at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount) in D:\build\agent\_work\12\s\src\Orleans.Runtime\GrainDirectory\LocalGrainDirectory.cs:line 928
at Orleans.Runtime.Scheduler.AsyncClosureWorkItem`1.Execute() in D:\build\agent\_work\12\s\src\Orleans.Runtime\Scheduler\ClosureWorkItem.cs:line 94
at Orleans.Runtime.Placement.RandomPlacementDirector.OnSelectActivation(PlacementStrategy strategy, GrainId target, IPlacementRuntime context) in D:\build\agent\_work\12\s\src\Orleans.Runtime\Placement\RandomPlacementDirector.cs:line 15
at Orleans.Runtime.Placement.PlacementDirectorsManager.SelectOrAddActivation(ActivationAddress sendingAddress, PlacementTarget targetGrain, IPlacementRuntime context, PlacementStrategy strategy) in D:\build\agent\_work\12\s\src\Orleans.Runtime\Placement\PlacementDirectorsManager.cs:line 97
at Orleans.Runtime.Dispatcher.AddressMessageAsync(Message message, PlacementTarget target, PlacementStrategy strategy, ActivationAddress targetAddress) in D:\build\agent\_work\12\s\src\Orleans.Runtime\Core\Dispatcher.cs:line 788
at Orleans.Runtime.Dispatcher.<>c__DisplayClass37_0.<<AsyncSendMessage>b__1>d.MoveNext() in D:\build\agent\_work\12\s\src\Orleans.Runtime\Core\Dispatcher.cs:line 704
I wrote a unit test to illustrate the unexpected behavior. The test is non-deterministic, so it may need to be run several times before it fails.
Tester\Forwarding\ShutdownSiloTests.cs master\v2.4.2
    [SkippableFact, TestCategory("GracefulShutdown"), TestCategory("Functional")]
    public async Task SiloGracefulShutdown_NoExceptionsOnClient()
    {
        var queriesProcessed = 0;
        var exceptions = new ConcurrentQueue<Exception>();
        const int maxDegreeOfParallelism = 1000;
        const int delayBeforeStoppingSilo = 1500; // milliseconds

        // Keeps up to maxDegreeOfParallelism grain calls in flight, indefinitely.
        async Task CreateTrafficFromClient()
        {
            async Task QuerySomethingFromGrain(long id)
            {
                try
                {
                    await HostedCluster.Client.GetGrain<ISimpleGrain>(id).GetA();
                    Interlocked.Increment(ref queriesProcessed);
                }
                catch (Exception exception)
                {
                    // Record every client-side failure so the test can report it.
                    exceptions.Enqueue(exception);
                }
            }

            using (var semaphore = new SemaphoreSlim(maxDegreeOfParallelism))
            {
                for (var id = 0; ; id++)
                {
                    await semaphore.WaitAsync();
                    // The loop header already increments id, so it is not incremented again here.
                    QuerySomethingFromGrain(id % maxDegreeOfParallelism)
                        .ContinueWith(t => semaphore.Release())
                        .Ignore();
                }
            }
        }

        _ = Task.Run(() => CreateTrafficFromClient());
        await Task.Delay(delayBeforeStoppingSilo);

        var secondarySilo = HostedCluster.SecondarySilos.First();
        await secondarySilo.StopSiloAsync(stopGracefully: true);

        Assert.True(queriesProcessed > 0);
        var noExceptions = exceptions.IsEmpty;
        // Dump every collected exception before asserting, so failures are visible in the test output.
        while (exceptions.TryDequeue(out var exception))
        {
            _testOutputHelper.WriteLine(exception.ToString());
        }
        Assert.True(noExceptions);
    }
The exceptions differ from run to run:
Orleans.Runtime.OrleansException: Current directory at S127.0.0.1:22760:305109219 is not stable to perform the lookup for grainId *grn/901FCCD4/00000050 (it maps to S127.0.0.1:22761:305109221, which is not a valid silo). Retry later.
Orleans.Runtime.OrleansMessageRejectionException: Forwarding failed: tried to forward message NewPlacement Request S127.0.0.1:24881:305109182cli/74ee6493@30a473b3->S127.0.0.1:24880:305109181grn/901FCCD4/00000394@cc517fb8 #26742[ForwardCount=2]: for 2 times after Duplicate activation to invalid activation. Rejecting now.
Orleans.Runtime.OrleansMessageRejectionException: Exception sending message to S127.0.0.1:47751:0. Message: Request cli/9086e19d@dde693a4->S127.0.0.1:47751:0grn/901FCCD4/0000017c #42965: . System.Net.Sockets.SocketException (0x80004005): An existing connection was forcibly closed by the remote host
These are the same errors I keep getting. @sergeybykov Do you have any update on this?
What version of Orleans are you using?
I am using Orleans 3.3
I've seen the same issues, on Orleans 3.4.1
We've moved this issue to the Backlog. This means that it is not going to be worked on for the coming release. We review items in the backlog at the end of each milestone/release and depending on the team's priority we may reconsider this issue for the following milestone.
I'm seeing these errors too when gracefully shutting down the host. I'm using Orleans 3.6.5.
Any updates on this issue?
I'm also having the same issue on Orleans 3.6.2 hosting on Kubernetes.