Exceptions during graceful shutdown

Open krasin-ga opened this issue 6 years ago • 6 comments

I am getting exceptions after trying to gracefully shut down one of two silos. Sometimes the silo can't shut down at all and just hangs.

On the silo that is being shut down:

warn Orleans.Runtime.GrainDirectory.LocalGrainDirectory
     RegisterAsync - It seems we are not the owner of activation S127.0.0.1:11111:304870891*grn/PublicationContent/000055ed@f9fbaa9b, trying to forward it to S127.0.0.1:11111:304870891 (hopCount=1)

warn Orleans.Runtime.GrainDirectory.LocalGrainDirectory
     RegisterAsync - It seems we are not the owner of activation S127.0.0.1:11111:304870891*grn/LinkToPublication/00000000+http://***/Center/news?id=1027267@d565b0a1, trying to forward it to S127.0.0.1:11111:304870891 (hopCount=1)

warn Orleans.Runtime.GrainDirectory.LocalGrainDirectory
     RegisterAsync - It seems we are not the owner of activation S127.0.0.1:11111:304870891*grn/PublicationContent/00005253@48188ba0, trying to forward it to S127.0.0.1:11111:304870891 (hopCount=1)

fail Orleans.Runtime.Dispatcher
     SelectTarget failed with Current directory at S127.0.0.1:11111:304870891 is not stable to perform the lookup for grainId *grn/SmiGrain/000002c5 (it maps to S127.0.0.1:11112:304870915, which is not a valid silo). Retry later.
       ExceptionType Orleans.Runtime.OrleansException
       ExceptionMessage Current directory at S127.0.0.1:11111:304870891 is not stable to perform the lookup for grainId *grn/721BE62B/000002c5 (it maps to S127.0.0.1:11112:304870915, which is not a valid silo). Retry later.

fail Orleans.Runtime.Catalog
     Failed to RegisterActivationInGrainDirectory for [Activation: S127.0.0.1:11112:304870915*grn/PublicationContent/00005254@6f2869a6 #GrainType=OrleansTesting.Grains.Publications.PublicationContent Placement=RandomPlacement State=Invalid].
       ExceptionType System.ArgumentNullException
       ExceptionMessage Value cannot be null.
Parameter name: existingActivationAddress
       ExceptionSource Orleans.Runtime
       ExceptionStackTrace    at Orleans.Runtime.Catalog.RegisterActivationInGrainDirectoryAndValidate(ActivationData activation) in D:\build\agent\_work\12\s\src\Orleans.Runtime\Catalog\Catalog.cs:line 0
   at Orleans.Runtime.Catalog.InitActivation(ActivationData activation, String grainType, String genericArguments, Dictionary`2 requestContextData) in D:\build\agent\_work\12\s\src\Orleans.Runtime\Catalog\Catalog.cs:line 546
   
fail Orleans.Runtime.HostedClient
     RunClientMessagePump has thrown exception
       ExceptionType System.OperationCanceledException
       ExceptionMessage The operation was canceled.
       ExceptionSource System.Collections.Concurrent
       ExceptionStackTrace    at System.Collections.Concurrent.BlockingCollection`1.TryTakeWithNoTimeValidation(T& item, Int32 millisecondsTimeout, CancellationToken cancellationToken, CancellationTokenSource combinedTokenSource)
   at System.Collections.Concurrent.BlockingCollection`1.TryTake(T& item, Int32 millisecondsTimeout, CancellationToken cancellationToken)
   at System.Collections.Concurrent.BlockingCollection`1.Take(CancellationToken cancellationToken)
   at Orleans.Runtime.HostedClient.RunClientMessagePump() in D:\build\agent\_work\12\s\src\Orleans.Runtime\Core\HostedClient.cs:line 0
   
fail Orleans.Runtime.Catalog
     UnregisterManyAsync 84 failed.
       ExceptionType System.InvalidOperationException
       ExceptionMessage Grain directory is stopping
       ExceptionSource Orleans.Runtime
       ExceptionStackTrace    at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.CheckIfShouldForward(GrainId grainId, Int32 hopCount, String operationDescription) in D:\build\agent\_work\12\s\src\Orleans.Runtime\GrainDirectory\LocalGrainDirectory.cs:line 563
   at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.UnregisterOrPutInForwardList(IEnumerable`1 addresses, UnregistrationCause cause, Int32 hopCount, Dictionary`2& forward, List`1 tasks, String context) in D:\build\agent\_work\12\s\src\Orleans.Runtime\GrainDirectory\LocalGrainDirectory.cs:line 726
   at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.UnregisterManyAsync(List`1 addresses, UnregistrationCause cause, Int32 hopCount) in D:\build\agent\_work\12\s\src\Orleans.Runtime\GrainDirectory\LocalGrainDirectory.cs:line 773
   at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute() in D:\build\agent\_work\12\s\src\Orleans.Runtime\Scheduler\ClosureWorkItem.cs:line 63
   at Orleans.Runtime.Catalog.FinishDestroyActivations(List`1 list, Int32 number, MultiTaskCompletionSource tcs) in D:\build\agent\_work\12\s\src\Orleans.Runtime\Catalog\Catalog.cs:line 995
       ExceptionEntryAssembly OrleansTesting.Silo, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null

On the client / live silo:

fail 
       ExceptionType Orleans.Runtime.OrleansException
       ExceptionMessage Current directory at S127.0.0.1:11111:304870891 is not stable to perform the lookup for grainId *grn/CBBF4FF4/00000000+baltija.eu (it maps to S127.0.0.1:11112:304870915, which is not a valid silo). Retry later.
       ExceptionSource Orleans.Runtime

fail 
       ExceptionType Orleans.Runtime.OrleansException
       ExceptionMessage Current directory at S127.0.0.1:11111:304870891 is not stable to perform the lookup for grainId *grn/7B5BF3AD/00005223 (it maps to S127.0.0.1:11112:304870915, which is not a valid silo). Retry later.
       ExceptionSource Orleans.Runtime

fail 
       ExceptionType Orleans.Runtime.OrleansException
       ExceptionMessage Current directory at S127.0.0.1:11111:304870891 is not stable to perform the lookup for grainId *grn/721BE62B/000002c5 (it maps to S127.0.0.1:11112:304870915, which is not a valid silo). Retry later.
       ExceptionSource Orleans.Runtime
	   
fail Orleans.Runtime.Dispatcher
     SelectTarget failed with Current directory at S127.0.0.1:11111:304875582 is not stable to perform the lookup for grainId *grn/PublicationContent/00005be7 (it maps to S127.0.0.1:11112:304875663, which is not a valid silo). Retry later.
       ExceptionType Orleans.Runtime.OrleansException
       ExceptionMessage Current directory at S127.0.0.1:11111:304875582 is not stable to perform the lookup for grainId *grn/E1CB458F/00005be7 (it maps to S127.0.0.1:11112:304875663, which is not a valid silo). Retry later.
       ExceptionSource Orleans.Runtime
       ExceptionStackTrace    at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount) in D:\build\agent\_work\12\s\src\Orleans.Runtime\GrainDirectory\LocalGrainDirectory.cs:line 928
   at Orleans.Runtime.Scheduler.AsyncClosureWorkItem`1.Execute() in D:\build\agent\_work\12\s\src\Orleans.Runtime\Scheduler\ClosureWorkItem.cs:line 94
   at Orleans.Runtime.Placement.RandomPlacementDirector.OnSelectActivation(PlacementStrategy strategy, GrainId target, IPlacementRuntime context) in D:\build\agent\_work\12\s\src\Orleans.Runtime\Placement\RandomPlacementDirector.cs:line 15
   at Orleans.Runtime.Placement.PlacementDirectorsManager.SelectOrAddActivation(ActivationAddress sendingAddress, PlacementTarget targetGrain, IPlacementRuntime context, PlacementStrategy strategy) in D:\build\agent\_work\12\s\src\Orleans.Runtime\Placement\PlacementDirectorsManager.cs:line 97
   at Orleans.Runtime.Dispatcher.AddressMessageAsync(Message message, PlacementTarget target, PlacementStrategy strategy, ActivationAddress targetAddress) in D:\build\agent\_work\12\s\src\Orleans.Runtime\Core\Dispatcher.cs:line 788
   at Orleans.Runtime.Dispatcher.<>c__DisplayClass37_0.<<AsyncSendMessage>b__1>d.MoveNext() in D:\build\agent\_work\12\s\src\Orleans.Runtime\Core\Dispatcher.cs:line 704

krasin-ga avatar Aug 30 '19 16:08 krasin-ga

I wrote a unit test to illustrate the unexpected behavior. The test is non-deterministic, so it may need to be run several times before it fails.

Tester\Forwarding\ShutdownSiloTests.cs master\v2.4.2

[SkippableFact, TestCategory("GracefulShutdown"), TestCategory("Functional")]
public async Task SiloGracefulShutdown_NoExceptionsOnClient()
{
    var queriesProcessed = 0;
    var exceptions = new ConcurrentQueue<Exception>();

    const int maxDegreeOfParallelism = 1000;
    const int delayBeforeStoppingSilo = 1500;

    async Task CreateTrafficFromClient()
    {
        async Task QuerySomethingFromGrain(long id)
        {
            try
            {
                await HostedCluster.Client.GetGrain<ISimpleGrain>(id).GetA();
                Interlocked.Increment(ref queriesProcessed);
            }
            catch (Exception exception)
            {
                exceptions.Enqueue(exception);
            }
        }

        using (var semaphore = new SemaphoreSlim(maxDegreeOfParallelism))
            for (var id = 0; ; id++)
            {
                await semaphore.WaitAsync();
                QuerySomethingFromGrain(id % maxDegreeOfParallelism)
                    .ContinueWith(t => semaphore.Release())
                    .Ignore();
            }
    }

    _ = Task.Run(() => CreateTrafficFromClient());
    await Task.Delay(delayBeforeStoppingSilo);

    var secondarySilo = HostedCluster.SecondarySilos.First();
    await secondarySilo.StopSiloAsync(stopGracefully: true);

    Assert.True(queriesProcessed > 0);
    var noExceptions = exceptions.IsEmpty;

    while (!exceptions.IsEmpty)
    {
        exceptions.TryDequeue(out var exception);
        _testOutputHelper.WriteLine(exception.ToString());
    }

    Assert.True(noExceptions);
}

The exceptions vary from run to run:

Orleans.Runtime.OrleansException: Current directory at S127.0.0.1:22760:305109219 is not stable to perform the lookup for grainId *grn/901FCCD4/00000050 (it maps to S127.0.0.1:22761:305109221, which is not a valid silo). Retry later.

Orleans.Runtime.OrleansMessageRejectionException: Forwarding failed: tried to forward message NewPlacement Request S127.0.0.1:24881:305109182cli/74ee6493@30a473b3->S127.0.0.1:24880:305109181grn/901FCCD4/00000394@cc517fb8 #26742[ForwardCount=2]: for 2 times after Duplicate activation to invalid activation. Rejecting now.

Orleans.Runtime.OrleansMessageRejectionException: Exception sending message to S127.0.0.1:47751:0. Message: Request cli/9086e19d@dde693a4->S127.0.0.1:47751:0grn/901FCCD4/0000017c #42965: . System.Net.Sockets.SocketException (0x80004005): An existing connection was forcibly closed by the remote host
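Several of these messages end with "Retry later", so one possible client-side mitigation (not a fix for the underlying shutdown behavior) is to retry calls that fail with a transient exception. The following is a minimal sketch under stated assumptions: `TransientRetry`, `RetryAsync`, and their parameters are hypothetical names and not part of Orleans, and treating these exceptions as safe to retry is an assumption that only holds for idempotent grain calls.

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical helper (not an Orleans API): retries an async operation
// while the caller-supplied predicate classifies the failure as transient.
public static class TransientRetry
{
    public static async Task<T> RetryAsync<T>(
        Func<Task<T>> operation,
        Func<Exception, bool> isTransient,
        int maxAttempts = 3,
        TimeSpan? initialDelay = null)
    {
        if (maxAttempts < 1)
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        var delay = initialDelay ?? TimeSpan.FromMilliseconds(100);

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await operation();
            }
            // The filter lets the last attempt's exception, and any
            // non-transient exception, propagate to the caller unchanged.
            catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
            {
                await Task.Delay(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2); // simple exponential backoff
            }
        }
    }
}

// Possible usage with the grain call from the test above (assumes the
// call is idempotent and that OrleansException is retryable here):
//
//   var a = await TransientRetry.RetryAsync(
//       () => HostedCluster.Client.GetGrain<ISimpleGrain>(id).GetA(),
//       ex => ex is Orleans.Runtime.OrleansException);
```

A wrapper like this would only hide the exceptions from callers; the silo would still log them during shutdown.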

krasin-ga avatar Sep 02 '19 08:09 krasin-ga

These are the same errors I keep getting. @sergeybykov Do you have any update on this?

Jain-Nidhi avatar Jun 08 '21 02:06 Jain-Nidhi

What version of Orleans are you using?

benjaminpetit avatar Jun 08 '21 17:06 benjaminpetit

I am using Orleans 3.3

Jain-Nidhi avatar Jun 09 '21 15:06 Jain-Nidhi

I've seen the same issues, on Orleans 3.4.1

emilekberg avatar Sep 07 '21 11:09 emilekberg

We've moved this issue to the Backlog. This means that it is not going to be worked on for the coming release. We review items in the backlog at the end of each milestone/release and depending on the team's priority we may reconsider this issue for the following milestone.

ghost avatar Jul 28 '22 23:07 ghost

I'm seeing these errors too when gracefully shutting down the host. I'm using Orleans 3.6.5.

Any updates on this issue?

bill-poole avatar Mar 03 '23 05:03 bill-poole

I'm also having the same issue on Orleans 3.6.2, hosted on Kubernetes.

rpatel372 avatar Mar 03 '23 20:03 rpatel372