Ungraceful shutdown leading to indefinite startup failures
Observed environment:
- Orleans 3.0.2
- DynamoDB Clustering
- Single silo
- Containerized
Scenario:
- Silo (A) fails in an ungraceful manner (membership entry still marked active)
- New silo starts up
- New silo fails due to https://github.com/dotnet/orleans/issues/4664
- New silos continue starting and failing until the 10-minute timeout expires
- New silo (B) starts and successfully marks itself as active in the membership table
- Silo B continues startup and executes configured startup tasks (default stage)
- One of the startup tasks makes a grain call
- Grain call tries to activate the grain on the dead silo (A)
- Grain activation fails and throws an exception
- The startup task doesn't catch that exception and it bubbles up as an unhandled exception
- Silo (B) process is terminated due to the exception
- Since the shutdown is ungraceful, the loop repeats until membership entries are manually cleaned up
Specific issues:
- The initial failures are by design to account for temporary network outages. I can tweak some configuration options to adjust that window.
- Grain trying to activate on a dead silo. I'm wondering if this is related to https://github.com/dotnet/orleans/issues/5536. Ideally the membership and grain directory could better coordinate in this scenario.
- Startup task failure leading to ungraceful shutdown. I can wrap the startup tasks, but perhaps this scenario could be handled by a more graceful shutdown.
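For the first point, the kind of tweak I have in mind looks roughly like the sketch below. It assumes the 10-minute window is the product of `ClusterMembershipOptions.IAmAliveTablePublishTimeout` and `NumMissedTableIAmAliveLimit` (5 minutes and 2 by default, as far as I can tell); I have not confirmed that these are the only settings involved. The `MembershipTuning` class and `ShortenStaleEntryWindow` method names are made up for illustration, and the values are examples rather than recommendations.

```csharp
using System;
using Orleans.Configuration;
using Orleans.Hosting;

// Hypothetical helper, not part of Orleans.
public static class MembershipTuning
{
    // Assumption: a stale "active" entry stops blocking startup once its
    // IAmAlive timestamp is older than
    // IAmAliveTablePublishTimeout * NumMissedTableIAmAliveLimit.
    public static ISiloBuilder ShortenStaleEntryWindow(this ISiloBuilder siloBuilder)
    {
        return siloBuilder.Configure<ClusterMembershipOptions>(options =>
        {
            // Publish the IAmAlive heartbeat to the membership table more often.
            options.IAmAliveTablePublishTimeout = TimeSpan.FromMinutes(1);

            // Missed heartbeats tolerated before an entry is treated as stale.
            options.NumMissedTableIAmAliveLimit = 2;
        });
    }
}
```

With these example values the window would shrink to about 2 minutes, at the cost of more frequent writes to the membership table.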
Specific exception trace:
Unhandled exception. Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S127.0.0.1:22254:417104037. See InnerException
---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 127.0.0.1:22254. Error: ConnectionRefused
at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken)
at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken)
at Orleans.Internal.OrleansTaskExtentions.MakeCancellable[T](Task`1 task, CancellationToken cancellationToken)
at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
--- End of inner exception stack trace ---
at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint)
at Orleans.Runtime.Messaging.OutboundMessageQueue.<SendMessage>g__SendAsync|9_0(ValueTask`1 c, Message m)
at Orleans.Runtime.OutgoingCallInvoker.Invoke()
at Orleans.Runtime.OutgoingCallInvoker.Invoke()
at Orleans.Runtime.GrainReferenceRuntime.InvokeWithFilters(GrainReference reference, InvokeMethodRequest request, String debugContext, InvokeMethodOptions options)
at Orleans.Internal.OrleansTaskExtentions.<ToTypedTask>g__ConvertAsync|4_0[T](Task`1 asyncTask)
at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount)
at Orleans.Runtime.Scheduler.AsyncClosureWorkItem`1.Execute()
at Orleans.Runtime.Placement.RandomPlacementDirector.OnSelectActivation(PlacementStrategy strategy, GrainId target, IPlacementRuntime context)
at Orleans.Runtime.Placement.PlacementDirectorsManager.SelectOrAddActivation(ActivationAddress sendingAddress, PlacementTarget targetGrain, IPlacementRuntime context, PlacementStrategy strategy)
at Orleans.Runtime.Dispatcher.AddressMessageAsync(Message message, PlacementTarget target, PlacementStrategy strategy, ActivationAddress targetAddress)
at Orleans.Runtime.Dispatcher.<>c__DisplayClass36_0.<<AsyncSendMessage>g__TransportMessageAferSending|0>d.MoveNext()
--- End of stack trace from previous location ---
at Orleans.Runtime.OutgoingCallInvoker.Invoke()
at Orleans.Runtime.OutgoingCallInvoker.Invoke()
at Orleans.Runtime.GrainReferenceRuntime.InvokeWithFilters(GrainReference reference, InvokeMethodRequest request, String debugContext, InvokeMethodOptions options)
at Orleans.Internal.OrleansTaskExtentions.<ToTypedTask>g__ConvertAsync|4_0[T](Task`1 asyncTask)
at PerBlue.Common.GameServer.Grains.Configuration.RuntimeConfigurationService.RequestInitialConfiguration()
at Orleans.Runtime.SiloLifecycleSubject.MonitoredObserver.OnStart(CancellationToken ct)
at Orleans.LifecycleSubject.<OnStart>g__CallOnStart|7_0(Int32 stage, OrderedObserver observer, CancellationToken cancellationToken)
at Orleans.LifecycleSubject.OnStart(CancellationToken ct)
at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute()
at Orleans.Runtime.Silo.StartAsync(CancellationToken cancellationToken)
at Orleans.Hosting.SiloWrapper.StartAsync(CancellationToken cancellationToken)
at Orleans.Hosting.SiloHostedService.StartAsync(CancellationToken cancellationToken)
at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
at ...
I attempted just doing a graceful shutdown on any startup task failure. This does result in skipping the 10 minute wait cycle, but does not actually resolve the issue.
Looking into it a bit further, I can see that the RandomPlacementDirector calls into the SiloStatusOracle to get active silos, which simply looks at the membership status: https://github.com/dotnet/orleans/blob/ec31259418fcc574d575bbb70427719d18cc522d/src/Orleans.Runtime/MembershipService/SiloStatusOracle.cs#L74
The bad silo is also unable to be voted dead as the new silo shuts down before it gets a chance to run silo probes.
There were quite a few changes between 3.0.2 and 3.6.5, one of which may have rectified this issue. Is there something preventing an upgrade? I recommend upgrading before diving too deeply into this.
I attempted a quick upgrade, but ran into some dependency conflicts. I also see that there are some breaking changes.
I'm hoping that we'll be able to start an upgrade to 7.x in the near future, but if that fails to materialize, I'll revisit the 3.6.5 update.
@ReubenBond I tested this again after upgrading to Orleans 7 and the behavior is the same.
I currently have a workaround of wrapping startup tasks and keeping track of their failure status. If any failed, I use IClusterMembershipService.TryKill to clean up bad entries so that the next startup is successful.
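A minimal sketch of the wrapping part of that workaround is below. It is not my exact code: `StartupTaskTracker` and its members are hypothetical names, and the Orleans wiring (registering the wrapped delegate as a startup task and calling `IClusterMembershipService.TryKill` for the stale entries when `AnyFailed` is true) is deliberately left out so the snippet compiles on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: runs startup work and records failures instead of
// letting exceptions escape into the silo lifecycle.
public sealed class StartupTaskTracker
{
    private readonly List<Exception> _failures = new List<Exception>();
    private readonly object _lock = new object();

    // True once at least one wrapped task has thrown.
    public bool AnyFailed
    {
        get { lock (_lock) return _failures.Count > 0; }
    }

    // Snapshot of the captured exceptions, for logging.
    public IReadOnlyList<Exception> Failures
    {
        get { lock (_lock) return _failures.ToArray(); }
    }

    // Wraps a startup delegate so that a thrown exception is captured
    // rather than propagated. Cancellation requested by the host is rethrown.
    public Func<CancellationToken, Task> Wrap(Func<CancellationToken, Task> startupTask)
    {
        return async ct =>
        {
            try
            {
                await startupTask(ct);
            }
            catch (Exception ex) when (!(ex is OperationCanceledException && ct.IsCancellationRequested))
            {
                lock (_lock) _failures.Add(ex);
            }
        };
    }
}
```

After the host has started, the application checks `AnyFailed`; if it is set, it cleans up the bad membership entries and then stops the host gracefully rather than letting the process crash.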
Startup task failure leading to ungraceful shutdown. I can wrap the startup tasks, but perhaps this scenario could be handled by a more graceful shutdown.
I think we should implement this. We shouldn't be ungracefully shutting down the silo just because application code failed.
There also appears to be an issue with the grain directory not respecting the "I am alive" timeout that the membership system uses (not explicitly application code failure). Here's an updated trace of that:
Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S127.0.0.1:22253:40146827. See InnerException
---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 127.0.0.1:22253. Error: ConnectionRefused
at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 54
at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 61
at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 193
--- End of inner exception stack trace ---
at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 221
at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 81
at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 117
at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 51
at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 88
at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 74
at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount) in /_/src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:line 727
at Orleans.Runtime.GrainDirectory.DhtGrainLocator.Lookup(GrainId grainId) in /_/src/Orleans.Runtime/GrainDirectory/DhtGrainLocator.cs:line 30
at Orleans.Runtime.Placement.PlacementService.PlacementWorker.GetOrPlaceActivationAsync(Message firstMessage) in /_/src/Orleans.Runtime/Placement/PlacementService.cs:line 338
at Orleans.Runtime.Messaging.MessageCenter.<AddressAndSendMessage>g__SendMessageAsync|40_0(Task addressMessageTask, Message m) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 439
at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 81
at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 117
at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 51
at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 88
at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 74
This looks very similar to what we're experiencing with an app on v3.7.1. The difference is that we're using SQL Server for everything. I also see the grain directory frames above in some stack traces, though that is only one of several variations.
Any progress on this? I also see such exceptions. I use Kubernetes clustering with Orleans 8.
@hankovich does your application ever start? Are you able to provide more detail, potentially including logs?