
Ungraceful shutdown leading to indefinite startup failures

Open jsteinich opened this issue 2 years ago • 9 comments

Observed environment:

  • Orleans 3.0.2
  • DynamoDB Clustering
  • Single silo
  • Containerized

Scenario:

  • Silo (A) fails in an ungraceful manner (membership entry still marked active)
  • New silo starts up
  • New silo fails due to https://github.com/dotnet/orleans/issues/4664
  • New silos continue starting and failing until the 10-minute timeout expires
  • New silo (B) starts and successfully marks itself as active in the membership table
  • Silo B continues startup and executes configured startup tasks (default stage)
  • One of the startup tasks makes a grain call
  • Grain call tries to activate the grain on the dead silo (A)
  • Grain activation fails and throws an exception
  • The startup task doesn't catch that exception and it bubbles up as an unhandled exception
  • Silo (B) process is terminated due to the exception
  • Since the shutdown is ungraceful, the loop repeats until membership entries are manually cleaned up

Specific issues:

  • The initial failures are by design to account for temporary network outages. I can tweak some configuration options to adjust that window.
  • Grain trying to activate on a dead silo. I'm wondering if this is related to https://github.com/dotnet/orleans/issues/5536. Ideally the membership and grain directory could better coordinate in this scenario.
  • Startup task failure leading to ungraceful shutdown. I can wrap the startup tasks, but perhaps this scenario could be handled by a more graceful shutdown.
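
For the first point, a minimal sketch of where that window could be adjusted. It assumes the 10 minute wait comes from `ClusterMembershipOptions.IAmAliveTablePublishTimeout` multiplied by `NumMissedTableIAmAliveLimit`, which is worth verifying against the Orleans version in use. The class name, method name, and values are hypothetical.

```csharp
using System;
using Microsoft.Extensions.DependencyInjection;
using Orleans.Configuration;

// Hypothetical helper, not part of Orleans.
public static class MembershipTuning
{
    // Shortens how long a stale "active" membership entry is assumed to block
    // new silos from joining. The values below are illustrative only.
    public static IServiceCollection ShortenStaleEntryWindow(this IServiceCollection services)
    {
        return services.Configure<ClusterMembershipOptions>(options =>
        {
            // How often a silo is expected to refresh its "I am alive" timestamp.
            options.IAmAliveTablePublishTimeout = TimeSpan.FromMinutes(1);

            // How many missed refreshes are tolerated before the entry is ignored.
            options.NumMissedTableIAmAliveLimit = 2;
        });
    }
}
```

A shorter window trades away some of the tolerance for temporary network outages that the default is designed to provide.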

Specific exception trace:

Unhandled exception. Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S127.0.0.1:22254:417104037. See InnerException
 ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 127.0.0.1:22254. Error: ConnectionRefused
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken)
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken)
   at Orleans.Internal.OrleansTaskExtentions.MakeCancellable[T](Task`1 task, CancellationToken cancellationToken)
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address)
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint)
   at Orleans.Runtime.Messaging.OutboundMessageQueue.<SendMessage>g__SendAsync|9_0(ValueTask`1 c, Message m)
   at Orleans.Runtime.OutgoingCallInvoker.Invoke()
   at Orleans.Runtime.OutgoingCallInvoker.Invoke()
   at Orleans.Runtime.GrainReferenceRuntime.InvokeWithFilters(GrainReference reference, InvokeMethodRequest request, String debugContext, InvokeMethodOptions options)
   at Orleans.Internal.OrleansTaskExtentions.<ToTypedTask>g__ConvertAsync|4_0[T](Task`1 asyncTask)
   at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount)
   at Orleans.Runtime.Scheduler.AsyncClosureWorkItem`1.Execute()
   at Orleans.Runtime.Placement.RandomPlacementDirector.OnSelectActivation(PlacementStrategy strategy, GrainId target, IPlacementRuntime context)
   at Orleans.Runtime.Placement.PlacementDirectorsManager.SelectOrAddActivation(ActivationAddress sendingAddress, PlacementTarget targetGrain, IPlacementRuntime context, PlacementStrategy strategy)
   at Orleans.Runtime.Dispatcher.AddressMessageAsync(Message message, PlacementTarget target, PlacementStrategy strategy, ActivationAddress targetAddress)
   at Orleans.Runtime.Dispatcher.<>c__DisplayClass36_0.<<AsyncSendMessage>g__TransportMessageAferSending|0>d.MoveNext()
--- End of stack trace from previous location ---
   at Orleans.Runtime.OutgoingCallInvoker.Invoke()
   at Orleans.Runtime.OutgoingCallInvoker.Invoke()
   at Orleans.Runtime.GrainReferenceRuntime.InvokeWithFilters(GrainReference reference, InvokeMethodRequest request, String debugContext, InvokeMethodOptions options)
   at Orleans.Internal.OrleansTaskExtentions.<ToTypedTask>g__ConvertAsync|4_0[T](Task`1 asyncTask)
   at PerBlue.Common.GameServer.Grains.Configuration.RuntimeConfigurationService.RequestInitialConfiguration()
   at Orleans.Runtime.SiloLifecycleSubject.MonitoredObserver.OnStart(CancellationToken ct)
   at Orleans.LifecycleSubject.<OnStart>g__CallOnStart|7_0(Int32 stage, OrderedObserver observer, CancellationToken cancellationToken)
   at Orleans.LifecycleSubject.OnStart(CancellationToken ct)
   at Orleans.Runtime.Scheduler.AsyncClosureWorkItem.Execute()
   at Orleans.Runtime.Silo.StartAsync(CancellationToken cancellationToken)
   at Orleans.Hosting.SiloWrapper.StartAsync(CancellationToken cancellationToken)
   at Orleans.Hosting.SiloHostedService.StartAsync(CancellationToken cancellationToken)
   at Microsoft.Extensions.Hosting.Internal.Host.StartAsync(CancellationToken cancellationToken)
   at ...

jsteinich • Mar 21 '23 15:03

I attempted just doing a graceful shutdown on any startup task failure. This does result in skipping the 10 minute wait cycle, but does not actually resolve the issue.
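
A minimal sketch of what wrapping a startup task could look like, so that a failure is recorded instead of escaping as an unhandled exception. `StartupTaskGuard` and its members are hypothetical names, not Orleans APIs. The delegate shape matches what `AddStartupTask` accepts.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, not part of Orleans.
public sealed class StartupTaskGuard
{
    private readonly Action<Exception> _onFailure;
    private int _failureCount;

    public StartupTaskGuard(Action<Exception> onFailure)
    {
        _onFailure = onFailure ?? throw new ArgumentNullException(nameof(onFailure));
    }

    // True once any wrapped startup task has thrown.
    public bool AnyFailed => Volatile.Read(ref _failureCount) > 0;

    // Returns a delegate that runs the startup task and swallows its failure
    // after recording it and notifying the callback.
    public Func<IServiceProvider, CancellationToken, Task> Wrap(
        Func<IServiceProvider, CancellationToken, Task> startupTask)
    {
        return async (services, cancellationToken) =>
        {
            try
            {
                await startupTask(services, cancellationToken);
            }
            catch (Exception ex) when (!(ex is OperationCanceledException
                                         && cancellationToken.IsCancellationRequested))
            {
                // Cancellation requested by the host is allowed to propagate;
                // everything else is recorded here.
                Interlocked.Increment(ref _failureCount);
                _onFailure(ex);
            }
        };
    }
}
```

The wrapped delegate would be registered with `siloBuilder.AddStartupTask(guard.Wrap(...))`. The callback is one place to log the error and request a graceful stop, for example through `IHostApplicationLifetime.StopApplication()`.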

Looking into it a bit further, I can see that the RandomPlacementDirector calls into the SiloStatusOracle to get active silos, which simply looks at the membership status: https://github.com/dotnet/orleans/blob/ec31259418fcc574d575bbb70427719d18cc522d/src/Orleans.Runtime/MembershipService/SiloStatusOracle.cs#L74

The bad silo is also unable to be voted dead as the new silo shuts down before it gets a chance to run silo probes.

jsteinich • Mar 21 '23 20:03

There were quite a few changes between 3.0.2 and 3.6.5, one of which may have rectified this issue. Is there something preventing an upgrade? I recommend that before diving too deeply into this.

ReubenBond • Mar 21 '23 21:03

> There were quite a few changes between 3.0.2 and 3.6.5, one of which may have rectified this issue. Is there something preventing an upgrade? I recommend that before diving too deeply into this.

I attempted a quick upgrade, but ran into some dependency conflicts. I also see that there are some breaking changes.

I'm hoping that we'll be able to start an upgrade to 7.x in the near future, but if that fails to materialize, I'll revisit the 3.6.5 update.

jsteinich • Mar 22 '23 14:03

@ReubenBond I tested this again after upgrading to Orleans 7 and the behavior is the same.

I currently have a workaround of wrapping startup tasks and keeping track of failure status. If any failed, I'm using IClusterMembershipService.TryKill to clean up bad entries so that the next startup is successful.
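
A rough sketch of what the cleanup step might look like, not the actual workaround code. It assumes a single-silo deployment, where any other entry still marked active must be stale; in a multi-silo cluster this selection would declare healthy silos dead. `StaleMembershipCleanup` and `KillOtherActiveSilosAsync` are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.Extensions.DependencyInjection;
using Orleans.Runtime;

// Hypothetical helper, not part of Orleans.
public static class StaleMembershipCleanup
{
    // Intended to be called only after a startup task has failed.
    // Returns the number of entries that TryKill reported as updated.
    public static async Task<int> KillOtherActiveSilosAsync(IServiceProvider services)
    {
        var membership = services.GetRequiredService<IClusterMembershipService>();
        var localAddress = services.GetRequiredService<ILocalSiloDetails>().SiloAddress;

        // Assumption: with a single silo, every other "Active" entry is left over
        // from a crashed process. The snapshot may lag behind the table.
        var staleAddresses = membership.CurrentSnapshot.Members.Values
            .Where(member => member.Status == SiloStatus.Active
                             && !member.SiloAddress.Equals(localAddress))
            .Select(member => member.SiloAddress)
            .ToList();

        var killed = 0;
        foreach (var address in staleAddresses)
        {
            // TryKill asks the membership service to declare the silo dead.
            if (await membership.TryKill(address))
            {
                killed++;
            }
        }

        return killed;
    }
}
```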

jsteinich • Apr 10 '23 16:04

> Startup task failure leading to ungraceful shutdown. I can wrap the startup tasks, but perhaps this scenario could be handled by a more graceful shutdown.

I think we should implement this. We shouldn't be ungracefully shutting down the silo just because application code failed.

ReubenBond • Apr 11 '23 15:04

> I think we should implement this. We shouldn't be ungracefully shutting down the silo just because application code failed.

There also appears to be an issue with the grain directory not respecting the "I am alive" timeout that the membership system uses (not explicitly application code failure). Here's an updated trace of that:

Orleans.Runtime.OrleansMessageRejectionException: Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to endpoint S127.0.0.1:22253:40146827. See InnerException
 ---> Orleans.Networking.Shared.SocketConnectionException: Unable to connect to 127.0.0.1:22253. Error: ConnectionRefused
   at Orleans.Networking.Shared.SocketConnectionFactory.ConnectAsync(EndPoint endpoint, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/Shared/SocketConnectionFactory.cs:line 54
   at Orleans.Runtime.Messaging.ConnectionFactory.ConnectAsync(SiloAddress address, CancellationToken cancellationToken) in /_/src/Orleans.Core/Networking/ConnectionFactory.cs:line 61
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 193
   --- End of inner exception stack trace ---
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 221
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 224
   at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 81
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 117
   at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 51
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 88
   at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 74
   at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount) in /_/src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:line 727
   at Orleans.Runtime.GrainDirectory.DhtGrainLocator.Lookup(GrainId grainId) in /_/src/Orleans.Runtime/GrainDirectory/DhtGrainLocator.cs:line 30
   at Orleans.Runtime.Placement.PlacementService.PlacementWorker.GetOrPlaceActivationAsync(Message firstMessage) in /_/src/Orleans.Runtime/Placement/PlacementService.cs:line 338
   at Orleans.Runtime.Messaging.MessageCenter.<AddressAndSendMessage>g__SendMessageAsync|40_0(Task addressMessageTask, Message m) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 439
   at Orleans.Serialization.Invocation.ResponseCompletionSource.GetResult(Int16 token) in /_/src/Orleans.Serialization/Invocation/ResponseCompletionSource.cs:line 81
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 117
   at Orleans.Runtime.ActivityPropagationGrainCallFilter.Process(IGrainCallContext context, Activity activity) in /_/src/Orleans.Core/Diagnostics/ActivityPropagationGrainCallFilter.cs:line 51
   at Orleans.Runtime.OutgoingCallInvoker`1.Invoke() in /_/src/Orleans.Core/Runtime/OutgoingCallInvoker.cs:line 88
   at Orleans.Runtime.GrainReferenceRuntime.InvokeMethodWithFiltersAsync[TResult](GrainReference reference, IInvokable request, InvokeMethodOptions options) in /_/src/Orleans.Core/Runtime/GrainReferenceRuntime.cs:line 74

jsteinich • Apr 11 '23 17:04

This looks very similar to what we're experiencing with an app on v3.7.1. The difference is that we're using SQL Server for everything. I also see the grain directory message above in some stack traces, though that is only one of several variations.

JorgeCandeias • Nov 04 '23 18:11

Any progress on this? I also see such exceptions. I use k8s clustering with Orleans 8.

hankovich • Jan 15 '24 14:01

@hankovich does your application ever start? Are you able to provide more detail, potentially including logs?

ReubenBond • Jan 15 '24 20:01