
Silo won't recover

Open pablo-salad opened this issue 1 year ago • 12 comments

Context:

  • We are running a heterogeneous Orleans cluster with a total of 4 silos (2 silos of each configuration).
  • The cluster has been active for approximately 18 hours.
  • We are using the latest NuGet packages for all Orleans and OrleansContrib dependencies.
  • The membership table is hosted in Kubernetes and is accurate, showing 4 active silos and some old, defunct silos (still within the DefunctSiloExpiration period).
  • Silo 10.216.2.133 began exhibiting failure behavior, where:
    • Messages originating from this silo fail. Orleans.Runtime.OrleansException: Current directory at S10.216.2.133:11111:86745024 is not stable to perform the lookup for grainId user/05f621896f744c099f6136809969d981 (it maps to S10.216.3.135:11111:86731531, which is not a valid silo). Retry later.
    • Messages destined for this silo time out. Response did not arrive on time in 00:00:30 for message: Request [S10.216.3.136:11111:86745026 sys.svc.dir.cache-validator/10.216.3.136:11111@86745026]->[S10.216.2.133:11111:86745024 sys.svc.dir.cache-validator/10.216.2.133:11111@86745024] Orleans.Runtime.IRemoteGrainDirectory.LookUpMany(System.Collections.Generic.List`1[System.ValueTuple`2[Orleans.Runtime.GrainId,System.Int32]]) #1865509. . About to break its promise.

Steps Taken:

  • After identifying the issue with silo 10.216.2.133, we stopped the silo.
  • Once the silo was stopped, the remaining silos picked up the ring change and the cluster resumed normal operations immediately.

Questions/Concerns:

  • How did the cluster end up in this state?
    • We observed the "retry later" error message in the logs for the failing silo but are unsure why the silo did not recover on its own.
  • Is there something missing in our health checks?
    • We want to know if there is a configuration or mechanism we are missing that could detect and prevent a failing silo from continuing to attempt message processing.

pablo-salad avatar Oct 01 '24 19:10 pablo-salad

Do you see any logs mentioning LocalSiloHealthMonitor? If so, they may be useful. Are you able to post your configuration, particularly ClusterMembershipOptions? It's logged during startup.

ReubenBond avatar Oct 01 '24 19:10 ReubenBond

We have some messages like this: .NET Thread Pool is exhibiting delays of 1.0037756s. This can indicate .NET Thread Pool starvation, very long .NET GC pauses, or other runtime or machine pauses.

The only non-default value in ClusterMembershipOptions is DefunctSiloExpiration = 1 day.
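In code, that corresponds to something like the following (simplified sketch of the silo host setup; everything else is left at its default):

```csharp
using System;
using Microsoft.Extensions.Hosting;
using Orleans.Configuration;
using Orleans.Hosting;

// Minimal sketch of a silo host where the only non-default membership
// setting is DefunctSiloExpiration (dead entries stay in the table for a day).
// Kubernetes clustering configuration omitted.
var host = Host.CreateDefaultBuilder(args)
    .UseOrleans(siloBuilder =>
    {
        siloBuilder.Configure<ClusterMembershipOptions>(options =>
        {
            options.DefunctSiloExpiration = TimeSpan.FromDays(1);
        });
    })
    .Build();

await host.RunAsync();
```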

pablo-salad avatar Oct 02 '24 19:10 pablo-salad

OK, that thread pool delay is an indication that application performance is severely degraded for some reason. Profiling or a memory dump might help indicate why. Can you tell me more about your scenario? Does your app log at a very high rate? Where are you running your app? If Kubernetes, what resources are provided to the pods, and do you have CPU limits set?
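If you want a quick, out-of-band sanity check, something along these lines approximates what that warning is measuring (illustrative sketch only, not the runtime's actual monitor):

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Queue a work item and measure how long it waits before the .NET ThreadPool
// actually runs it. A healthy pool runs it almost immediately; waits of a
// second or more line up with the delay reported in the warning above.
var stopwatch = Stopwatch.StartNew();
var completion = new TaskCompletionSource<TimeSpan>(
    TaskCreationOptions.RunContinuationsAsynchronously);

ThreadPool.QueueUserWorkItem(_ => completion.TrySetResult(stopwatch.Elapsed));

var delay = await completion.Task;
Console.WriteLine($"Thread pool scheduling delay: {delay.TotalMilliseconds:F1} ms");
```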

ReubenBond avatar Oct 02 '24 19:10 ReubenBond

We did a rolling restart of the silos with debug logging enabled. Multiple app login attempts were made to reproduce the timeout issue previously encountered. The following observations were made:

  • No significant or interesting messages were observed in the logs related to "LocalSiloHealthMonitor."
  • The issue of querying a silo that doesn’t exist still persists.
  • GCP performance dashboard does not show any notable increase in node usage; the usage metrics per server remain within average limits.
  • There is no evident explanation for the "GC pressure" message.
  • No GC or thread pool starvation log entries were recorded during the observed time period, which further supports our belief that this is not a performance-related issue.

We were able to get the environment working once we cleared the membership table completely and rolled the nodes out again.

Let us know if there is any related information that could help find the root cause.

pablo-salad avatar Oct 14 '24 19:10 pablo-salad

I've been working with @pablo-salad on this issue. Whenever the silos get gummed up, this process seems to be the surefire way to get it working again:

  1. Scale all of our K8s silo deployments to zero.
  2. Manually remove all clusterversions and silos from K8s (using kubectl delete ...).
  3. Scale our K8s silo deployments back up to desired levels.

Although this once occurred during a period of heavy load, over the past couple of weeks the problem has recurred after every new rolling deployment while the servers have been at a consistent ~20% resource utilization. As noted in the last message from @pablo-salad, there is almost nothing interesting we can find in the debug-level logs from Orleans system components.

Would this issue be better served in the Orleans.Clustering.Kubernetes repo?

seniorquico avatar Oct 18 '24 00:10 seniorquico

@seniorquico @pablo-salad thanks for the information. A little more information may be helpful:

  • How many silos do you typically have in a cluster?
  • How many grains?
  • What is the rate of grain activation (order of magnitude per unit of time)?

ReubenBond avatar Oct 18 '24 01:10 ReubenBond

This particular cluster is used for QA. We have a K8s autoscaler attached, and we allow it to scale down to 4 silos. It rarely scales up.

[Screenshot: grain activations over time in the QA cluster]

When automated tests are running, we may be talking 1000-2000 grain activations over several minutes. When QA and developers sign in for something exploratory, it looks like the screenshot above: maybe 100-200 activations over several minutes. We're way overprovisioned, but we want to keep it as similar as we can to our production environment, and we've had issues in the past when we scale too low.

This "membership" issue (just what it appears to look like to us) has been occurring on every rolling deployment to this QA cluster for the past ~4 weeks. We were finding silly workarounds, but ultimately have landed on the above procedure being the only reliable way to get it working every time. This problem has never occurred in the production cluster (knock on wood!). Hardware wise, it runs on the same GCP class of VMs as our production cluster, but a fully separate GKE cluster/nodes. We use Terraform and Argo CD templates for everything- they're as close to each other as we think we can get.

seniorquico avatar Oct 18 '24 01:10 seniorquico

Thanks for the additional info. When silos restart during the rolling restart, are they able to connect to the cluster? Do you have any startup tasks? Which version of .NET and Orleans are you running? Are you able to provide the contents of the membership table during an outage?

ReubenBond avatar Oct 18 '24 01:10 ReubenBond

When silos restart during the rolling restart, are they able to connect to the cluster?

I'm not quite sure what you mean (technically) by "connect to the cluster". The new silos that come up successfully add themselves to the membership table, and I haven't noticed any network errors in the logs. We have debug-level logging enabled for the Orleans.* namespaces. Is there an example message I could look for to confirm they've "connected"?

It may not be relevant, but we have used this gist as the basis of our K8s health checks, and the health checks continue to pass during the outage:

https://gist.github.com/ReubenBond/2881c9f519910ca4a378d148346e9114

Apologies, I don't know how the local grain call is meant to behave before the silo successfully connects to the cluster.
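For context, the core of that pattern is a health check that calls a grain we expect to activate locally; here's a rough sketch (the interface name and details here are mine, not necessarily the gist's exact code):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.Diagnostics.HealthChecks;
using Orleans;

// Hypothetical grain used only by the probe; the implementation (not shown)
// would be placed so that it activates on the local silo.
public interface ILocalHealthCheckGrain : IGrainWithGuidKey
{
    Task PingAsync();
}

public sealed class GrainHealthCheck : IHealthCheck
{
    private readonly IGrainFactory _grainFactory;

    public GrainHealthCheck(IGrainFactory grainFactory) => _grainFactory = grainFactory;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            // If the silo can't schedule and address messages, this call
            // throws or times out and the probe reports unhealthy.
            await _grainFactory.GetGrain<ILocalHealthCheckGrain>(Guid.Empty).PingAsync();
            return HealthCheckResult.Healthy();
        }
        catch (Exception exception)
        {
            return HealthCheckResult.Unhealthy("Local grain call failed", exception);
        }
    }
}
```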

Do you have any startup tasks?

No.

Which version of .NET and Orleans are you running?

Orleans 8.2.0. Compiled using .NET SDK 8.0.401 and using the .NET 8.0.8 runtime.

Are you able to provide the contents of the membership table during an outage?

It may be a few days until our next rolling deployment. I'll dump the K8s silos and clusterversions before, during, and after and report back.

seniorquico avatar Oct 19 '24 15:10 seniorquico

Apologies for the delayed reply. Here's a ZIP file of kubectl dumps of the membership table:

Membership Table States.zip

In summary... before the rolling deployment (clusterVersion 40):

| Silo Address | Silo Status |
| --- | --- |
| 10.216.2.47 | Active |
| 10.216.6.39 | Active |
| 10.216.6.42 | Active |
| 10.216.7.50 | Active |

... during the rolling deployment (clusterVersion 48):

| Silo Address | Silo Status |
| --- | --- |
| 10.216.2.47 | Active |
| 10.216.6.39 | Active |
| 10.216.6.42 | Active |
| 10.216.7.50 | Active |
| 10.216.2.58 | Active |
| 10.216.2.60 | Active |
| 10.216.6.44 | Active |
| 10.216.7.52 | Active |

... after the rolling deployment (clusterVersion 56):

| Silo Address | Silo Status | Suspecting Silos |
| --- | --- | --- |
| 10.216.2.47 | Dead | 10.216.2.47:11111@88214300 |
| 10.216.6.39 | Dead | 10.216.6.39:11111@88214293 |
| 10.216.6.42 | Dead | 10.216.6.42:11111@88366789 |
| 10.216.7.50 | Dead | 10.216.7.50:11111@88366788 |
| 10.216.2.58 | Active | |
| 10.216.2.60 | Active | |
| 10.216.6.44 | Active | |
| 10.216.7.52 | Active | |

EDIT: Just to confirm, I let it run for a couple of hours after the rolling deployment finished. The silos in the "active" state continued to show up as such in the membership table, but no messages were flowing around. I scaled all of the deployments to zero, deleted all of the clusterversions and silos CRs using kubectl, then scaled everything back up. It all came back up working as expected.
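As an aside, in case it helps anyone else debugging this: the cluster's own view of membership can also be read from inside a silo (or a connected client) via IManagementGrain, rather than dumping the Kubernetes CRs. Rough sketch:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Orleans;
using Orleans.Runtime;

public static class MembershipDump
{
    public static async Task PrintAsync(IGrainFactory grainFactory)
    {
        // IManagementGrain is a system grain keyed by the well-known id 0.
        var management = grainFactory.GetGrain<IManagementGrain>(0);

        // onlyActive: false also includes entries in other states (e.g. Dead).
        Dictionary<SiloAddress, SiloStatus> hosts = await management.GetHosts(onlyActive: false);

        foreach (var (silo, status) in hosts)
        {
            Console.WriteLine($"{silo} -> {status}");
        }
    }
}
```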

seniorquico avatar Oct 28 '24 02:10 seniorquico

This problem has never occurred in the production cluster (knock on wood!)

@seniorquico, do you also have 4 silos (pods) in the production environment? If you set the replica count to 6 or more, does this issue disappear?

shykln avatar Apr 02 '25 05:04 shykln

seniorquico, do you also have 4 silos (pods) in the production environment? If you set the replica count to 6 or more, does this issue disappear?

@shykln The production environment has a K8s autoscaler running. It fluctuates between 4 and 8 silos depending on load. Right now, it's running with 4 silos. We've definitely performed rolling deployments when the production cluster was at this minimum silo count. We still haven't encountered this problem in the production environment, which is still weird.

seniorquico avatar Apr 02 '25 06:04 seniorquico

I'm not sure if this is the exact same scenario, but it's the same symptom: a silo ran away with memory, and K8s killed it (OOMKilled). Unfortunately, three hours later, the 5 silos still running in the cluster were all logging directory failures. We had to restart all of the silos (rolling restart) to get the cluster back to healthy. Examples:

Failed to address message Request [S10.52.7.168:11111:113473568 sys.client/255994194b134e049939e3f5db1bb27c]->[ containergroup/salad-sie-transcribe-90e3b0d8] Salad.Cloud.ContainerGroups.Grains.IContainerGroupGrain.GetContainerGroupAsync() #260A9BD3C5DB525F
Orleans.Runtime.OrleansException: Current directory at S10.52.7.168:11111:113473568 is not stable to perform the lookup for grainId containergroup/salad-sie-transcribe-90e3b0d8 (it maps to S10.52.7.169:11111:113473570, which is not a valid silo). Retry later.
   at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount) in /_/src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:line 697
   at Orleans.Runtime.GrainDirectory.DhtGrainLocator.Lookup(GrainId grainId) in /_/src/Orleans.Runtime/GrainDirectory/DhtGrainLocator.cs:line 29
   at Orleans.Runtime.Placement.PlacementService.PlacementWorker.GetOrPlaceActivationAsync(Message firstMessage) in /_/src/Orleans.Runtime/Placement/PlacementService.cs:line 373
   at Orleans.Runtime.Placement.PlacementService.PlacementWorker.AddressWaitingMessages(GrainPlacementWorkItem completedWorkItem) in /_/src/Orleans.Runtime/Placement/PlacementService.cs:line 339
--- End of stack trace from previous location ---
   at Orleans.Runtime.Messaging.MessageCenter.<AddressAndSendMessage>g__SendMessageAsync|40_0(Task addressMessageTask, Message m) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 476
Failed to address message IsAlwaysInterleave Request [S10.52.0.34:11111:113476741 sys.client/hosted-10.52.0.34:11111@113476741]->[ queueproducer/790e02a8c78d4415b01af6f5929352ec] Salad.Cloud.Queues.Grains.IQueueProducerGrain.CreateAndEnqueueJobAsync(Salad.Cloud.Queues.Grains.Models.CreateJobCommand, System.Guid) #28ECCC85397B802E
Orleans.Runtime.OrleansException: Current directory at S10.52.0.34:11111:113476741 is not stable to perform the lookup for grainId queueproducer/790e02a8c78d4415b01af6f5929352ec (it maps to S10.52.7.169:11111:113473570, which is not a valid silo). Retry later.
   at Orleans.Runtime.GrainDirectory.LocalGrainDirectory.LookupAsync(GrainId grainId, Int32 hopCount) in /_/src/Orleans.Runtime/GrainDirectory/LocalGrainDirectory.cs:line 697
   at Orleans.Runtime.GrainDirectory.DhtGrainLocator.Lookup(GrainId grainId) in /_/src/Orleans.Runtime/GrainDirectory/DhtGrainLocator.cs:line 29
   at Orleans.Runtime.Placement.PlacementService.PlacementWorker.GetOrPlaceActivationAsync(Message firstMessage) in /_/src/Orleans.Runtime/Placement/PlacementService.cs:line 373
   at Orleans.Runtime.Placement.PlacementService.PlacementWorker.AddressWaitingMessages(GrainPlacementWorkItem completedWorkItem) in /_/src/Orleans.Runtime/Placement/PlacementService.cs:line 339
--- End of stack trace from previous location ---
   at Orleans.Runtime.Messaging.MessageCenter.<AddressAndSendMessage>g__SendMessageAsync|40_0(Task addressMessageTask, Message m) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 476

I noticed something a bit different: they were all trying to reach S10.52.7.169:11111:113473570, and this silo was dead in the membership table:

address: 10.52.7.169
apiVersion: orleans.dot.net/v1
clusterId: salad-cloud-production
generation: 113473570
hostname: cloud-production-portal-api-server-669b46549d-glbgq
iAmAliveTime: "2025-08-06T09:19:42.524487+00:00"
kind: OrleansSilo
port: 11111
proxyPort: 30000
siloName: cloud-production-portal-api-server-669b46549d-glbgq
startTime: "2025-08-06T08:26:10.379076+00:00"
status: Dead
suspectingSilos:
- 10.52.7.169:11111@113476806
suspectingTimes:
- 2025-08-06 09:20:09.069 GMT

However, K8s had restarted the container in the pod (default "always" restart policy), and it was re-running with a new membership table entry:

address: 10.52.7.169
apiVersion: orleans.dot.net/v1
clusterId: salad-cloud-production
generation: 113476806
hostname: cloud-production-portal-api-server-669b46549d-glbgq
iAmAliveTime: "2025-08-06T09:21:22.035043+00:00"
kind: OrleansSilo
port: 11111
proxyPort: 30000
siloName: cloud-production-portal-api-server-669b46549d-glbgq
startTime: "2025-08-06T09:20:06.349056+00:00"
status: Active
suspectingSilos: []
suspectingTimes: []

From a K8s perspective, the behavior makes sense: it's really the same pod, so the new silo's name and IP are the same as the old one's. The only difference I see is the generation.

Is the generation critical in silo addressing? From the logs, it would appear so. The failures appear to be attempts to reach an older-generation silo. Why were the remaining silos stuck for 3+ hours trying to reach a silo with an older generation that had been marked "dead" in the membership table long ago?
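For what it's worth, my understanding (illustrative sketch below, not taken from the Orleans source) is that the generation is part of a silo's identity, so the restarted pod counts as a brand-new silo even though the IP and port are unchanged:

```csharp
using System;
using System.Net;
using Orleans.Runtime;

// Generations taken from the two membership entries above.
var endpoint = IPEndPoint.Parse("10.52.7.169:11111");

var deadIncarnation = SiloAddress.New(endpoint, 113473570); // the entry marked Dead
var newIncarnation = SiloAddress.New(endpoint, 113476806);  // the restarted pod

Console.WriteLine(deadIncarnation.Equals(newIncarnation)); // False: different generation, different silo
Console.WriteLine(deadIncarnation.Generation);             // 113473570
```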

seniorquico avatar Aug 06 '25 14:08 seniorquico