
UpdateIAmAlive retries indefinitely even though it won't be able to succeed

Open · oleggolovkov opened this issue 2 years ago · 9 comments

I hit this issue with Redis clustering, but I believe it may be more generic.

Steps to reproduce:

  1. Spin up a typical cluster with non-localhost clustering and let it operate for some time
  2. Make the entry for one of the silos disappear from the clustering storage without touching the cluster itself. One way to do this is to DELETE the silo's row from the membership table if ADO.NET clustering is used. Another (which happened in my case) is to run Redis clustering with no Redis persistence and simply reboot the Redis instance
  3. Observe exceptions like the following happening every few seconds, indefinitely:

Failed to update table entry for this silo, will retry shortly:
Orleans.Clustering.Redis.RedisClusteringException: Could not find a value for the key S10.0.1.101:8952:392386255
   at Orleans.Clustering.Redis.RedisMembershipTable.UpdateIAmAlive(MembershipEntry entry)
   at Orleans.Runtime.MembershipService.MembershipTableManager.UpdateIAmAlive() in /_/src/Orleans.Runtime/MembershipService/MembershipTableManager.cs:line 201
   at Orleans.Runtime.MembershipService.MembershipAgent.UpdateIAmAlive() in /_/src/Orleans.Runtime/MembershipService/MembershipAgent.cs:line 72
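For context, here is a rough sketch of why the retry can never succeed, inferred from the error message and stack trace rather than the actual Orleans source; `ReadRow` and `WriteRow` are hypothetical helpers:

```csharp
// Hedged paraphrase of the failure mode (not the real implementation):
// UpdateIAmAlive only refreshes an existing membership entry, so once the
// key is gone the call fails deterministically on every retry.
public Task UpdateIAmAlive(MembershipEntry entry)
{
    var row = ReadRow(entry.SiloAddress); // hypothetical helper
    if (row is null)
    {
        // Redis restarted without persistence => the key no longer exists,
        // so every retry hits this same branch forever.
        throw new RedisClusteringException(
            $"Could not find a value for the key {entry.SiloAddress}");
    }

    row.IAmAliveTime = entry.IAmAliveTime;
    return WriteRow(row); // hypothetical helper
}
```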

Actual result: Orleans.Runtime.MembershipService.MembershipAgent enters a retry loop that it cannot escape on its own

Expected result: Orleans.Runtime.MembershipService.MembershipAgent recognizes permanent failures (in my example the error message contains "Could not find a value for the key") and does not retry them indefinitely (the log explicitly states that it will: "will retry shortly"). Instead it does something else, such as trying to reconfigure the whole cluster or considering itself a newly joining silo

oleggolovkov avatar Jun 09 '22 10:06 oleggolovkov

How should we address this? Generally speaking, there's no way to know which errors are transient and which are permanent/fatal.

ReubenBond avatar Jun 09 '22 16:06 ReubenBond

I'd suggest something similar to what already exists on IClusterClient, namely Task Connect(Func<Exception, Task<bool>> retryFilter = null): we could register our own handler to tell Orleans whether it should continue retrying or do something else.
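For reference, the client-side filter mentioned above is used roughly like this. This is a sketch against the Orleans 3.x-style `ClientBuilder`/`Connect` API (the clustering setup is a placeholder); adjust for your version:

```csharp
// Sketch: the existing client-side retry filter on IClusterClient.Connect.
var client = new ClientBuilder()
    .UseLocalhostClustering() // placeholder clustering configuration
    .Build();

await client.Connect(retryFilter: async exception =>
{
    // Return true to keep retrying, false to give up.
    // A silo-side equivalent could stop retrying permanent membership errors.
    await Task.Delay(TimeSpan.FromSeconds(1));
    return exception is not Orleans.Clustering.Redis.RedisClusteringException;
});
```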

oleggolovkov avatar Jun 09 '22 16:06 oleggolovkov

@oleggolovkov would that be a filter specific to each clustering provider, a filter specific to IAmAlive, or a global filter?

ReubenBond avatar Jun 09 '22 16:06 ReubenBond

The main idea is that sometimes the cluster can get into a (somewhat failed) state from which it is unable to recover automatically; in this case the only solution I was able to find was rebooting the silos. If updating IAmAlive is the only place in the system where infinite retries happen, then a filter specific to that case would suffice; if there are other such places, it would be nice to have either a global filter or specific filters for those places as well.
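Purely to illustrate the suggestion, a silo-side filter might look something like the following. The `UpdateIAmAliveRetryFilter` property is entirely hypothetical and does not exist in Orleans today; only `ClusterMembershipOptions` itself is a real type:

```csharp
// Hypothetical API sketch only -- this option does not exist in Orleans.
siloBuilder.Configure<ClusterMembershipOptions>(options =>
{
    options.UpdateIAmAliveRetryFilter = async (Exception exception) =>
    {
        // Treat a missing membership key as permanent: stop retrying and
        // let the silo rejoin (or shut down) instead of looping forever.
        if (exception is RedisClusteringException)
            return false;

        return true; // keep retrying transient failures
    };
});
```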

oleggolovkov avatar Jun 09 '22 17:06 oleggolovkov

I think this issue is related: https://github.com/OrleansContrib/Orleans.Redis/issues/13

> The other possibility (which happens in my case) is to have Redis clustering with no 'Redis persistence' and simply reboot this Redis instance

Probably the same thing happened to me, since I saw the same error and my Redis has no persistence either.

ScarletKuro avatar Jun 09 '22 20:06 ScarletKuro

We've moved this issue to the Backlog. This means that it is not going to be worked on for the coming release. We review items in the backlog at the end of each milestone/release and depending on the team's priority we may reconsider this issue for the following milestone.

ghost avatar Jul 28 '22 23:07 ghost

I had the same problem today with Microsoft.Orleans.Clustering.Redis 8.0.0 in K8s

{
  "@timestamp": "2024-02-06T19:07:16.3144540+03:30",
  "level": "Error",
  "messageTemplate": "Failed to update table entry for this silo, will retry shortly",
  "message": "Failed to update table entry for this silo, will retry shortly",
  "exception": {
    "Depth": 0,
    "ClassName": "Orleans.Clustering.Redis.RedisClusteringException",
    "Message": "Could not find a value for the key S10.233.75.237:11111:66128158",
    "Source": "Orleans.Clustering.Redis",
    "StackTraceString": "   at Orleans.Clustering.Redis.RedisMembershipTable.UpdateIAmAlive(MembershipEntry entry) in /_/src/Redis/Orleans.Clustering.Redis/Storage/RedisMembershipTable.cs:line 156\n   at Orleans.Runtime.MembershipService.MembershipTableManager.UpdateIAmAlive() in /_/src/Orleans.Runtime/MembershipService/MembershipTableManager.cs:line 204\n   at Orleans.Runtime.MembershipService.MembershipAgent.UpdateIAmAlive() in /_/src/Orleans.Runtime/MembershipService/MembershipAgent.cs:line 71",
    "RemoteStackTraceString": null,
    "RemoteStackIndex": 0,
    "HResult": -2146233088,
    "HelpURL": null
  },
  "EventId": {
    "Id": 100659
  },
  "SourceContext": "Orleans.Runtime.MembershipService.MembershipAgent",
  "ElasticApmServiceName": "Okala_MS_Discount_API",
  "ElasticApmServiceVersion": "1.0.0",
  "ElasticApmServiceNodeName": null,
  "ElasticApmGlobalLabels": {},
  "MachineName": "ms-discount-57fd88dd86-xxd2c",
  "EnvironmentName": "Production",
  "ExceptionDetail": {
    "HResult": -2146233088,
    "Message": "Could not find a value for the key S10.233.75.237:11111:66128158",
    "Source": "Orleans.Clustering.Redis",
    "TargetSite": "Void MoveNext()",
    "Type": "Orleans.Clustering.Redis.RedisClusteringException"
  },
  "ApplicationName": "Okala.MS.Discount.API"
}

ArminShoeibi avatar Feb 06 '24 16:02 ArminShoeibi

@ArminShoeibi did Redis restart and was no persistence configured? I recommend using Azure Storage for clustering instead
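For anyone switching, here is a minimal sketch of Azure Table clustering configuration, assuming the Microsoft.Orleans.Clustering.AzureStorage package; the exact options API varies between Orleans versions, and the connection string shown targets the local storage emulator, so check the docs for your version:

```csharp
// Sketch: Azure Table Storage clustering (Microsoft.Orleans.Clustering.AzureStorage).
// API shape may differ across Orleans versions -- verify against your release.
var host = Host.CreateDefaultBuilder()
    .UseOrleans(siloBuilder =>
    {
        siloBuilder.UseAzureStorageClustering(options =>
        {
            // "UseDevelopmentStorage=true" points at the local Azurite emulator;
            // use a real storage account connection string in production.
            options.ConfigureTableServiceClient("UseDevelopmentStorage=true");
        });
    })
    .Build();
```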

ReubenBond avatar Feb 06 '24 18:02 ReubenBond

> @ArminShoeibi did Redis restart and was no persistence configured? I recommend using Azure Storage for clustering instead

Yes @ReubenBond, one of our pods crashed twice due to being OOM-killed. We've implemented Redis Sentinel with three sentinels and two replicas, and our Redis instance is set up to save data to disk; it serves as our primary database and has proven to be quite stable.

I should note that I encountered the mentioned error after enabling the Redis grain directory.

ArminShoeibi avatar Feb 06 '24 22:02 ArminShoeibi