UpdateIAmAlive retries indefinitely even though it won't be able to succeed
I had this issue with Redis clustering, but I believe it may be more generic.
Steps to reproduce:
- Spin up a typical cluster with non-localhost clustering and let it operate for some time
- Make the entry for one of the silos disappear from clustering storage without touching the cluster itself. One way to do this is to DELETE the silo's row from the membership table when ADO.NET clustering is used. Another way (which happened in my case) is to run Redis clustering with no 'Redis persistence' and simply reboot the Redis instance
- Observe exceptions like the following happening every few seconds, indefinitely:
Failed to update table entry for this silo, will retry shortly:

```
Orleans.Clustering.Redis.RedisClusteringException: Could not find a value for the key S10.0.1.101:8952:392386255
   at Orleans.Clustering.Redis.RedisMembershipTable.UpdateIAmAlive(MembershipEntry entry)
   at Orleans.Runtime.MembershipService.MembershipTableManager.UpdateIAmAlive() in /_/src/Orleans.Runtime/MembershipService/MembershipTableManager.cs:line 201
   at Orleans.Runtime.MembershipService.MembershipAgent.UpdateIAmAlive() in /_/src/Orleans.Runtime/MembershipService/MembershipAgent.cs:line 72
```
Actual result:
`Orleans.Runtime.MembershipService.MembershipAgent` enters a retry loop that it cannot resolve on its own.
Expected result:
`Orleans.Runtime.MembershipService.MembershipAgent` recognizes permanent failures (in my example the error message contains "Could not find a value for the key") and does not retry them (in my example it explicitly states that it will: "will retry shortly"), but instead does something else, such as trying to reconfigure the whole cluster or treating itself as a newly joining silo.
How should we address this? Generally speaking, there's no way to know which errors are transient and which are permanent/fatal.
I'd suggest something similar to what already exists on IClusterClient:

```csharp
Task Connect(Func<Exception, Task<bool>> retryFilter = null);
```

where we can register our own handler to tell Orleans whether it should continue retrying or do something else.
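To make the suggestion concrete, here is a minimal sketch of what registering such a filter on the silo side might look like. Only the `IClusterClient.Connect` signature quoted above is a real Orleans API; `HypotheticalMembershipOptions` and its `UpdateIAmAliveRetryFilter` property are assumptions invented for illustration and do not exist in Orleans today.

```csharp
// HYPOTHETICAL sketch: Orleans exposes a retryFilter on IClusterClient.Connect,
// but no equivalent hook exists today for the silo's IAmAlive updates.
// Everything named Hypothetical*/UpdateIAmAliveRetryFilter below is assumed.
var builder = new HostBuilder()
    .UseOrleans(silo =>
    {
        silo.UseRedisClustering("localhost:6379");

        // Assumed option: invoked after each failed UpdateIAmAlive attempt.
        // Returning false would stop the retry loop and let the silo attempt
        // recovery (e.g. rejoin the cluster as a new member) instead.
        silo.Configure<HypotheticalMembershipOptions>(options =>
        {
            options.UpdateIAmAliveRetryFilter = async exception =>
            {
                // Treat a missing key as permanent: our membership entry is gone,
                // so retrying the update can never succeed.
                if (exception is RedisClusteringException rce &&
                    rce.Message.Contains("Could not find a value for the key"))
                {
                    return false; // stop retrying, trigger recovery instead
                }
                return true; // assume transient: keep retrying shortly
            };
        });
    });
```

The shape mirrors `Connect`'s `Func<Exception, Task<bool>>` so callers can decide per-exception, asynchronously, whether a failure is worth retrying.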
@oleggolovkov would that be a filter specific to each clustering provider, specific to IAmAlive, or a global filter?
The main idea is that the cluster can sometimes get into a (somewhat failed) state from which it is unable to recover automatically; in this case, the only solution I could find was rebooting the silos. If updating IAmAlive is the only place in the system where infinite retries happen, then a filter specific to that case would be enough; if there are other places, it would be nice to have either a global filter or specific filters for those places as well.
I think this issue is related https://github.com/OrleansContrib/Orleans.Redis/issues/13
> The other possibility (which happens in my case) is to have Redis clustering with no 'Redis persistence' and simply reboot this Redis instance

Probably the same happened to me, since I saw the same error and my Redis has no persistence either.
We've moved this issue to the Backlog. This means that it is not going to be worked on for the coming release. We review items in the backlog at the end of each milestone/release and depending on the team's priority we may reconsider this issue for the following milestone.
I had the same problem today with Microsoft.Orleans.Clustering.Redis 8.0.0 in K8s:
```json
{
  "@timestamp": "2024-02-06T19:07:16.3144540+03:30",
  "level": "Error",
  "messageTemplate": "Failed to update table entry for this silo, will retry shortly",
  "message": "Failed to update table entry for this silo, will retry shortly",
  "exception": {
    "Depth": 0,
    "ClassName": "Orleans.Clustering.Redis.RedisClusteringException",
    "Message": "Could not find a value for the key S10.233.75.237:11111:66128158",
    "Source": "Orleans.Clustering.Redis",
    "StackTraceString": " at Orleans.Clustering.Redis.RedisMembershipTable.UpdateIAmAlive(MembershipEntry entry) in /_/src/Redis/Orleans.Clustering.Redis/Storage/RedisMembershipTable.cs:line 156\n at Orleans.Runtime.MembershipService.MembershipTableManager.UpdateIAmAlive() in /_/src/Orleans.Runtime/MembershipService/MembershipTableManager.cs:line 204\n at Orleans.Runtime.MembershipService.MembershipAgent.UpdateIAmAlive() in /_/src/Orleans.Runtime/MembershipService/MembershipAgent.cs:line 71",
    "RemoteStackTraceString": null,
    "RemoteStackIndex": 0,
    "HResult": -2146233088,
    "HelpURL": null
  },
  "EventId": {
    "Id": 100659
  },
  "SourceContext": "Orleans.Runtime.MembershipService.MembershipAgent",
  "ElasticApmServiceName": "Okala_MS_Discount_API",
  "ElasticApmServiceVersion": "1.0.0",
  "ElasticApmServiceNodeName": null,
  "ElasticApmGlobalLabels": {},
  "MachineName": "ms-discount-57fd88dd86-xxd2c",
  "EnvironmentName": "Production",
  "ExceptionDetail": {
    "HResult": -2146233088,
    "Message": "Could not find a value for the key S10.233.75.237:11111:66128158",
    "Source": "Orleans.Clustering.Redis",
    "TargetSite": "Void MoveNext()",
    "Type": "Orleans.Clustering.Redis.RedisClusteringException"
  },
  "ApplicationName": "Okala.MS.Discount.API"
}
```
@ArminShoeibi did Redis restart and was no persistence configured? I recommend using Azure Storage for clustering instead
Yes @ReubenBond, only one of our pods crashed twice due to being OOM-killed. We've implemented Redis Sentinel with three sentinels and two replicas, and our Redis instance is configured to save data to disk; it serves as our primary database and has proven to be quite stable.
I should note that it was after enabling the Redis grain directory that I encountered the mentioned error.