
Cluster keeps trying to communicate with a silo using its old IP address

Open · MaximTkachenko opened this issue 6 months ago · 0 comments

Environment:

  • net8.0
  • Hosted on Kubernetes in Azure
  • Azure table storage clustering provider
  • packages:
    • Microsoft.Orleans.Server 8.2.0
    • Microsoft.Orleans.Hosting.Kubernetes 8.2.0
    • Microsoft.Orleans.Clustering.AzureStorage 8.2.0

Problem: Last night, 6 of our 12 silos restarted for an unknown reason, probably a transient network issue. Afterwards, three pods each had two records in the membership table: a Dead record with the old IP address and an Alive record with the new IP address. However, for the 6 hours after the restarts (until we did a full deployment), the cluster kept trying to communicate with one of these pods using its old IP address.

Exceptions:

Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Unable to connect to S10.142.35.18:11111:101390272, will retry after 867.2316ms
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 99
   at Orleans.Runtime.Messaging.MessageCenter.<SendMessage>g__SendAsync|30_0(MessageCenter messageCenter, ValueTask`1 connectionTask, Message msg) in /_/src/Orleans.Runtime/Messaging/MessageCenter.cs:line 236
Exception while sending message: Orleans.Runtime.Messaging.ConnectionFailedException: Connection attempt to endpoint S10.142.35.18:11111:101390272 timed out after 00:00:05
   at Orleans.Runtime.Messaging.ConnectionManager.ConnectAsync(SiloAddress address, ConnectionEntry entry) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 219
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
   at Orleans.Runtime.Messaging.ConnectionManager.GetConnectionAsync(SiloAddress endpoint) in /_/src/Orleans.Core/Networking/ConnectionManager.cs:line 106
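For triage, it may help to cross-check the address in these exceptions against the membership table. The addresses in the log lines appear to follow the pattern S<ip>:<port>:<generation>, where the generation presumably distinguishes a restarted silo instance from its previous incarnation. The following is a minimal Python sketch, not Orleans code: `parse_silo_address`, `MembershipRow`, and `status_of` are hypothetical helpers, the row layout is a simplified assumption rather than the Azure table schema, and the Alive row's IP and generation are made-up placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SiloAddr:
    ip: str
    port: int
    generation: int


def parse_silo_address(text: str) -> SiloAddr:
    """Parse a log-style address such as 'S10.142.35.18:11111:101390272'.

    Assumes the leading 'S' marker and ip:port:generation layout seen in the logs.
    """
    if not text.startswith("S"):
        raise ValueError(f"not a silo address: {text!r}")
    # Split from the right so only the last two colons separate port and generation.
    ip, port, generation = text[1:].rsplit(":", 2)
    return SiloAddr(ip, int(port), int(generation))


@dataclass(frozen=True)
class MembershipRow:
    # Hypothetical, simplified membership table row (not the real schema).
    address: SiloAddr
    status: str  # "Alive" or "Dead", using the wording from this report


def status_of(target: SiloAddr, rows: List[MembershipRow]) -> Optional[str]:
    """Return the recorded status of this exact silo instance, or None if absent."""
    for row in rows:
        if row.address == target:
            return row.status
    return None


if __name__ == "__main__":
    failing = parse_silo_address("S10.142.35.18:11111:101390272")
    rows = [
        MembershipRow(failing, "Dead"),
        # Hypothetical replacement instance with a placeholder IP and generation.
        MembershipRow(SiloAddr("10.0.0.2", 11111, 101390999), "Alive"),
    ]
    # A "Dead" status means messages are still addressed to a stale instance.
    print(failing, status_of(failing, rows))
```

If an address that keeps appearing in fresh "Unable to connect" errors maps to a Dead row, that matches the symptom described here: senders are still targeting the old instance instead of its replacement.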

During these hours we also saw many "Target silo is known to be dead" errors.

Is this a known issue?

MaximTkachenko, May 12 '25 10:05