protoactor-dotnet
Proto.Cluster issue when using static port number for Proto.Remote
I found an issue in Proto.Cluster, occurring on 0.27. If you have a single instance on a single host, force-kill that instance, and quickly start a new one, the killed instance never gets a chance to unregister itself from Consul/Etcd. When the new instance starts, it sees two cluster members with the same address. Normally this should only be a problem until the first instance's lease expires, after which it should resolve itself, but it never does, and I end up seeing these logs repeating forever:
Information MemberList did not find any not find any activator for kind '"GameServerManagerActor"'
[PartitionIdentity] No members currently available for kind "GameServerManagerActor"
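For context, the setup that hits this looks roughly like the sketch below. It is a minimal, hypothetical configuration (cluster name, host, port, provider choice, and the GameServerManagerActor class are placeholders), and the exact builder methods may differ slightly between Proto.Cluster versions; the important part is the fixed Proto.Remote port.

```csharp
using Proto;
using Proto.Cluster;
using Proto.Cluster.Consul;
using Proto.Cluster.Partition;
using Proto.Remote.GrpcCore;

// Static port: every restart of this host re-binds to 8090, so the
// replacement member ends up with the same address as the dead one.
var system = new ActorSystem()
    .WithRemote(GrpcCoreRemoteConfig.BindTo("10.0.0.5", 8090))
    .WithCluster(ClusterConfig
        .Setup("game-cluster",
            new ConsulProvider(new ConsulProviderConfig()),
            new PartitionIdentityLookup())
        // GameServerManagerActor is assumed to be an IActor implementation.
        .WithClusterKind("GameServerManagerActor",
            Props.FromProducer(() => new GameServerManagerActor())));

await system.Cluster().StartMemberAsync();
```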
This is caused by MemberList and SimpleMemberStrategy performing some of their Add/Remove operations by address rather than by the ActorSystem's member id. When the older instance's lease expires and it leaves the MemberList, its removal also clears every member from the SimpleMemberStrategy instances in _memberStrategyByKind for all the kinds that instance supports, which happens to be all of them.
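To make the failure mode concrete, here is a deliberately simplified, hypothetical model of remove-by-address bookkeeping; it is not the actual SimpleMemberStrategy code, just an illustration of why a shared address takes the live member down together with the ghost.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, stripped-down model of remove-by-address; not the real SimpleMemberStrategy.
var membersByKind = new Dictionary<string, List<(string Id, string Address)>>
{
    // Only the new, live instance is tracked here; the ghost with the same
    // address is still registered in Consul/Etcd until its lease expires.
    ["GameServerManagerActor"] = new() { ("member-b", "10.0.0.5:8090") }
};

void RemoveByAddress(string address)
{
    foreach (var members in membersByKind.Values)
        members.RemoveAll(m => m.Address == address); // also removes the live member
}

// The ghost's lease expires and the cluster provider reports it gone.
RemoveByAddress("10.0.0.5:8090");

// Prints 0 -> no member left for the kind, hence "No members currently available".
Console.WriteLine(membersByKind["GameServerManagerActor"].Count);
```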
A solution could be to modify SimpleMemberStrategy so that it removes by id rather than by address; I even saw a TODO comment about this. A more robust approach could be to simply reconstruct all the IMemberStrategy instances on every topology change. Likewise, _indexByAddress in MemberList should be rebuilt from _metaMembers each time.
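Under the same toy model, keying removal on the member id keeps the replacement alive even though it shares an address with the ghost. Again, this is only a sketch of the idea, not a patch against the real SimpleMemberStrategy:

```csharp
using System;
using System.Collections.Generic;

// Same hypothetical model, but removal is keyed on the member id, so the
// ghost and its replacement (which share an address) are distinguishable.
var membersByKind = new Dictionary<string, List<(string Id, string Address)>>
{
    ["GameServerManagerActor"] = new()
    {
        ("member-a", "10.0.0.5:8090"), // ghost left by the force-killed instance
        ("member-b", "10.0.0.5:8090")  // new instance on the same static port
    }
};

void RemoveById(string memberId)
{
    foreach (var members in membersByKind.Values)
        members.RemoveAll(m => m.Id == memberId); // only the ghost matches
}

// The ghost's lease expires; remove it by id instead of by address.
RemoveById("member-a");

// Prints 1 -> the new instance still serves the kind.
Console.WriteLine(membersByKind["GameServerManagerActor"].Count);
```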
I think this should be fairly easy to solve. The restarted node sees the old ghost node with the same address as itself. Since the new node knows the other node must be a ghost, it can block the ghost node and tell every other node about it via gossip.
cc @mhelleborg
@rogeralsing I am not sure that would solve it. The problem is that the addresses of the old host and the new one are identical because we use static port numbers, and SimpleMemberStrategy does RemoveAll by address, so when the old host disappears, all members supporting that kind disappear with it.
I switched our nodes to use auto-bound ports, and I can no longer reproduce this issue. Once the old host disappears from the ClusterProvider, everything resolves itself and the cluster becomes fully healthy.
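For anyone hitting the same thing, the auto-bound configuration is roughly the sketch below; as far as I know, binding to port 0 lets the OS pick an ephemeral port, though the exact config call depends on your Proto.Remote transport and version.

```csharp
using Proto;
using Proto.Remote.GrpcCore;

// Port 0 asks the OS for an ephemeral port, so a restarted instance never
// shares its address with the dead member it replaces.
var system = new ActorSystem()
    .WithRemote(GrpcCoreRemoteConfig.BindTo("10.0.0.5", 0));
```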