akka.net
Akka.Cluster.Tools.Singleton: singleton moves as soon as node with higher `AppVersion` joins cluster?
Version Information
Version of Akka.NET? v1.5.0
Which Akka.NET Modules? Akka.Cluster.Tools
Describe the bug
Chasing down an issue for a production support customer: they have a custom pbm command for tracking the location of cluster singletons. They confirmed the singleton was on a specific node and decided to replace that node last during a version upgrade. What they observed was that the singleton moved onto the newest node, the one with the highest AppVersion, even before that oldest node was downed!
Expected behavior
As I wrote back to the customer originally, the singleton should only move onto a new node AFTER the node it's currently on begins to leave the cluster. This leads me to believe that the following code might have a bug in how we compute the sort order that decides which node is the most suitable location for a singleton:
https://github.com/akkadotnet/akka.net/blob/3f0be58a661150c3d14572cd4615b526ba5e037a/src/contrib/cluster/Akka.Cluster.Tools/Singleton/OldestChangedBuffer.cs#L98-L112
In fact, I'm almost certain that this is the case.
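To make the suspected ordering problem concrete, here is a toy model in Python. It is not the Akka.NET implementation: the `Member` type, the `up_number` field, and both selector functions are hypothetical names invented for this sketch, and versions are assumed to be equal-length integer tuples. It only shows why ranking by `AppVersion` ahead of age would relocate a singleton the moment a newer node joins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """Hypothetical stand-in for a cluster member (not the Akka.NET type)."""
    address: str
    up_number: int      # lower means the node became Up earlier, i.e. it is older
    app_version: tuple  # assumed to be an equal-length tuple such as (1, 0, 0)


def oldest_by_age(members):
    """Pick the singleton host purely by age, breaking ties on address."""
    return min(members, key=lambda m: (m.up_number, m.address))


def oldest_with_app_version(members):
    """Pick the host by highest app version first, then by age.

    Negating each version part makes min() prefer the highest version.
    """
    return min(
        members,
        key=lambda m: (tuple(-p for p in m.app_version), m.up_number, m.address),
    )


# Two nodes on the old version; node-a is the oldest and hosts the singleton.
cluster = [
    Member("node-a", up_number=1, app_version=(1, 0, 0)),
    Member("node-b", up_number=2, app_version=(1, 0, 0)),
]

# A rolling upgrade adds node-c on a newer version while node-a is still Up.
upgraded = cluster + [Member("node-c", up_number=3, app_version=(1, 1, 0))]

print(oldest_by_age(upgraded).address)
print(oldest_with_app_version(upgraded).address)
```

In this model the age-only selector keeps choosing `node-a` after `node-c` joins, while the version-aware selector switches to `node-c` even though `node-a` has not started leaving. That matches the behavior described above, where the singleton moved before the old node was downed.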
Marking this bug as critical - one of the major side effects from this issue is that we can create split brains with all cluster singletons during deployments when the AppVersion is getting bumped. That can result in problems such as #6973
So this bug likely affected fewer people than I initially thought, because the setting at
https://github.com/akkadotnet/akka.net/blob/d1ed226e8b140215427bbd8ffd58130662d7ff28/src/contrib/cluster/Akka.Cluster.Tools/Singleton/reference.conf#L49
has been set to false this whole time, and false is also the default value the HOCON extractors use when this configuration isn't available. That's good news, but it still needed to be fixed.
It looks like the original issue reported by the end user wasn't even caused by the AppVersion, but this feature is definitely a footgun and probably needs to be removed.