Akka.Cluster.Tools.Singleton: singleton moves as soon as node with higher `AppVersion` joins cluster?

Open Aaronontheweb opened this issue 1 year ago • 3 comments

Version Information

Version of Akka.NET? v1.5.0
Which Akka.NET Modules? Akka.Cluster.Tools

Describe the bug

Chasing down an issue for a production support customer: they have a custom `pbm` command for tracking the location of cluster singletons. They confirmed the singleton was on a specific node and decided to replace that node last during a version upgrade. What they observed was that the singleton moved onto the newest node, the one with the highest `AppVersion`, even before the oldest node was downed!

Expected behavior

As I wrote back to the customer originally, the singleton should only move onto a new node AFTER the node it's currently on begins to leave the cluster. This leads me to believe that the following code might have a bug in how we compute the sort order that decides which node is the most suitable location for a singleton:

https://github.com/akkadotnet/akka.net/blob/3f0be58a661150c3d14572cd4615b526ba5e037a/src/contrib/cluster/Akka.Cluster.Tools/Singleton/OldestChangedBuffer.cs#L98-L112

In fact, I'm almost certain that this is the case.

Aaronontheweb, May 22 '24 21:05

Marking this bug as critical. One of the major side effects of this issue is that we can create split brains with all cluster singletons during deployments in which the `AppVersion` is bumped. That can result in problems such as #6973.

Aaronontheweb, May 23 '24 17:05

So this bug likely affected fewer people than I initially thought, as the setting at

https://github.com/akkadotnet/akka.net/blob/d1ed226e8b140215427bbd8ffd58130662d7ff28/src/contrib/cluster/Akka.Cluster.Tools/Singleton/reference.conf#L49

has been set to `false` this whole time, and `false` is also the default value the HOCON extractors fall back to when this configuration isn't available. That's good news, but it still needed to be fixed.
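
For reference, here is roughly how the setting discussed above would appear in an application's HOCON. This is only a sketch: it assumes the linked line is the `consider-app-version` key under `akka.cluster.singleton`. The thread does not quote the key name, so verify it against the `reference.conf` of the version you run.

```hocon
akka.cluster.singleton {
  # Assumed key name (not quoted in this thread).
  # false: singleton placement ignores AppVersion and follows member age only.
  consider-app-version = false
}
```

With the value left at `false`, a node joining with a higher `AppVersion` should not by itself influence where the singleton runs.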

Aaronontheweb, May 23 '24 20:05

Looks like the original issue reported by the end user wasn't even caused by the `AppVersion`, but this feature is definitely a footgun and probably needs to be removed.

Aaronontheweb, May 24 '24 18:05