
[CURATOR-622] Add Randomness to LeaderLatch Elections

Open jira-importer opened this issue 4 years ago • 6 comments

Currently, LeaderLatch uses EPHEMERAL_SEQUENTIAL nodes, with the next leader chosen by the lowest-numbered node. In a multi-server environment where each server is a participant in multiple elections, the result is that the leader will always be the server that has been up the longest (or the first to be restarted during a rolling restart).

Instead of using sequentially numbered nodes, I propose that the node number for a new participant be created by adding a random number (from a constrained range) to the current leader's number (which defaults to zero). If a node with that number already exists, repeat until an available number is found. After initial node creation, all other aspects of the leader election remain unchanged.
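
(For illustration, a minimal sketch of the number-picking step described above; the class and helper names, the offset range, and how existing numbers are tracked are assumptions on my part, not the actual patch.)

```java
import java.util.Set;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative only: names and the offset range are assumptions, not the actual patch.
public final class RandomLatchNumberSketch {
    private static final int MAX_RANDOM_OFFSET = 1000; // the "constrained range"

    // Pick a node number by adding a random offset to the current leader's number
    // (zero when there is no leader yet), retrying until an unused number is found.
    public static int choose(int currentLeaderNumber, Set<Integer> existingNumbers) {
        int candidate;
        do {
            candidate = currentLeaderNumber
                    + ThreadLocalRandom.current().nextInt(1, MAX_RANDOM_OFFSET + 1);
        } while (existingNumbers.contains(candidate));
        return candidate;
    }
}
```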

I have an implementation for this that I am testing locally and will submit a PR once the tests are complete.


Originally reported by FreemanB, imported from: Add Randomness to LeaderLatch Elections
  • status: Open
  • priority: Major
  • resolution: Unresolved
  • imported: 2025-01-21

jira-importer avatar Nov 13 '21 18:11 jira-importer

randgalt:

It would help if you explained why this is better. The instances contending for a latch (or any lock) are random already. How is this better?

jira-importer avatar Nov 13 '21 23:11 jira-importer

JIRAUSER280175:

There is no reason to believe that the instances contending for a latch are random in many operational environments, and the election shouldn't depend on that assumption. The documentation for LeaderLatch says leadership is assigned randomly among the threads/processes contending for leadership, yet nothing in the LeaderLatch code actually generates any kind of randomness. In the current implementation, LeaderLatch operates more as a leadership queue than an election system, always choosing the next leader based on when a process joined the participant pool rather than randomly. Adding a random factor to the internal election process would make the behavior match the documentation and avoid "clumping" of leadership on the longest-running servers/processes.

In my particular case, we have a group of four servers contending for leadership of a larger number of jobs, in our case delivering data to external clients through various interfaces (I'll use 20 jobs in this example). Because the jobs are managed separately, each has its own LeaderLatch associated with it. When enabling the jobs, we can assign them to specific servers on startup to balance the load, so each server is initially leader for 5 jobs, with the other servers ready to take over in the event of a server outage. When applying patches, updates, or other routine maintenance, we generally do a rolling restart, allowing the jobs to be automatically redistributed to the currently running servers and minimizing downtime. However, each time we bring a server down, all of its jobs go to the server that has been up the longest, since it holds the lowest sequentially numbered node for all of the LeaderLatches. After rolling through all of the servers, the first server that was restarted ends up leader of all of the jobs, with the other servers idle. We're then forced to manually disable and re-enable most of the jobs to force them to redistribute among the servers, a much slower process than if leadership had been randomly reassigned.

jira-importer avatar Nov 14 '21 00:11 jira-importer

randgalt:

> There is no reason to believe that the instances contending for a latch are random in many operational environments

Why not? If n instances contend for a ZNode at around the same time the sequence number they receive is non-deterministic.

> In the current implementation, the LeaderLatch is operating more as a leadership queue than an election system

The current implementation is fair. Fairness in locks requires a queue (fair locks in the JDK do this).
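
(For comparison, a fair lock in the JDK makes that queueing explicit: a ReentrantLock constructed in fair mode grants the lock to the longest-waiting thread. Illustrative snippet only, unrelated to Curator's code.)

```java
import java.util.concurrent.locks.ReentrantLock;

public final class FairLockExample {
    // Fair mode (true) grants the lock to waiting threads in FIFO order --
    // the same queue-like behavior LeaderLatch currently has.
    private static final ReentrantLock FAIR_LOCK = new ReentrantLock(true);

    public static void doWork(Runnable task) {
        FAIR_LOCK.lock();
        try {
            task.run();
        } finally {
            FAIR_LOCK.unlock();
        }
    }
}
```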

> By adding a random factor into the internal election process...

This will be very hard to do without completely rewriting the LeaderLatch recipe. The current implementation takes advantage of the sequence numbers when instances that don't have the lock watch their predecessor - see line 578, where the watch path is the ZNode path of the previous node.
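
(For readers unfamiliar with the recipe, the predecessor-watch idea looks roughly like this; a simplified sketch against the Curator/ZooKeeper API, not the actual LeaderLatch source, with error and connection-loss handling omitted.)

```java
import java.util.Collections;
import java.util.List;
import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.Watcher;

// Simplified sketch: sort the sequential children and, if we are not first,
// watch only the node directly before ours.
public final class PredecessorWatchSketch {
    public static void checkLeadership(CuratorFramework client, String latchPath, String ourNodeName) throws Exception {
        List<String> children = client.getChildren().forPath(latchPath);
        Collections.sort(children); // the sequence suffix gives a total order
        int ourIndex = children.indexOf(ourNodeName);
        if (ourIndex < 0) {
            return; // our node is gone; real code must recover here
        }
        if (ourIndex == 0) {
            return; // lowest sequence number -> we are the leader
        }
        String predecessor = children.get(ourIndex - 1);
        Watcher watcher = event -> recheckQuietly(client, latchPath, ourNodeName);
        // watch only our predecessor; when it goes away, re-check leadership
        client.getData().usingWatcher(watcher).forPath(latchPath + "/" + predecessor);
    }

    private static void recheckQuietly(CuratorFramework client, String latchPath, String ourNodeName) {
        try {
            checkLeadership(client, latchPath, ourNodeName);
        } catch (Exception ignore) {
            // sketch only: real code must handle errors and connection loss
        }
    }
}
```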

I'm very much -1 on this change for LeaderLatch. Instead a new recipe should be written to handle your use case.

jira-importer avatar Nov 14 '21 08:11 jira-importer

cammckenzie:

hey Jordan Zimmerman, I thought that Tim Black's suggestions were reasonable. I can easily see the example above occurring in a long-running production system.

I suggested that he look at modifying the existing LeaderLatch, as it seemed to me to fit in there with minimal changes.

On further investigation into the way that a client only watches the previous zNode in the sorted list of children, you're probably right that it's going to be much more complex. My superficial thinking was that the ordering of the random nodes would be handled by sorting, as it is currently, but because the numbers are random instead of sequential, the 'previous' node in the sorted list of children can potentially change each time a new entry is added. So the current algorithm would still work (I believe, without putting too much thought into it), but it would be less efficient, since clients would end up checking for leadership when they won't obtain it.
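
(A toy illustration of that point, with made-up node names: a new participant's random number can slot in between an existing node and the node it currently considers its predecessor.)

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy example only: with random (non-sequential) numbers, a newly added child
// can sort in between existing ones, so the node immediately before ours can
// change after we have already set our watch.
public final class ShiftingPredecessorExample {
    public static void main(String[] args) {
        List<String> children = new ArrayList<>(List.of("latch-0005", "latch-0012", "latch-0030"));
        Collections.sort(children);
        System.out.println("latch-0030 initially watches: " + predecessorOf(children, "latch-0030")); // latch-0012

        children.add("latch-0020"); // new participant happens to pick a number between 12 and 30
        Collections.sort(children);
        System.out.println("latch-0030 should now watch: " + predecessorOf(children, "latch-0030")); // latch-0020
    }

    private static String predecessorOf(List<String> sortedChildren, String node) {
        int index = sortedChildren.indexOf(node);
        return index > 0 ? sortedChildren.get(index - 1) : "(none - leader)";
    }
}
```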

jira-importer avatar Nov 14 '21 21:11 jira-importer

randgalt:

Cam McKenzie OK - I think the way previous nodes are watched is the biggest issue. Also, we shouldn't change the behavior for existing clients, so this randomness would have to be optional. Once all that is done, there isn't much left for this recipe IMO.

jira-importer avatar Nov 14 '21 21:11 jira-importer

cammckenzie:

Jordan Zimmerman, agreed that the existing recipe's functionality should not change. This randomness is definitely an optional extra.

Getting the previous-node watching logic right is definitely going to be the most complicated piece of code. Each client would need to watch the children of the latch path and re-evaluate its watches (cancel the old one and create a new one) whenever its 'previous' node changes.
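
(Roughly, that bookkeeping might look like the sketch below; the class and method names are invented, nothing here is proposed code, and error handling is omitted.)

```java
import java.util.Collections;
import java.util.List;
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.api.CuratorWatcher;

// Hypothetical sketch of the extra bookkeeping a randomized LeaderLatch would need:
// each client watches the children of the latch path and, whenever they change,
// recomputes its predecessor and moves its watch if the predecessor has changed.
public final class ReevaluatingWatchSketch {
    private volatile String watchedPredecessor; // node we currently watch, or null

    public void onChildrenChanged(CuratorFramework client, String latchPath, String ourNodeName) throws Exception {
        CuratorWatcher watcher = event -> onChildrenChanged(client, latchPath, ourNodeName);
        List<String> children = client.getChildren().usingWatcher(watcher).forPath(latchPath);
        Collections.sort(children); // with random numbers, the order can change on every join
        int ourIndex = children.indexOf(ourNodeName);
        if (ourIndex <= 0) {
            watchedPredecessor = null; // lowest number -> leader (or our node is gone)
            return;
        }
        String predecessor = children.get(ourIndex - 1);
        if (!predecessor.equals(watchedPredecessor)) {
            // ZooKeeper one-shot watches can't be cancelled; when a stale one fires we
            // simply re-evaluate. Here we arm a fresh watch on the new predecessor so
            // its deletion also triggers re-evaluation.
            client.checkExists().usingWatcher(watcher).forPath(latchPath + "/" + predecessor);
            watchedPredecessor = predecessor;
        }
    }
}
```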

jira-importer avatar Nov 14 '21 22:11 jira-importer