
[YSQL] Resolve Multiple Leader Scenarios

Open xenown opened this issue 2 years ago • 0 comments

Jira Link: DB-2763

Description

This is the fourth change in a series of diffs porting the Postgres extension pg_cron to Yugabyte. The following summary assumes you understand how the original pg_cron extension works. See https://github.com/yugabyte/yugabyte-db/issues/11907 for a brief description of how job assignment and execution will operate on Yugabyte.

Up until now, the leadership status of a cron worker has been based entirely on whether the tserver associated with that worker is the tablet leader for the only tablet of the cron.job table. The tablet leader can change, so cron worker leadership can change as well.

This presents two problems that are somewhat intertwined:

Problem 1: How does the new leader worker account for the time change when assigning job runs? Every leader worker has its own perception of time, so some clock skew should be expected whenever a leader switch occurs. The new leader can detect this change by checking the assignment time of the latest job run. Assignment time does not currently exist in cron.job_run_details, so we need to add a column for it. With the assignment time known, we need to define a job scheduling behaviour after a leader switch that accounts for time changes.

A well documented job scheduling behaviour we can use for time changes already exists in pg_cron. Examples for 2 of its most notable rules are:

  1. Assume all jobs up to minute T have been assigned. Time switches to start of minute T-3. Wait 4 minutes until the start of minute T+1 to start assigning jobs that are scheduled for minute T+1.
  2. Assume all jobs up to minute T have been assigned. Time switches to start of minute T+3. Immediately assign runs for jobs scheduled for minutes T+1, T+2 and T+3.

For the above rules to be usable, the new leader needs to assume that every job run belonging to the minute of the latest job run has already been assigned. It follows that a leader worker must assign all the job runs for a given minute in a single transaction.
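As a rough illustration of the two rules above, here is a minimal Python sketch. The function name `minutes_to_assign` and the integer minute representation are assumptions made for this example, not part of pg_cron or the port, and pg_cron's other time-change rules are not modelled.

```python
from typing import List


def minutes_to_assign(last_assigned: int, now: int) -> List[int]:
    """Return the minutes whose job runs a new leader should assign right now.

    last_assigned: minute T of the latest assignment in cron.job_run_details
    now: the new leader's current minute (both as integer minute counts)
    """
    if now <= last_assigned:
        # Rule 1: the clock appears to have moved backwards (or not at all).
        # Assign nothing until the local clock reaches minute T+1.
        return []
    # Rule 2 (and the normal case): catch up on every minute after T,
    # up to and including the current minute.
    return list(range(last_assigned + 1, now + 1))
```

With T = 100, a leader whose clock reads minute 97 gets an empty list and keeps waiting, while a leader whose clock reads minute 103 gets minutes 101, 102 and 103 to assign at once.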

Problem 2: How do we handle multiple leader workers? A cron worker polls periodically for its leadership status. Regardless of the polling frequency, it is possible for there to be more than one leader worker in the cluster for some time (e.g. two workers both read their shared memory at possibly different times and find that they are a leader). This can result in jobs being assigned to run more than once per minute for some scheduled time if we are not careful.

Job run assignment occurs when a leader worker inserts into the cron.job_run_details table. Hence, there are 2 cases to consider for this problem:

  1. Concurrent inserts: multiple leader workers assign job runs at the same time.
  2. Non-concurrent inserts: a leader worker assigns job runs after another leader worker has already assigned job runs.

An easy way to account for case 1 is to have the leader workers acquire a row-level lock on a dummy row in some table. Trailing leader workers would fail to acquire the lock and immediately transition to being a worker. Some time later, each such worker will refresh its leadership status. If all cron helper clients (see #11907) are publishing the correct leadership status to all cron workers, the cluster should converge to a state with a single cron leader worker. In the worst case, another leader switch occurs (the solution to problem 1 still applies) or there is no leader worker in the cluster (the pg_cron extension cannot operate if the job table tablet isn't available).
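The locking idea for case 1 can be sketched in-process as follows. This is only a stand-in: `threading.Lock` plays the role of the row-level lock on the dummy row, and `CronWorker`, `try_assign` and `dummy_row_lock` are hypothetical names. A real implementation would take the lock through SQL inside the assigning transaction.

```python
import threading

# Stand-in for the row-level lock on the dummy row (hypothetical).
dummy_row_lock = threading.Lock()


class CronWorker:
    def __init__(self, name: str) -> None:
        self.name = name
        self.is_leader = True  # the worker believes it is a leader after polling

    def try_assign(self, assigned: list) -> bool:
        # Non-blocking acquire: a trailing leader fails immediately
        # instead of queueing behind the current lock holder.
        if not dummy_row_lock.acquire(blocking=False):
            self.is_leader = False  # step down to being a plain worker
            return False
        try:
            assigned.append(self.name)  # "insert" this minute's job runs
            return True
        finally:
            dummy_row_lock.release()
```

A worker that steps down here would regain leadership only through its next periodic leadership refresh, which is what lets the cluster converge on one leader.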

For case 2, we have the following scenario which produces the problem:

  1. Leader worker A is at the start of minute M
  2. Worker B is at the end of minute M and becomes a leader worker.
  3. There exists a job J that is scheduled to run every minute. A assigns a job run for J for minute M. The job run completes immediately.
  4. 1 second later, B gets to the start of minute M+1 and assigns a job run for J for minute M+1. Only 1 real second has elapsed between the two job runs assigned by A and B.
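The timing in this scenario can be checked with a little arithmetic. The timestamp and the 59-second skew below are arbitrary values chosen for illustration.

```python
from datetime import datetime, timedelta

real_now = datetime(2022, 4, 12, 19, 4, 0)  # arbitrary real-time instant
skew_b = timedelta(seconds=59)              # assume B's clock runs 59 s ahead of A's

# A assigns job J at the start of its minute M.
a_local = real_now
a_assigned_minute = a_local.replace(second=0, microsecond=0)

# One real second later, B's clock reaches the start of minute M+1.
real_later = real_now + timedelta(seconds=1)
b_local = real_later + skew_b
b_assigned_minute = b_local.replace(second=0, microsecond=0)

scheduled_gap = b_assigned_minute - a_assigned_minute  # gap as seen by the schedule
real_gap = real_later - real_now                       # gap in real time
```

Here `scheduled_gap` is one minute while `real_gap` is one second, so J runs twice in quick succession even though each leader followed its own clock correctly.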

If we follow the job scheduling rules established in the solution to problem 1, the cases that cause problem 2 can only occur when the following conditions are met:

  • the new leader is ahead in time compared to the old leader. Otherwise, the new leader will wait at least a minute before it will attempt to assign something.
  • the new leader is going to assign a job run at its next minute for a job that was already assigned to run within the last minute by the previous leader.

One way to ensure these conditions are never met is to have every worker that transitions to being a leader worker wait 1 minute before assigning any job runs. Job runs will still be queued during that minute of waiting. This ensures that there will always be at least 1 minute between any two runs of the same job. However, this assumes that the cluster converges to a state with a single cron leader worker within that minute. If we cannot make this guarantee, then another option is to rely on the primary key property of the runid column in cron.job_run_details. Leader workers would have to:

  1. Keep some runid counter X in memory.
  2. Provide runid X when inserting rows into cron.job_run_details.
  3. Increment X after an insert.
  4. Initially set X to be one after the greatest runid in cron.job_run_details any time a worker transitions to being a leader worker.

Since each leader worker only reads from the table once for the latest runid, only one leader worker will be able to insert while the rest will find a primary key conflict. Those that get the conflict would transition back to being a worker.
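The runid scheme can be sketched with an in-memory stand-in for the table. `JobRunDetails` and `LeaderWorker` are hypothetical names for this example, and a dictionary keyed by runid imitates the primary key constraint.

```python
class JobRunDetails:
    """In-memory stand-in for cron.job_run_details, keyed by runid."""

    def __init__(self) -> None:
        self.rows = {}

    def max_runid(self) -> int:
        return max(self.rows, default=0)

    def insert(self, runid: int, job: str) -> None:
        if runid in self.rows:
            raise KeyError(f"primary key conflict on runid {runid}")
        self.rows[runid] = job


class LeaderWorker:
    def __init__(self, name: str, table: JobRunDetails) -> None:
        self.name = name
        self.table = table
        self.is_leader = True
        # Step 4: read the table once, at the moment of becoming leader.
        self.next_runid = table.max_runid() + 1

    def assign(self, job: str) -> bool:
        try:
            # Step 2: supply the in-memory runid explicitly.
            self.table.insert(self.next_runid, job)
        except KeyError:
            # Another leader already used this runid: step down.
            self.is_leader = False
            return False
        self.next_runid += 1  # Step 3
        return True
```

If two workers become leaders against the same table state, both start from the same runid, so the first insert succeeds and the second hits the conflict and steps down.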

xenown, Apr 12 '22 19:04