hail icon indicating copy to clipboard operation
hail copied to clipboard

[batch] Add Job Groups to Batch

Open jigold opened this issue 4 months ago • 9 comments

This PR adds the job groups functionality as described in this RFC to the Batch backend and hailtop.batch_client. This includes supporting nested job groups up to a maximum depth of 5. Note, that none of these changes are user-facing yet (hence no change log here).

The PRs that came before this one:

  • #13475
  • #13487
  • #13810 (note that this database migration required a shutdown)

Subsequent PRs will need to implement the following:

  • Querying job groups with the flexible query language (v2)
  • Implementing job groups in the Scala Client for QoB
  • Using job groups in QoB with cancel_after_n_failures=1 for all new stages of worker jobs
  • UI functionality to page and sort through job groups
  • A new hailtop.batch interface for users to define and work with Job Groups

A couple of nuances in the implementation came up that I also tried to articulate in the RFC:

  1. A root job group with ID = 0 does not belong to an update ("update_id" IS NULL). This means that any checks that look for "committed" job groups need to do (batch_updates.committed OR job_groups.job_group_id = %s) where "%s" is the ROOT_JOB_GROUP_ID.
  2. When job groups are cancelled, only the specific job group that was cancelled is inserted into job_groups_cancelled. This table does NOT contain all transitive job groups that were also cancelled indirectly. The reason for this is we cannot guarantee that a user wouldn't have millions of job groups and we can't insert millions of records inside a single SQL stored procedure. Now, any query on the driver / front_end must look up the tree and see if any parent has been cancelled. This code looks similar to the code below [1].
  3. There used to be DELETE FROM statements in commit_batch_update and commit_batch that cleaned up old records that were no longer used in job_group_inst_coll_cancellable_resources and job_groups_inst_coll_staging. This cleanup now occurs in a periodic loop on the driver.
  4. The job_group_inst_coll_cancellable_resources and job_groups_inst_coll_staging tables have values which represent the sum of all child job groups. For example, if a job group has 1 job and it's child job group has 2 jobs, then the staging table would have n_jobs = 3 for the parent job group and n_jobs = 2 for the child job group. Likewise, all of the billing triggers and MJC have to use the job_group_self_and_ancestors table to modify the job group the job belongs to as well its parent job groups.

[1] Code to check whether a job group has been cancelled.

SELECT job_groups.*,
  cancelled_t.cancelled IS NOT NULL AS cancelled
FROM job_groups
LEFT JOIN LATERAL (
  SELECT 1 AS cancelled
  FROM job_group_self_and_ancestors
  INNER JOIN job_groups_cancelled
    ON job_group_self_and_ancestors.batch_id = job_groups_cancelled.id AND
      job_group_self_and_ancestors.ancestor_id = job_groups_cancelled.job_group_id
  WHERE job_groups.batch_id = job_group_self_and_ancestors.batch_id AND
    job_groups.job_group_id = job_group_self_and_ancestors.job_group_id
) AS cancelled_t ON TRUE
WHERE ...

jigold avatar Feb 11 '24 22:02 jigold