hail icon indicating copy to clipboard operation
hail copied to clipboard

[batch] Consider rewriting scheduler and canceller queries to be more efficient

Open jigold opened this issue 3 months ago • 0 comments

What happened?

This is a follow up issue to a comment in the job groups PR #14282. The queries for the scheduler and canceller both pull the job groups that are schedulable/cancellable and then makes one of two queries depending on if the job is always run. This can be confusing that there's two similar queries and requires multiple calls to the database. There's a question here on whether it is possible to come up with a more efficient query given the indices we have or do we need to create a new index to support this. In case it's hard to find the exact comment in the PR, I pasted an example thread below:

Example comment from Dan: https://github.com/hail-is/hail/pull/14170/files#r1476847018

This is my concrete critqiue: the two SQL queries are nearly the same, but they are long enough to make seeing that challenging.

SELECT jobs.batch_id, jobs.job_id
FROM jobs FORCE INDEX(jobs_batch_id_state_always_run_cancelled)
WHERE batch_id = %s AND job_group_id = %s AND state = 'Ready' AND always_run = 0
LIMIT %s;
SELECT jobs.batch_id, jobs.job_id
FROM jobs FORCE INDEX(jobs_batch_id_state_always_run_cancelled)
WHERE batch_id = %s AND job_group_id = %s AND state = 'Ready' AND always_run = 0 AND cancelled = 1
LIMIT %s;

They only differ in the cancelled condition. As a reader, I'd prefer code that revealed that similarity and used names to suggest what the difference was doing, and maybe a comment, if there is no better way to say it, indicaing what you indicated in your GitHub comment.

if job_group['cancelled']:
    where_job_needs_cancelling = ''  # every job in a cancelled group needs cancelling
else:
    where_job_needs_cancelling = 'AND jobs.cancelled'  # jobs.cancelled means child of a failed job
query_for_jobs_to_be_cancelled = f"""
SELECT jobs.batch_id, jobs.job_id
FROM jobs FORCE INDEX(jobs_batch_id_state_always_run_cancelled)
WHERE batch_id = %s
  AND job_group_id = %s
  AND state = 'Ready'
  AND NOT always_run
  {where_job_needs_cancelling}
LIMIT %s;
"""
async for record in self.db.select_and_fetchall(
    query_for_jobs_to_be_cancelled,
    (job_group['batch_id'], job_group['job_group_id'], remaining.value),
):
    yield record

Other similar threads:

  • https://github.com/hail-is/hail/pull/14282#discussion_r1487118601
  • https://github.com/hail-is/hail/pull/14282#discussion_r1487150061
  • https://github.com/hail-is/hail/pull/14282#discussion_r1487156802

Version

0.2.128

Relevant log output

No response

jigold avatar Mar 14 '24 13:03 jigold