hail [batch] Orphaned attempts loop performs full scan of instances table

[batch] Orphaned attempts loop performs full scan of instances table

Open daniel-goldstein opened this issue 2 months ago • 0 comments

What happened?

Batch does not guarantee that there is always at most 1 running attempt for a job at any given time. While rare, this double scheduling can sometimes happen so there is a background task that checks the database for "orphaned" attempts -- attempts that are running but are not noted as the current attempt for the relevant job -- and stops them to reduce wasted spend. This query that polls the database for attempts to remove does a needless scan of the instances table. I'll describe below the process by which I discovered the inefficiency:

GCP Cloud SQL has a nice feature Query Insights, in which reports latencies and rows scanned by popular queries. For singular queries it can show a graph of the query and indicate bottlenecks. The below query is currently scanning over a million rows of the instances table is taking on average 642.37 ms:

https://github.com/hail-is/hail/blob/091e6612752010880a130cf4010897e87ea2a864/batch/batch/driver/canceller.py#L373-L382

as shown here from Query Insights: Screenshot 2024-04-11 at 10 31 17 AM

The thick edge on the instances scan indicates that the where condition for instances is not using an index. We can verify this by explaining the query against the DB:

> kssh admin-pod
> mysql
mysql> use batch;
mysql> EXPLAIN SELECT attempts.*
    -> FROM attempts
    -> INNER JOIN jobs ON attempts.batch_id = jobs.batch_id AND attempts.job_id = jobs.job_id
    -> LEFT JOIN instances ON attempts.instance_name = instances.name
    -> WHERE attempts.start_time IS NOT NULL
    ->   AND attempts.end_time IS NULL
    ->   AND ((jobs.state != 'Running' AND jobs.state != 'Creating') OR jobs.attempt_id != attempts.attempt_id)
    ->   AND instances.`state` = 'active'
    -> ORDER BY attempts.start_time ASC
    -> LIMIT 300\G;

*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: instances
   partitions: NULL
         type: ALL
possible_keys: PRIMARY
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1150201
     filtered: 10.00
        Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: attempts
   partitions: NULL
         type: ref
possible_keys: PRIMARY,attempts_instance_name
          key: attempts_instance_name
      key_len: 303
          ref: batch.instances.name
         rows: 91
     filtered: 9.00
        Extra: Using where
*************************** 3. row ***************************
           id: 1
  select_type: SIMPLE
        table: jobs
   partitions: NULL
         type: eq_ref
possible_keys: PRIMARY,jobs_batch_id_state_always_run_cancelled,jobs_batch_id_state_always_run_inst_coll_cancelled,jobs_batch_id_update_id,jobs_batch_id_always_run_n_regions_regions_bits_rep_job_id,jobs_batch_id_ic_state_ar_n_regions_bits_rep_job_id,jobs_batch_id_job_group_id,jobs_batch_id_ic_state_ar_n_regions_bits_rep_job_group_id
          key: PRIMARY
      key_len: 12
          ref: batch.attempts.batch_id,batch.attempts.job_id
         rows: 1
     filtered: 98.10
        Extra: Using where
3 rows in set, 1 warning (0.00 sec)

This is not great:

         rows: 1150201
     filtered: 10.00
        Extra: Using where; Using temporary; Using filesort

what we want to see is a low number of rows, a high percent filtered, and something like Extra: Using where; Using index;

We can then verify that this finding aligns with our current understanding of the database schema where there is no index on instances.state: https://github.com/hail-is/hail/blob/1f3a0503926b65f479dce6d5eb105236632f8d07/batch/sql/estimated-current.sql#L113-L138

Remaining questions

Why do we need to join against the instances table to find orphaned attempts? Can we not?
If we do need to join against the instances table, should we create an index on instances.state? How does instances.removed relate to this use case since it seems relevant and already has an index?

Deliverable

This query should perform indexed lookups of involved tables. The PR should compare the present and proposed EXPLAINs and provide some manual timing comparisons.

Version

0.2.130

Relevant log output

No response

Apr 11 '24 15:04 daniel-goldstein

hail hail copied to clipboard

[batch] Orphaned attempts loop performs full scan of instances table

What happened?

Remaining questions

Deliverable

Version

Relevant log output

hail
hail copied to clipboard