hail
hail copied to clipboard
[batch] Orphaned attempts loop performs full scan of instances table
What happened?
Batch does not guarantee that there is always at most 1 running attempt for a job at any given time. While rare, this double scheduling can sometimes happen so there is a background task that checks the database for "orphaned" attempts -- attempts that are running but are not noted as the current attempt for the relevant job -- and stops them to reduce wasted spend. This query that polls the database for attempts to remove does a needless scan of the instances table. I'll describe below the process by which I discovered the inefficiency:
- GCP Cloud SQL has a nice feature Query Insights, in which reports latencies and rows scanned by popular queries. For singular queries it can show a graph of the query and indicate bottlenecks. The below query is currently scanning over a million rows of the instances table is taking on average 642.37 ms:
https://github.com/hail-is/hail/blob/091e6612752010880a130cf4010897e87ea2a864/batch/batch/driver/canceller.py#L373-L382
as shown here from Query Insights:
- The thick edge on the instances scan indicates that the where condition for instances is not using an index. We can verify this by explaining the query against the DB:
> kssh admin-pod
> mysql
mysql> use batch;
mysql> EXPLAIN SELECT attempts.*
-> FROM attempts
-> INNER JOIN jobs ON attempts.batch_id = jobs.batch_id AND attempts.job_id = jobs.job_id
-> LEFT JOIN instances ON attempts.instance_name = instances.name
-> WHERE attempts.start_time IS NOT NULL
-> AND attempts.end_time IS NULL
-> AND ((jobs.state != 'Running' AND jobs.state != 'Creating') OR jobs.attempt_id != attempts.attempt_id)
-> AND instances.`state` = 'active'
-> ORDER BY attempts.start_time ASC
-> LIMIT 300\G;
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: instances
partitions: NULL
type: ALL
possible_keys: PRIMARY
key: NULL
key_len: NULL
ref: NULL
rows: 1150201
filtered: 10.00
Extra: Using where; Using temporary; Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: attempts
partitions: NULL
type: ref
possible_keys: PRIMARY,attempts_instance_name
key: attempts_instance_name
key_len: 303
ref: batch.instances.name
rows: 91
filtered: 9.00
Extra: Using where
*************************** 3. row ***************************
id: 1
select_type: SIMPLE
table: jobs
partitions: NULL
type: eq_ref
possible_keys: PRIMARY,jobs_batch_id_state_always_run_cancelled,jobs_batch_id_state_always_run_inst_coll_cancelled,jobs_batch_id_update_id,jobs_batch_id_always_run_n_regions_regions_bits_rep_job_id,jobs_batch_id_ic_state_ar_n_regions_bits_rep_job_id,jobs_batch_id_job_group_id,jobs_batch_id_ic_state_ar_n_regions_bits_rep_job_group_id
key: PRIMARY
key_len: 12
ref: batch.attempts.batch_id,batch.attempts.job_id
rows: 1
filtered: 98.10
Extra: Using where
3 rows in set, 1 warning (0.00 sec)
This is not great:
rows: 1150201
filtered: 10.00
Extra: Using where; Using temporary; Using filesort
what we want to see is a low number of rows, a high percent filtered, and something like Extra: Using where; Using index;
- We can then verify that this finding aligns with our current understanding of the database schema where there is no index on
instances.state
: https://github.com/hail-is/hail/blob/1f3a0503926b65f479dce6d5eb105236632f8d07/batch/sql/estimated-current.sql#L113-L138
Remaining questions
- Why do we need to join against the instances table to find orphaned attempts? Can we not?
- If we do need to join against the instances table, should we create an index on
instances.state
? How doesinstances.removed
relate to this use case since it seems relevant and already has an index?
Deliverable
This query should perform indexed lookups of involved tables. The PR should compare the present and proposed EXPLAINs and provide some manual timing comparisons.
Version
0.2.130
Relevant log output
No response