hail
[batch] Add support for open batches
Stacked on #11998
This PR implements open batches. Future PRs will expose the functionality to users in the Query Service and hailtop.batch. There's a design document that specifies all of the changes.
To briefly summarize, there is now a concept of batch updates, and each job belongs inside an "update". The BatchClient now has two types of builders: UpdateBatchBuilder and CreateBatchBuilder. I play some tricks with the job ids: they are allowed to be negative numbers, denoting positions relative to an offset, which makes updating a batch more efficient because you don't have to make multiple API calls to get the current job offset in the batch.
There are only two batch states in the database: `running` and `complete`. A batch starts out as `complete`; when an update is committed, if `n_jobs > 0`, it will change to `running`.
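As a sketch, the state rule described above might look like the following (the function name and signature are illustrative, not the actual hail implementation):

```python
# Illustrative model of the two-state batch model: a batch is 'running'
# only while it has committed jobs that haven't finished; otherwise it
# is 'complete'. Names here are hypothetical.
def batch_state(n_committed_jobs: int, n_completed_jobs: int) -> str:
    # A batch with no committed jobs starts (and stays) 'complete';
    # committing an update with n_jobs > 0 moves it to 'running'.
    if n_committed_jobs > 0 and n_completed_jobs < n_committed_jobs:
        return 'running'
    return 'complete'
```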
The main thing to look at implementation-wise is the new stored procedure `commit_batch_update`, which contains a nasty update that will block progress on the batch while the update is in progress.
I added the updates to the UI. We can get rid of it if it's too confusing. There's also a `Time Updated` column now in the UI instead of `Time Closed`.
![Screen Shot 2022-07-07 at 5 34 07 PM](https://user-images.githubusercontent.com/1693348/177875535-e9b3a99f-bdc9-4a3b-8a53-5d20df05f161.png)
cc: @danking
It's probably not going to pass CI on the first try after the rebase. I'm mainly looking for feedback on the API first.
I think we should find a time to discuss this in person if the following explanation doesn't make sense.
Right now, for small batches, we send one REST request to the server to both create the batch and create the jobs. However, if we want one REST request for an update (ideal for the query service and low-latency jobs?), we have to use relative job ids because (1) we don't know the absolute start index of the jobs until we've gotten the start id of the update back from the server and (2) the job dependencies can be a mix of new jobs and known job ids submitted in a previous creation/update. The negative job IDs are a way to deal with a mix of relative ids within an update and known, already-submitted job ids.
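A minimal sketch of how such negative, update-relative ids could resolve to absolute ids once the server assigns the update's start id (the exact offset convention here is my assumption, not necessarily what the PR implements):

```python
# Hypothetical resolution of the negative-job-id encoding described
# above. Within an update, -1 denotes the first job in the update,
# -2 the second, and so on; positive ids are absolute ids of jobs
# committed in earlier creations/updates.
def resolve_job_id(job_id: int, start_job_id: int) -> int:
    if job_id < 0:
        # -1 maps to start_job_id, -2 to start_job_id + 1, ...
        return start_job_id + (-job_id - 1)
    return job_id  # already absolute
```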
We can simplify things if we require all updates to make two requests to the server: (1) get the start id and establish the update, and then (2) submit new jobs with all absolute job IDs. I'd have to make sure this will actually simplify things, because I also ran into a bifurcation in how the job IDs are handled in `BatchBuilder.create_job()`. We currently populate the spec with a job id before we've made any requests to the server. We need to know how many total jobs there are before we can figure out the job ids, because the API for creating a new update requires reserving a block of job IDs, which then returns the start id. This complexity exists because we allow multiple updates to a batch to occur simultaneously.
This is indeed tricky, but I think there are other options here. For example, we could expand the spec so that a job specifies either a job_id or an update_relative_job_id, and the same for parents. This way the relative-vs-absolute job id is baked into the schema of the spec, and not inferred from the sign of the job id. Though similar in concept, I think that would be much less confusing. However,
> We can simplify things if we require all updates make two requests to the server to (1) get the start id and establish the update and then (2) submit new jobs with all absolute job IDs.
I'd like to try this first. I feel like if we get a really solid API and it has a couple of superfluous requests in some edge cases, we will be able to come up with good performance shortcuts that don't muddle the normal path. Since the Query Driver currently lives the full life of the batch and is likely to stay that way for a while, it will satisfy these conditions without making any extra requests.
> This complexity is because we allow multiple updates to occur simultaneously to a batch.
Right, I left this out… I am unsure, if we used relative IDs, how we would prepare the following update without knowing the IDs of the jobs being submitted in the current update, so I think some supplementary request for a reservation of IDs will be necessary anyway.
Another option we might consider is specifying job dependencies with UUIDs instead of numerical IDs.
Ok. I made a new draft of the client, but didn't touch `_create_jobs` on the front end at all. Hopefully this is better because job ids are no longer negative. If not, I can try again.
Ok, I think I'm getting a much better understanding of the situation, thank you! I think this is a lot easier for at least me to understand, and if I'm not mistaken, it's mostly just moving the same code around into `JobSpec`, right? I think `submit` is a lot easier to follow now, though I do sympathize with the pain of the three round-trips for small updates. What do you think about the following proposal?
- Jobs are always submitted to the server with job_ids relative to the update that is being submitted (sorry for the complete reversal, this is just an idea!). This shouldn't have any effect on the current create/create-fast since with only 1 update, relative and absolute job ids are the same. This also means that the client doesn't need to know what the `start_job_id` is ahead of submitting a bunch. Submitting the update could return the `start_job_id` so that the client can rectify its local `Job` objects with absolute IDs, like you do in `_get_job_specs_as_json`, after the fact. The server will take care of doing `absolute_job_id = update_start_job_id + in_update_job_id`.
- parent_ids are represented as positive integers that are tagged with a type, either `in-update` or `absolute`. This is very similar to what you were previously doing with negative numbers, but I think baking it into the schema is going to be less foot-gunny than negative numbers, and the client doesn't have to do any special logic of counting backwards.
Sorry if it's similar to what you were doing before, but I think it has taken me a while to fully understand the limitations here. I also think it wouldn't take many changes to the current client implementation to do this.
What do you think of, rather than negative numbers or written-out types, we just do this? `1a` and `1r`. `a` is absolute and `r` is relative. A plain `1` implies absolute so we don't break old clients. This way we encode the type of the job id without adding a lot more heft to the job spec, which would be both more expensive to transmit and to store long term.
In a similar vein, what if we just allow the client to specify either `parent_id` (which we keep the same and treat as absolute) or `rel_parent_id`? Even though we're using JSON and everything is a string anyway, I think I would like to keep anything that can be strictly numeric as numeric and not do additional parsing.
ok. that's fine. I was worried you wanted something like this, which I was going to strongly oppose:

```python
parent_ids = [
    {'id': 7, 'type': 'absolute'},
    {'id': 8, 'type': 'in_update'}
]
```
ya I think I wanted something like that in concept, but I agree with your sentiments about the implementation being wasteful. Having `parent_ids` and `rel_parent_ids` that are both just arrays of numbers, where the latter gets transformed and concatenated onto the former, seems like an efficient and nicely backwards-compatible implementation.
Great! I think we are good on the client. I'll modify it later today to use the new scheme. Before we get into the nitty-gritty of the actual implementation, can we move on to the UI components and the semantics of using the client in `test_batch.py`? There's also a change to how the batch fields `time_closed` and `time_created` are used. I also added `time_updated`. I think the new semantics are:
- time_created -- time the batch was created
- time_updated -- time the last update was committed
- time_completed -- the last time `n_completed == n_jobs`, regardless of whether there are outstanding updates that haven't been committed
For old batches:
- time_created => same
- time_closed => time_updated
- time_completed => same
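The mapping for old batches above could be sketched as a hypothetical migration helper (not actual migration code; field names follow the list above):

```python
# Illustrative mapping of old batch time fields to the new semantics:
# time_closed is repurposed as time_updated; the others carry over.
def migrate_time_fields(old: dict) -> dict:
    return {
        'time_created': old['time_created'],      # unchanged
        'time_updated': old['time_closed'],       # time_closed => time_updated
        'time_completed': old['time_completed'],  # unchanged
    }
```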
I also changed what the batch state means:
There are only two batch states in the database: running and complete. A batch starts out as complete until an update is committed at which point if the n_jobs > 0, it will change to running. There are no longer "open" batches.
It's possible I didn't actually implement exactly what I described above as I was having a hard time figuring out whether time_updated should be equal to time_completed if the batch has no outstanding jobs to run. That's why I'd like to take a step back and make sure we agree on the interface.
I will take a look at `test_batch.py`, but I largely agree with the semantics you've written here. The one thing I'm unsure about is `time_completed`. Happy to have my mind changed, but my initial instinct is that:
- It's weird to call a batch complete if it has no jobs. It's elegant, and maybe the right thing, but it feels weird.
- A batch should not have a time_completed if it is still running, i.e. there are pending/running jobs. My instinct says that `time_completed` should be optional, and should only have a value if there are >0 committed jobs in the batch and they are all complete.
As a caveat to my complete statement, maybe we consider a batch with 0 jobs as "complete" internally, but just word it differently to the user on the front-end. I think my unease is primarily a user-facing one.
> A batch should not have a time_completed if it is still running
Pretty sure I implemented this, but good to be in agreement before digging into the code.
> time_completed should be optional, and should only have a value if there are >0 committed jobs in the batch and they are all complete
I don't have a strong opinion either way, but I think it's important that the batch state be complete, and it seemed weird to have a "complete" batch with no time completed. So I did the thing that keeps the UI consistent between the listed batch state and time_completed.
Ok, this feels to me mostly like a UI issue: we can treat adjustments to how these edge cases are presented to the user as a secondary concern, with the primary concern being a sensible database model, which I think we agree on. I'll take a look at the code.
I think I addressed the changes in the client, but I'm sure it's not 100% correct yet. Will deal with any bugs later. Can we move on to the implementation? I'm thinking the best place to start is `_create_jobs` and then the SQL implementation of what it means to create and utilize the new updates infrastructure, but happy to do whatever is easiest for you.
I think I'm seeing more where this approach is coming from, specifically we put batches as they exist today in a special category of having no updates and avoid the new code path in that case. An alternative which pairs with my above suggestion of not adding new staging tables is that all batches have at least 1 update. I feel like if we can force all batches down the new code path we'll be incentivized to make it really low overhead for batches that only submit jobs once, and that will benefit all batches, as well as simplifying the mental model. I may be wrong that we can do this with minimal performance tradeoff, but I'd like to try it first.
> I think I'm seeing more where this approach is coming from, specifically we put batches as they exist today in a special category of having no updates and avoid the new code path in that case. An alternative which pairs with my above suggestion of not adding new staging tables is that all batches have at least 1 update. I feel like if we can force all batches down the new code path we'll be incentivized to make it really low overhead for batches that only submit jobs once, and that will benefit all batches, as well as simplifying the mental model. I may be wrong that we can do this with minimal performance tradeoff, but I'd like to try it first.
Can you elaborate more? I'm not sure which code paths you are referring to.
All tables that are written to in `_create_jobs` must be tokenized, or the job insertion will be serial. This is the reason for the tokens in the staging tables. I had to add another layer of staging, which is why the original cancellable-resources table has a token; the data is now collapsed into a single row with token 0 in the new stored procedure.
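A toy model of why tokenization matters: concurrent writers each touch a row keyed by a random token rather than contending on one row, and the commit step collapses the shards into a single row (token 0). This is illustrative only; the real implementation is in SQL:

```python
import random

# Number of token shards; purely illustrative.
N_TOKENS = 16

def staged_increment(shards: dict, amount: int) -> None:
    # Each writer picks a random token, so concurrent transactions
    # usually lock different rows instead of serializing on one.
    token = random.randrange(N_TOKENS)
    shards[token] = shards.get(token, 0) + amount

def collapse(shards: dict) -> int:
    # On commit, the shards are summed into a single row (token 0).
    return sum(shards.values())
```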
I think your idea of adding new columns to the existing tables is better and is now something we can do. When I wrote this PR, MySQL v8.0 wasn't an option.
Also, I'm not opposed to doing a migration of the existing batches to having rows in the updates table. It will just take some work to write a chunking script because we don't want to do the migration all in one transaction. Luckily this code doesn't have to be as well tested as other migrations.
I'm adding an additional migration PR that will put in the backwards compatibility for the updates table etc.
> I think I'm seeing more where this approach is coming from, specifically we put batches as they exist today in a special category of having no updates and avoid the new code path in that case. An alternative which pairs with my above suggestion of not adding new staging tables is that all batches have at least 1 update. I feel like if we can force all batches down the new code path we'll be incentivized to make it really low overhead for batches that only submit jobs once, and that will benefit all batches, as well as simplifying the mental model. I may be wrong that we can do this with minimal performance tradeoff, but I'd like to try it first.
>
> Can you elaborate more? I'm not sure which code paths you are referring to.
Mainly that the commit procedure branches on whether the start id is 1 and that we sometimes grab the update id from the batch token and sometimes from the update table. Not very different code paths but slightly different, which could lead to some confusion.
Sorry for not getting this done quicker. There are two new soon-to-be PRs in the stack that you can see as commits here:
I'll make PRs for them on Monday once you give me the green light that no other major database changes are needed.
Also, having the migration be online and in multiple stages was making my head hurt a lot. Maybe on Monday we should sit down and white board this and make sure there's no edge cases.
Felt quick to me! Ya, we can talk this over on Monday, but from a quick glance it seems like a reasonable approach.
> Mainly that the commit procedure branches on whether the start id is 1
I was trying to avoid an expensive query to update all of the n_pending_parents when we know it's the first update being committed. I think I'd prefer to keep this branching there.
I was having trouble figuring out how to handle the token and the attributes in hailtop.batch_client.aioclient.Batch. When we create an update from a Batch that already existed, perhaps in a different process, we don't have the attributes and token. I made a contract where `commit_update` always returns the token and attributes, regardless of whether the BatchBuilder already has that information. However, we could also fetch that information lazily and cache the result. In addition, the `n_jobs` returned to the client is the number of jobs that are committed, which is not the same as the `n_jobs` in the batches table.
Things to do before merging:
- Get rid of the batch updates additions to the UI.
- Double check the GCP Logs Explorer to make sure there are no silent error messages, especially with regards to cancellation.
- Have @danking look over the SQL stored procedure for `commit_batch_update` to make sure that query is going to perform as well as is possible given the complexity of the check.
- Run a test batch with the old client (I just checked out the current version of main). You need to make sure both create and create-fast are accounted for and succeed. I've been using the following script to make sure we're using the slow path in addition to the fast path with a regular small test job:
```python
from hailtop.batch import ServiceBackend, Batch
import secrets

backend = ServiceBackend(billing_project='hail')
b = Batch(backend=backend)

# 8 * 256 * 1024 = 2 MiB > 1 MiB max bunch size
for i in range(8):
    j1 = b.new_job()
    long_str = secrets.token_urlsafe(256 * 1024)
    j1.command(f'echo "{long_str}" > /dev/null')

batch = b.run()
```
Closing in favor of #12199