
[batch] Remove blob storage read inside Worker job creation endpoint

daniel-goldstein opened this issue on Apr 10, 2024

What happened?

Below is a high-level overview of how the Batch driver communicates scheduled jobs to worker nodes.

Scheduling loop on the driver:

  1. Select N ready jobs from the database to schedule on available workers
  2. Compute placement of a subset of the jobs in available slots in the worker pool
  3. Concurrently call /api/v1alpha/batches/jobs/create on available workers for each placed job. If/when the request completes successfully, the job is marked as scheduled.
  4. Once all requests complete, goto 1
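A minimal sketch of this loop, assuming hypothetical helpers (`db.select_ready_jobs`, `worker_pool.place`, `db.mark_job_scheduled`) that stand in for the real driver internals:

```python
import asyncio

import aiohttp


async def scheduling_loop(db, worker_pool) -> None:
    """Illustrative driver loop; every helper name here is hypothetical."""
    async with aiohttp.ClientSession() as session:
        while True:
            # 1. Select N ready jobs from the database
            ready_jobs = await db.select_ready_jobs(limit=50)
            # 2. Place a subset of the jobs into free slots in the worker pool
            placements = worker_pool.place(ready_jobs)
            # 3. Concurrently ask each chosen worker to create its placed job
            await asyncio.gather(
                *(schedule_one(db, session, worker, job) for worker, job in placements),
                return_exceptions=True,
            )
            # 4. Only once every request has completed does the loop repeat


async def schedule_one(db, session: aiohttp.ClientSession, worker, job) -> None:
    async with session.post(
        f'http://{worker.address}/api/v1alpha/batches/jobs/create', json=job
    ) as resp:
        resp.raise_for_status()
    # The job is marked scheduled only after the worker acknowledges the request
    await db.mark_job_scheduled(job['job_id'])
```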

On the worker, what happens inside /api/v1alpha/batches/jobs/create:

  1. Read metadata describing the job to schedule from the request body
  2. Using that information, load the full job spec from blob storage
  3. Spawn a task to run the job asynchronously
  4. Respond to the driver with a 200
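Today's handler is roughly shaped like the sketch below (assuming an aiohttp-style handler; `worker.fs.read` and `worker.run_job` are hypothetical stand-ins for the worker's blob storage client and job runner):

```python
import asyncio

from aiohttp import web


async def create_job(request: web.Request) -> web.Response:
    worker = request.app['worker']
    # 1. Read metadata describing the job from the request body
    job_metadata = await request.json()
    # 2. Load the full job spec from blob storage. This network read is the
    #    dominant, high-variance part of the endpoint's latency.
    spec = await worker.fs.read(job_metadata['spec_path'])
    # 3. Spawn a task to run the job asynchronously (a real implementation
    #    would retain a reference to the task)
    asyncio.create_task(worker.run_job(job_metadata, spec))
    # 4. Respond to the driver with a 200
    return web.Response(status=200)
```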

The key point for this issue is that the driver currently must wait for all of an iteration's requests to workers to complete before it starts the next iteration of the scheduler. This leaves the scheduler vulnerable to problematic workers, or to workers that happen to be preempted during the scheduling process, so the driver sets a 2-second timeout on the call to /api/v1alpha/batches/jobs/create. This design also means that, in the event of a request timeout or transient error, Batch cannot guarantee that there is always at most one concurrently running attempt for a given job. In practice this is a fine (and intentional) concession, because the idempotent design of preemptible jobs tends to cover this scenario, but it nonetheless wastes compute and costs users money.
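For concreteness, the timeout and the ambiguity it creates might look like the following (a sketch, not the actual driver code; `url` and `job_metadata` are stand-ins):

```python
import asyncio

import aiohttp


async def schedule_with_timeout(
    session: aiohttp.ClientSession, url: str, job_metadata: dict
) -> bool:
    try:
        timeout = aiohttp.ClientTimeout(total=2)
        async with session.post(url, json=job_metadata, timeout=timeout) as resp:
            return resp.status == 200
    except asyncio.TimeoutError:
        # The worker may or may not have already accepted the job. The driver
        # leaves the job ready and a later iteration may schedule it again, so
        # Batch relies on job idempotency rather than at-most-once delivery.
        return False
```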

Nevertheless, we strive to minimize cases where we might halt the scheduling loop or double-schedule work, and one way to do that in the current design is to minimize the variance in latency of /api/v1alpha/batches/jobs/create. The largest source of this latency is the request to blob storage. While GCS and ABS are relatively fast and highly available, Batch in Azure Terra must first obtain SAS tokens from the Terra control plane, which can introduce much higher and more variable latency. There have also been past occurrences of corrupted or deleted specs, which introduce unexpected failure modes that should error the job but instead disrupt the scheduling loop.

Many of these problems would be mitigated by moving the read from object storage outside of the /api/v1alpha/batches/jobs/create endpoint. The endpoint should push this read into the asynchronous task that ultimately runs the job, and can therefore return its acknowledgement to the driver faster. If the worker encounters errors later on while reading the spec, those should error the job instead of raising a 500 in the scheduling request.
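Under that proposal, the handler might instead be shaped like the following sketch (helper names are again hypothetical, and error handling is simplified):

```python
import asyncio

from aiohttp import web


async def create_job(request: web.Request) -> web.Response:
    worker = request.app['worker']
    job_metadata = await request.json()
    # Defer the blob storage read into the task that runs the job, so the
    # endpoint's latency no longer depends on object storage (or, in Azure
    # Terra, on fetching SAS tokens from the Terra control plane)
    asyncio.create_task(load_spec_and_run(worker, job_metadata))
    return web.Response(status=200)


async def load_spec_and_run(worker, job_metadata: dict) -> None:
    try:
        spec = await worker.fs.read(job_metadata['spec_path'])
    except Exception as e:
        # A corrupted or deleted spec now errors the job instead of failing
        # the scheduling request with a 500
        await worker.mark_job_errored(job_metadata['job_id'], e)
        return
    await worker.run_job(job_metadata, spec)
```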

Version

0.2.129

Relevant log output

No response
