
[Pools] Improve Concurrent Job Launch

Open lloyd-brown opened this issue 2 months ago • 16 comments

Problem

Concurrently launching multiple jobs on pools is currently slow and failure-prone. Most of the time goes to steps in the job provisioning process that we unnecessarily repeat for every job (submitting controller tasks, rsyncing files, invoking the jobs scheduler).

Approach

This PR improves the submission of multiple jobs by sharing nearly all of the job submission steps among the job replicas. We now:

  • Pre-create all job IDs in bulk with a single database operation on the jobs controller
  • Create a single controller task that launches all job replicas, so the state of the job (yaml files, dag) only has to be assembled once
  • Invoke the scheduler from that controller task with a single call to the python interpreter
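As a rough illustration of the first bullet, bulk ID pre-creation can be sketched with the standard-library `sqlite3` module. The `jobs` table, its columns, and the `precreate_job_ids` helper are hypothetical names made up for this sketch; they are not the schema or code used by the jobs controller.

```python
import sqlite3

# Hypothetical schema, for illustration only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    job_id INTEGER PRIMARY KEY AUTOINCREMENT,
    name   TEXT NOT NULL,
    status TEXT NOT NULL
)
"""


def precreate_job_ids(conn: sqlite3.Connection, name: str, num_jobs: int) -> list:
    """Reserve `num_jobs` rows in one transaction and return their IDs."""
    # Remember the highest existing ID so the new rows can be found afterwards.
    (prev_max,) = conn.execute(
        "SELECT COALESCE(MAX(job_id), 0) FROM jobs").fetchone()
    with conn:  # commits once on success, rolls back on error
        conn.executemany(
            "INSERT INTO jobs (name, status) VALUES (?, ?)",
            [(name, "INIT")] * num_jobs,
        )
        rows = conn.execute(
            "SELECT job_id FROM jobs WHERE job_id > ? ORDER BY job_id",
            (prev_max,),
        ).fetchall()
    return [row[0] for row in rows]
```

The point of the sketch is only that N jobs cost one round trip and one commit instead of N; it assumes a single writer on the connection.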

Sharing the job dag took a bit of extra care. The environment variable $SKYPILOT_JOB_RANK lets a job use its rank to parallelize work, and it is currently set by appending an env var to the task object. Since the value must be different for each job, we can't append it to a task object that every replica shares. To fix this, we create a dictionary in our controller task that maps from the replica ID to the rank, store it in a file on the jobs controller, and load it when we create the JobController instance for a job.
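The rank mapping described above can be sketched roughly as follows. The JSON file format and the helper names `write_rank_map` and `env_for_replica` are assumptions for illustration; the PR does not state how the file is encoded or what the functions are called.

```python
import json
from pathlib import Path


def write_rank_map(replica_ids, path):
    """Controller task side: assign ranks 0..N-1 and persist the mapping."""
    # JSON object keys must be strings, so replica IDs are stringified.
    rank_by_replica = {str(rid): rank for rank, rid in enumerate(replica_ids)}
    Path(path).write_text(json.dumps(rank_by_replica))
    return rank_by_replica


def env_for_replica(replica_id, path, base_env):
    """Per-job side: load the mapping and build this replica's env vars."""
    rank_by_replica = json.loads(Path(path).read_text())
    env = dict(base_env)  # copy, so the shared task env is never mutated
    env["SKYPILOT_JOB_RANK"] = str(rank_by_replica[str(replica_id)])
    return env
```

Because each replica builds its own copy of the environment, the shared dag stays identical for every job and only the looked-up rank differs.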

I have also added support for using gRPC to perform our task creation with add_job: a new num_jobs field indicates the number of jobs we want to create, and new job_ids and log_dirs return values let us get the job ids back in bulk.

For both codegen and gRPC, I've added code to stay compatible with a legacy jobs controller by repeatedly calling add_job until we have the number of jobs we need.
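The legacy fallback amounts to a loop like the one below. This is a minimal sketch: `reserve_job_ids`, `bulk_add`, and `legacy_add` are hypothetical stand-ins for the real codegen/gRPC calls, and how the client detects a legacy controller is not shown.

```python
from typing import Callable, List, Optional


def reserve_job_ids(
    num_jobs: int,
    legacy_add: Callable[[], int],
    bulk_add: Optional[Callable[[int], List[int]]] = None,
) -> List[int]:
    """Get `num_jobs` IDs, in bulk if supported, else one call per job."""
    if bulk_add is not None:
        # New controller: a single call returns every ID.
        return bulk_add(num_jobs)
    # Legacy controller: each call creates exactly one job, so keep
    # calling until enough IDs have been collected.
    job_ids: List[int] = []
    while len(job_ids) < num_jobs:
        job_ids.append(legacy_add())
    return job_ids
```

The fallback path is as slow as before on an old controller, but it keeps a new client working against it.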

I've also modified consolidation mode to support concurrent launch (previously it would create the tasks but fail to schedule them).

Testing

  • I added a new smoke test that ensures that $SKYPILOT_JOB_RANK is properly set
  • Added a new test to ensure that the launch time with --num-jobs is shortened

Remaining Work

  • I need to do backwards compatibility testing with my smoke tests for both codegen and gRPC against an old jobs controller
  • This also seems to have uncovered another issue: if you submit a large number of jobs, it takes a while for the jobs controller to move them over to pending. Unfortunately, we can't cancel jobs until they are pending, so you effectively have to wait until they have all moved over. Cancellations in the interim only cancel jobs that are already pending.

Performance

Launching 100 concurrent jobs used to take us 4.5 minutes and now takes 30 seconds!

https://github.com/user-attachments/assets/807f6cce-2d4c-4b7e-b994-cf041bd7a3bc

Update:

We have added a new backend call that instructs the jobs controller to set the job info and set the job to pending. The implications are:

  • Autostop is now supported since we aren’t calling add_job and leaving jobs in INIT
  • The jobs scheduler no longer has the set job info and pending calls I added
  • The cancellation issue I raised earlier is no longer a problem because jobs start in pending
  • We revert the changes to add_job

This also addresses https://github.com/skypilot-org/skypilot/issues/6932

Tested (run the relevant ones):

  • [ ] Code formatting: install pre-commit (auto-check on commit) or bash format.sh
  • [ ] Any manual or new tests for this PR (please specify below)
  • [ ] All smoke tests: /smoke-test (CI) or pytest tests/test_smoke.py (local)
  • [ ] Relevant individual tests: /smoke-test -k test_name (CI) or pytest tests/test_smoke.py::test_name (local)
  • [ ] Backward compatibility: /quicktest-core (CI) or pytest tests/smoke_tests/test_backward_compat.py (local)

lloyd-brown avatar Nov 07 '25 05:11 lloyd-brown

/smoke-test

lloyd-brown avatar Nov 10 '25 07:11 lloyd-brown

/smoke-test --managed-jobs

lloyd-brown avatar Nov 10 '25 21:11 lloyd-brown

/quicktest-core
/quicktest-core --base-branch 0.10.3
/quicktest-core --base-branch 0.9.3
/quicktest-core --base-branch 0.8.1

cg505 avatar Nov 11 '25 22:11 cg505

/smoke-test --managed-jobs --kubernetes
/smoke-test -k test_pools --kubernetes

lloyd-brown avatar Nov 15 '25 00:11 lloyd-brown

/smoke-test --managed-jobs --kubernetes

lloyd-brown avatar Nov 17 '25 17:11 lloyd-brown

/smoke-test --managed-jobs --kubernetes
/smoke-test -k test_pools --kubernetes

lloyd-brown avatar Nov 17 '25 20:11 lloyd-brown

/smoke-test --managed-jobs --kubernetes
/smoke-test -k test_pools --kubernetes

lloyd-brown avatar Nov 17 '25 20:11 lloyd-brown

/smoke-test --managed-jobs --kubernetes
/smoke-test -k test_pools --kubernetes

lloyd-brown avatar Nov 18 '25 20:11 lloyd-brown

/smoke-test -k test_managed_jobs_storage --kubernetes

lloyd-brown avatar Nov 19 '25 19:11 lloyd-brown

/smoke-test

lloyd-brown avatar Nov 19 '25 20:11 lloyd-brown

/smoke-test -k test_managed_jobs_basic --aws

lloyd-brown avatar Nov 19 '25 21:11 lloyd-brown

/smoke-test --managed-jobs --kubernetes

lloyd-brown avatar Nov 19 '25 22:11 lloyd-brown

/smoke-test -k test_pools --kubernetes

lloyd-brown avatar Nov 19 '25 22:11 lloyd-brown

/smoke-test --managed-jobs --kubernetes
/smoke-test --managed-jobs --kubernetes --jobs-consolidation
/smoke-test -k test_pools --kubernetes
/smoke-test -k test_pools --kubernetes --jobs-consolidation

lloyd-brown avatar Dec 01 '25 22:12 lloyd-brown

/smoke-test

lloyd-brown avatar Dec 11 '25 02:12 lloyd-brown

/smoke-test

lloyd-brown avatar Dec 11 '25 02:12 lloyd-brown