[Pools] Improve Concurrent Job Launch
Problem
Concurrently launching multiple jobs on pools is currently slow and failure-prone. Most of the launch time comes from unnecessarily repeating steps of the job provisioning process for every job (submitting controller tasks, rsyncing files, invoking the jobs scheduler).
Approach
This PR improves the submission of multiple jobs by sharing nearly all of the job submission steps across the job replicas. We now:
- Pre-create our job IDs in bulk with a single database operation on the jobs controller
- Create a single controller task that launches all job replicas, which requires assembling the state of the job (YAML files, DAG) only once
- In that controller task, invoke the scheduler with a single call to the Python interpreter
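The bulk ID pre-creation step can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `jobs` table, its columns, and the helper names `create_schema` and `bulk_create_job_ids` are all hypothetical, and SQLite is assumed only to keep the example self-contained. It reserves every ID inside one transaction instead of one round trip per job.

```python
import sqlite3
from typing import List


def create_schema(conn: sqlite3.Connection) -> None:
    # Hypothetical table layout, used only for this sketch.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "job_id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, status TEXT)"
    )


def bulk_create_job_ids(conn: sqlite3.Connection, name: str,
                        num_jobs: int) -> List[int]:
    """Reserve num_jobs rows in a single transaction and return their IDs."""
    if num_jobs < 1:
        raise ValueError("num_jobs must be at least 1")
    with conn:  # commits once on success, rolls back on error
        # Remember the highest ID that existed before this batch.
        prev_max = conn.execute(
            "SELECT COALESCE(MAX(job_id), 0) FROM jobs").fetchone()[0]
        conn.executemany(
            "INSERT INTO jobs (name, status) VALUES (?, 'INIT')",
            [(name,)] * num_jobs,
        )
        # Every row above prev_max was created by this batch.
        rows = conn.execute(
            "SELECT job_id FROM jobs WHERE job_id > ? ORDER BY job_id",
            (prev_max,),
        ).fetchall()
    return [row[0] for row in rows]
```

The sketch assumes a single writer; with several concurrent writers the "IDs above the previous maximum" trick would need a stronger guarantee, such as a batch identifier column.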
Sharing the job DAG took a bit of extra care. The environment variable $SKYPILOT_JOB_RANK lets you use the rank to parallelize work, and it is currently set by appending an env var to the task object. Because the task object is now shared while the rank must be different for each replica, we can no longer set it there. To fix this, the controller task creates a dictionary that maps each replica ID to its rank, stores it in a file on the jobs controller, and loads it when we create the JobController instance for a job.
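A rough sketch of the rank mapping described above, under stated assumptions: the JSON file format and the helper names (`build_rank_map`, `save_rank_map`, `load_rank`, `env_for_job`) are hypothetical and chosen for illustration, and the replica ID is taken to be the job ID. Only the variable name `SKYPILOT_JOB_RANK` comes from this PR.

```python
import json
from typing import Dict, List


def build_rank_map(job_ids: List[int]) -> Dict[int, int]:
    # Rank follows the order in which the replicas were created.
    return {job_id: rank for rank, job_id in enumerate(job_ids)}


def save_rank_map(path: str, rank_map: Dict[int, int]) -> None:
    # JSON object keys must be strings, so convert the IDs explicitly.
    with open(path, "w", encoding="utf-8") as f:
        json.dump({str(k): v for k, v in rank_map.items()}, f)


def load_rank(path: str, job_id: int) -> int:
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)[str(job_id)]


def env_for_job(base_env: Dict[str, str], path: str,
                job_id: int) -> Dict[str, str]:
    # Copy the shared environment so the shared task state is never mutated.
    env = dict(base_env)
    env["SKYPILOT_JOB_RANK"] = str(load_rank(path, job_id))
    return env
```

Keeping the rank outside the shared task object means the DAG can be assembled once, while each replica still resolves its own value at controller start-up.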
I have also added gRPC support for bulk task creation through add_job: a new num_jobs field indicates the number of jobs we want to create, and new job_ids and log_dirs return fields let us get the job IDs back in bulk.
For both codegen and gRPC, I've added code to make sure that we stay compatible with a legacy jobs controller by calling add_job repeatedly until we have the number of jobs we need.
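The legacy fallback can be pictured as a simple loop. This is a hedged sketch rather than the actual implementation: `create_job_ids` and the callable passed to it are hypothetical stand-ins for the real codegen or gRPC call, and the assumption is that a legacy controller ignores the requested count and returns one ID per call, while a new controller returns all of them at once.

```python
from typing import Callable, List


def create_job_ids(add_job: Callable[[int], List[int]],
                   num_jobs: int) -> List[int]:
    """Call add_job until num_jobs IDs have been collected."""
    job_ids: List[int] = []
    while len(job_ids) < num_jobs:
        remaining = num_jobs - len(job_ids)
        new_ids = add_job(remaining)
        if not new_ids:
            # Avoid looping forever if the controller returns nothing.
            raise RuntimeError("add_job returned no job IDs")
        # Never keep more IDs than were requested.
        job_ids.extend(new_ids[:remaining])
    return job_ids
```

With a new controller the loop body runs once; with a legacy controller it runs `num_jobs` times, which matches the old per-job behavior.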
I've also modified consolidation mode to support concurrent launch (previously it would create the tasks but fail to schedule them).
Testing
- I added a new smoke test that ensures that `$SKYPILOT_JOB_RANK` is properly set
- Added a new test to ensure that the launch time with `--num-jobs` is shortened
Remaining Work
- I still need to run backward compatibility testing with my smoke tests, for both codegen and gRPC, against an old jobs controller
- This also seems to have uncovered another issue: if you submit a large number of jobs, it takes a while for the jobs controller to move them over to pending. Unfortunately, we can't cancel the jobs until they are pending, so you effectively have to wait until they have all moved over. Cancellations in the interim only cancel the jobs that are already pending.
Performance
Launching 100 concurrent jobs used to take us 4.5 minutes and now takes 30 seconds!
https://github.com/user-attachments/assets/807f6cce-2d4c-4b7e-b994-cf041bd7a3bc
Update:
We have added a new backend call that instructs the jobs controller to set the job info and set the job to pending. The implications are:
- Autostop is now supported since we aren’t calling add_job and leaving jobs in INIT
- The jobs scheduler no longer has the set job info and pending calls I added
- The cancellation issue I raised earlier is no longer a problem because jobs start in pending
- We revert the changes to `add_job`
This also addresses https://github.com/skypilot-org/skypilot/issues/6932
Tested (run the relevant ones):
- [ ] Code formatting: install pre-commit (auto-check on commit) or `bash format.sh`
- [ ] Any manual or new tests for this PR (please specify below)
- [ ] All smoke tests: `/smoke-test` (CI) or `pytest tests/test_smoke.py` (local)
- [ ] Relevant individual tests: `/smoke-test -k test_name` (CI) or `pytest tests/test_smoke.py::test_name` (local)
- [ ] Backward compatibility: `/quicktest-core` (CI) or `pytest tests/smoke_tests/test_backward_compat.py` (local)
/smoke-test
/smoke-test --managed-jobs
/quicktest-core
/quicktest-core --base-branch 0.10.3
/quicktest-core --base-branch 0.9.3
/quicktest-core --base-branch 0.8.1
/smoke-test --managed-jobs --kubernetes
/smoke-test -k test_pools --kubernetes
/smoke-test -k test_managed_jobs_storage --kubernetes
/smoke-test -k test_managed_jobs_basic --aws
/smoke-test --managed-jobs --kubernetes
/smoke-test --managed-jobs --kubernetes --jobs-consolidation
/smoke-test -k test_pools --kubernetes
/smoke-test -k test_pools --kubernetes --jobs-consolidation