lloyd-brown
This PR fixes an issue in consolidation mode where a job’s file mounts would fail during a rolling update. The problem is that when running...
This PR lets users specify particular exit codes that, when encountered, cause the job to be recovered automatically. ```yaml # This YAML will cause the job...
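The exit-code rule described above can be pictured as a small membership check. The following is only a minimal sketch under stated assumptions: `RecoveryPolicy`, `recover_on_exit_codes`, and `should_recover` are hypothetical names chosen for illustration, not identifiers from SkyPilot or from this PR, and the exit code 33 is an arbitrary example value.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class RecoveryPolicy:
    """Hypothetical policy holding the user-specified exit codes."""

    # Exit codes that should trigger automatic recovery (assumed field name).
    recover_on_exit_codes: FrozenSet[int] = frozenset()

    def should_recover(self, exit_code: int) -> bool:
        # A successful exit (0) is never treated as a recovery trigger here.
        if exit_code == 0:
            return False
        # Recover only when the code is one the user explicitly listed.
        return exit_code in self.recover_on_exit_codes


# Illustrative usage: 33 is a made-up code a user might configure.
policy = RecoveryPolicy(recover_on_exit_codes=frozenset({33}))
```

With this sketch, `policy.should_recover(33)` is true, while any unlisted non-zero code is treated as an ordinary failure.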
Tested (run the relevant ones): - [ ] Code formatting: install pre-commit (auto-check on commit) or `bash format.sh` - [ ] Any manual or new tests for this PR (please...
The change introduced in https://github.com/skypilot-org/skypilot/pull/8192, which enables multiple jobs per worker through resource-aware scheduling, does not work for scheduling memory on clouds other than Kubernetes because the launched resources...
[Pools]
We previously had an issue where, if Kubernetes was still loading, the cloud section would show 0 clouds enabled before eventually showing the true number of clouds enabled. We add a...
This PR addresses an issue seen in Nebius where jobs would fail with `RuntimeError: Failed to initialize database due to a timeout when trying to acquire the lock at /home/ubuntu/.sky/locks/.state_db.lock....
This PR addresses a problem where changing the number of workers in a pool leads to us downing workers that are actively running jobs. Here is the working example: we...
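One way to avoid downing busy workers when a pool shrinks is to pick removal candidates only from idle workers. The sketch below illustrates that idea and is not necessarily how this PR implements it; `Worker`, `running_jobs`, and `select_workers_to_down` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Worker:
    """Hypothetical view of a pool worker."""

    name: str
    running_jobs: int  # number of jobs currently running on this worker


def select_workers_to_down(workers: List[Worker],
                           target_size: int) -> List[Worker]:
    """Pick workers to remove when shrinking the pool to target_size.

    Only idle workers are eligible, so the pool may temporarily stay
    above target_size until busy workers finish their jobs.
    """
    excess = max(0, len(workers) - target_size)
    idle = [w for w in workers if w.running_jobs == 0]
    # Never return more workers than the excess, and never a busy one.
    return idle[:excess]


# Illustrative usage: shrink a 3-worker pool to 1 while two are busy.
pool = [Worker("w1", 2), Worker("w2", 0), Worker("w3", 1)]
to_down = select_workers_to_down(pool, target_size=1)
```

In this example only `w2` is selected, since `w1` and `w3` are still running jobs; the remaining excess would have to be reclaimed later.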
## Problem Concurrently launching multiple jobs on pools is currently slow and failure-prone. The slowness is primarily due to us unnecessarily duplicating a lot of steps...