lloyd-brown

Results 9 issues of lloyd-brown

This PR fixes an issue we had with consolidation mode where a job’s file mounts would fail to work when performing a rolling update. The problem is that when running...

This PR adds support for users to specify particular exit codes that, when encountered, should cause the job to be automatically recovered. ```yaml # This YAML will cause the job...

Tested (run the relevant ones): - [ ] Code formatting: install pre-commit (auto-check on commit) or `bash format.sh` - [ ] Any manual or new tests for this PR (please...

The change introduced in https://github.com/skypilot-org/skypilot/pull/8192 to enable multiple jobs per worker by being resource aware does not work for scheduling memory on clouds other than Kubernetes because the launched resources...

Tested (run the relevant ones): - [ ] Code formatting: install pre-commit (auto-check on commit) or `bash format.sh` - [ ] Any manual or new tests for this PR (please...

We previously had an issue where, while Kubernetes was still loading, the cloud section would show 0 clouds enabled before eventually showing the true number of clouds enabled. We add a...

This PR addresses an issue seen in Nebius where jobs would fail with `RuntimeError: Failed to initialize database due to a timeout when trying to acquire the lock at /home/ubuntu/.sky/locks/.state_db.lock....

This PR addresses a problem where changing the number of workers in a pool leads to us downing workers that are actively running jobs. Here is the working example: we...

## Problem Concurrently launching multiple jobs on pools is currently slow and failure-prone. The slowness is primarily due to us unnecessarily duplicating a lot of steps...