Andy Lee
Extract repeated task failure state update logic into `_update_failed_task_state()` helper to reduce code duplication. Good catch! Could submit this change to master first. _Originally posted by @cblmemo in https://github.com/skypilot-org/skypilot/pull/4128#discussion_r1817404695_ Tested...
### Background

Initially, a process pool was used in `server.controller` to handle replica launching and termination due to limitations with redirecting `sys.stdout` and `sys.stderr` in multi-threaded environments. The global nature...
Currently, replicas can get stuck during shutdown, and recovering them requires manually intervening in the code logic. This is very unfriendly to users. We propose implementing a timeout mechanism for replica termination. Replicas that...
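A minimal sketch of what a termination timeout could look like, not the actual SkyPilot implementation. The function name `terminate_with_timeout` and the status strings are hypothetical, and the real controller may use processes rather than threads:

```python
import concurrent.futures
import time
from typing import Callable


def terminate_with_timeout(terminate_fn: Callable[[], None],
                           timeout_seconds: float) -> str:
    """Run terminate_fn and report a (hypothetical) status string.

    Note: a Python thread cannot be killed, so on timeout the worker keeps
    running in the background; we only stop waiting for it.
    """
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(terminate_fn)
    try:
        future.result(timeout=timeout_seconds)
        return 'TERMINATED'
    except concurrent.futures.TimeoutError:
        # Termination did not finish in time; flag it instead of hanging.
        return 'TERMINATION_TIMED_OUT'
    except Exception:
        # terminate_fn itself raised an error.
        return 'TERMINATION_FAILED'
    finally:
        # Do not block on the worker thread if it is still running.
        executor.shutdown(wait=False)


if __name__ == '__main__':
    print(terminate_with_timeout(lambda: None, timeout_seconds=1.0))
    print(terminate_with_timeout(lambda: time.sleep(0.5), timeout_seconds=0.05))
```

Returning a status instead of blocking lets the caller decide what to do with a replica that never finished shutting down, for example marking it for later cleanup.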
Our optimizer currently removes sink and dummy nodes without checking if they're still depended on by other nodes. This violates DAG rules and forces us to use a workaround in...
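One way to picture the missing check is the sketch below. It is an illustration only: the graph is a plain dict of successor sets rather than the optimizer's real DAG type, `safe_remove` is a hypothetical name, and "depended on" is assumed to mean that some edge still touches the node.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # node -> set of successor nodes


def safe_remove(graph: Graph, node: Hashable) -> bool:
    """Remove node only if no edge still references it.

    Returns True if the node was removed, False if it was left in place.
    """
    has_outgoing = bool(graph.get(node))
    has_incoming = any(node in successors
                       for other, successors in graph.items()
                       if other != node)
    if has_outgoing or has_incoming:
        # Another node is still connected to this one; removing it would
        # leave dangling edges in the DAG.
        return False
    graph.pop(node, None)
    return True


if __name__ == '__main__':
    dag: Graph = {'a': {'b'}, 'b': set(), 'c': set()}
    print(safe_remove(dag, 'c'))  # 'c' is isolated
    print(safe_remove(dag, 'b'))  # 'b' is still referenced by 'a'
```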
# Implement Automated Weekly Smoke Tests

## Problem

Currently, smoke tests for SkyPilot (implemented in `test_smoke.py`) are being run manually. This process could be improved by automating these tests on...
Fixes #4112

Implement an automated weekly smoke test run using GitHub Actions leveraging the existing `tests/test_smoke.py` script.

* Add a new GitHub Actions workflow file `.github/workflows/weekly-smoke-tests.yml`.
* Schedule the workflow...