
SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.

Results 530 skypilot issues

With the new provisioner, we are able to fail over across availability zones quickly. Thus, we do not need to assign multiple zones for AWS during each provisioning attempt....

Stale

- [x] Docs for the spot pipeline
- [ ] Story for how to debug the pipeline. Currently, if the 4th task failed, the user has to restart the entire...

enhancement
Stale

RuntimeError: Failed to SSH to 38.80.122.92 after timeout 600s, with Error: /etc/ssh/ssh_config: line 26: Bad configuration option: permitrootlogin

_Version & Commit info:_
* `sky -v`: skypilot, version 0.5.0
* `sky...

~~Blocked by #3696, #3700~~

## Single-node

**master** ([05ce5e9](https://github.com/skypilot-org/skypilot/commit/05ce5e999a5c4218d267481ebddac7967dce1897))

```
multitime -n 5 sky launch --cloud azure -y --cpus 2 --down
        Mean     Std.Dev.  Min      Median   Max
real    220.920  6.553     213.297  219.030...
```

Please consider implementing this for compute instances provided by OVH public cloud. Although they do not provide spot instances, the limited edition instances by OVH can be used as...

clouds

Currently, we have a lock for each submission of the spot job; we should make it more efficient. One way to test this is to submit more than 100 spot...

friction-log
spot

To reproduce: `sky launch -c test-k8s --cloud kubernetes "conda install -c conda-forge google-cloud-sdk" -y`

bug

Currently, getting the IP of a cluster in the Python API is rather complicated:

```python
ip = sky.status('cluster-name')[0]['handle'].external_ip()
```

Similarly for the endpoint of a service:

```python
service_records = sky.serve.status('code-llama')...
```

feature-request

We should consider adding support for AMD GPUs, which have been tested to be efficient for ML workloads.

References:
- https://www.amd.com/en/technologies/deep-machine-learning
- https://www.lamini.ai/blog/lamini-amd-paving-the-road-to-gpu-rich-enterprise-llms
- https://blog.mlc.ai/2023/08/09/Making-AMD-GPUs-competitive-for-LLM-inference
- https://www.mosaicml.com/blog/amd-mi250

enhancement

The following command goes ahead and launches the cluster, but fails after the cluster is launched. We should check that the source in filemounts exists before launching the...

friction-log
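
A pre-launch check of the kind proposed in the file-mount issue above might look like the sketch below. It is not SkyPilot's implementation: it assumes file mounts arrive as a plain mapping from remote target path to source, and that sources starting with a cloud or HTTP scheme should be skipped. The name `missing_local_sources` and the prefix list are hypothetical.

```python
import os
from typing import Dict, List

# Assumption: sources with these prefixes are remote and are not checked locally.
_REMOTE_PREFIXES = ('s3://', 'gs://', 'https://', 'http://')


def missing_local_sources(file_mounts: Dict[str, str]) -> List[str]:
    """Return the local sources in file_mounts that do not exist on disk.

    file_mounts maps a remote target path to a source (assumed shape).
    """
    missing = []
    for source in file_mounts.values():
        if source.startswith(_REMOTE_PREFIXES):
            continue  # remote source, nothing to check on the local disk
        if not os.path.exists(os.path.expanduser(source)):
            missing.append(source)
    return missing
```

Calling this before provisioning and aborting when the returned list is non-empty would surface the error without paying for a cluster launch first.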