skypilot
SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
With the new provisioner, we are able to fail over different availability zones pretty fast. Thus, we do not need to assign multiple zones for AWS during each provisioning attempt....
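The one-zone-per-attempt idea in this excerpt could look roughly like the sketch below. It is only an illustration and not SkyPilot's provisioner: `provision_with_failover` and the `try_zone` callback are hypothetical names, and the callback stands in for whatever actually requests instances in a zone.

```python
from typing import Callable, Iterable, Optional


def provision_with_failover(
    zones: Iterable[str],
    try_zone: Callable[[str], bool],
) -> Optional[str]:
    """Try zones one at a time and return the first zone that succeeds.

    `try_zone` is a hypothetical callback that returns True when
    provisioning in the given zone succeeded.
    """
    for zone in zones:
        # Each attempt targets exactly one zone; on failure, move to the next.
        if try_zone(zone):
            return zone
    # Every zone was tried and none succeeded.
    return None
```

Because each attempt is assumed to fail fast, looping over single zones keeps the logic simple while still covering every zone in the region.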
- [x] Docs for the spot pipeline
- [ ] Story for how to debug the pipeline. Currently, if the 4th task failed, the user has to restart the entire...
RuntimeError: Failed to SSH to 38.80.122.92 after timeout 600s, with Error: /etc/ssh/ssh_config: line 26: Bad configuration option: permitrootlogin

_Version & Commit info:_
* `sky -v`: skypilot, version 0.5.0
* `sky...
~~Blocked by #3696, #3700~~ ## Single-node **master** ([05ce5e9](https://github.com/skypilot-org/skypilot/commit/05ce5e999a5c4218d267481ebddac7967dce1897)) ``` multitime -n 5 sky launch --cloud azure -y --cpus 2 --down Mean Std.Dev. Min Median Max real 220.920 6.553 213.297 219.030...
Please consider implementing this for compute instances provided by OVH public cloud. Although they do not provide spot instances, the limited edition instances by OVH can be used as...
Currently, we have a lock for each submission of the spot job; we should make it more efficient. One way to test this is to submit more than 100 spot...
To reproduce: `sky launch -c test-k8s --cloud kubernetes "conda install -c conda-forge google-cloud-sdk" -y`
Currently, getting the IP of a cluster through the Python API is rather complicated: ```python ip = sky.status('cluster-name')[0]['handle'].external_ip() ``` Similarly for the endpoint of a service: ```python service_records = sky.serve.status('code-llama')...
We should consider adding support for AMD GPUs, which have been tested to be efficient for ML workloads. References: https://www.amd.com/en/technologies/deep-machine-learning https://www.lamini.ai/blog/lamini-amd-paving-the-road-to-gpu-rich-enterprise-llms https://blog.mlc.ai/2023/08/09/Making-AMD-GPUs-competitive-for-LLM-inference https://www.mosaicml.com/blog/amd-mi250
The following command goes ahead and launches the cluster, but fails after the cluster is launched. We should check that the source in filemounts exists before launching the...
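A pre-launch check of this kind might be sketched as follows. This is an assumption-laden illustration, not SkyPilot's code: `missing_local_sources`, the `{destination: source}` mapping shape, and the `REMOTE_PREFIXES` list are all hypothetical choices made for the example.

```python
import os
from typing import Dict, List

# Prefixes treated as remote sources in this sketch (an assumption,
# not a list taken from SkyPilot).
REMOTE_PREFIXES = ("s3://", "gs://", "r2://", "https://")


def missing_local_sources(file_mounts: Dict[str, str]) -> List[str]:
    """Return local sources in a {destination: source} mapping that do not exist."""
    missing = []
    for _destination, source in file_mounts.items():
        if source.startswith(REMOTE_PREFIXES):
            # Remote sources cannot be verified against the local filesystem.
            continue
        if not os.path.exists(os.path.expanduser(source)):
            missing.append(source)
    return missing
```

Running such a check before any provisioning request would let the command fail immediately with a list of missing paths, instead of after the cluster is already up.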