skypilot icon indicating copy to clipboard operation
skypilot copied to clipboard

SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.

Results 530 skypilot issues
Sort by recently updated
recently updated
newest added

Our current optimizer does not consider regions. That makes the data movement between different regions not taken into consideration, especially when we are trying to have chain dags or recovery...

enhancement

`download_remote_dir` doesn't work. Printed error: ```python Traceback (most recent call last): File "test.py", line 3, in list(storage.stores.values())[0].download_remote_dir('') File "/Users/woosuk/workspace/sky-proj/sky/sky/data/storage.py", line 178, in download_remote_dir iterator = self._remote_filepath_iterator() AttributeError: 'S3Store' object has...

From @concretevitamin https://github.com/sky-proj/sky/pull/792/files#diff-269b4c3571c0c71d64c2b827b9223a1dfc49a0c21a99bbb396295c520d46ddd2R13 I ran this, and saw ``` (sky-98fa-zongheng pid=875760) I 05-03 19:53:23 storage.py:913] Created S3 bucket checkpoint-bucket-bert in us-east-2 (sky-98fa-zongheng pid=875760) I 05-03 19:53:23 cloud_vm_ray_backend.py:1222] Creating a new...

It would be great to have a tutorial in our document for using TPU nodes and TPU VMs in our document. Currently, we only have a single line in the...

P0

I found the azure cluster sometimes (about 2/5 chances) fails due to `Error: Head node fetch timed out. Failed to create head node.`. It may be a ray-related bug, but...

For many internal operators, we will call the `refresh_cluster_status_handle` to get the actual status of the cluster to fail early. Currently, we only apply the actual refresh when the `autostop`...

investigate

The original code tried to reuse a cached head node ip address, but this does not actually work well when multiple processes are accessing the same node. Also the reusing...

From Ritwik's onborading: "it’s counter-intuitive to me `sky status` doesn’t refresh by default." We can make it refresh correctly by talking with the cloud provider. Currently refresh simply pings -...

Initial-User-Issue
friction-log

Milestone "100 jobs": Launch 100 jobs, each taking its own cluster. ```bash #!/bin/bash set -ex for i in $(seq 100); do sky launch -y -d --cloud=aws -c node-"$i" --gpus=K80 >node-"$i".log...

Replication: Run `sky launch -c hello storage.yaml` with a filemount, while the Storage object is syncing, run `sky storage delete ...`