skypilot
SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface.
Kevin's question: I asked for a V100, but Sky kept spending minutes on regions where I don't have quota. Is there any way to specify regions for Sky to prioritize? (my...
The logging messages for the docker backend are quite redundant. For example, when running `examples/minimal.yaml`, Sky prints ```bash I 03-16 23:15:55 local_docker_backend.py:223] Image minimal found. Running container now. use_gpu is...
While debugging a multi-node GCP cluster launch, I encountered the following error, which may indicate that GCP enforces a per-minute request quota. ```...
Currently Sky only supports TPU nodes, but an investigation by @infwinston and me using our JAX experiments showed that Google is pushing all support to TPU VMs (see: https://cloud.google.com/blog/products/compute/introducing-cloud-tpu-vms) for...
We currently do not support provisioning multiple TPUs for a cluster; only the head node has access to the provisioned TPU.
I have a Docker cluster in my cluster table and quit the Docker app on my laptop. When I try to run `sky down test-docker`, the following error appears. We...
From Daniel >It's been a little while now, and the confirmation prompts have not grown on me. I think a confirmation prompt makes sense when you run a command and...
I've seen the Ray autoscaler waste 15+ minutes multiple times during spot recovery while trying to SSH into a dead VM on GCP. This significantly delays spot recovery (from...
Very occasionally, the AWS / Ray launcher will error with "Failed to create security group", and this line is very hard to find in a wall of callback: ``` I 05-12...