Zhanghao Wu
This PR adds experimental support for chain DAGs, adapted from the original #267.
We need to use the service account credentials to manage cloud storage. Otherwise, we cannot upload data to an existing bucket and hit the following error. ``` Caught non-retryable...
When something goes wrong with a spot job, it would be nice to be able to log into the spot cluster to take a look at the problem. As proposed...
[Low priority] It would be great to have a place where users can easily view the resource utilization (e.g. CPU, GPU, memory) of each cluster, e.g....
I encountered the following issue when running these commands: ``` sky launch -c test-tpu --gpus=tpu-v3-8 '' -y sky stop test-tpu sky down test-tpu -y ``` ``` E 09-07 23:33:31...
Since the Ray autoscaler empties the `~/.config/gcloud/configurations/config_default` file uploaded by the file_mounts, there is no active GCP account on the spot controller. Though the launching and termination of the...
Our task submission can hit transient network issues while rsyncing the generated Python program to the remote cluster. We can add retries for those operations.
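A minimal sketch of what such a retry could look like, assuming the rsync step is run as a subprocess and that a non-zero exit code is treated as a possibly transient failure. The helper name `run_with_retry` and its parameters are hypothetical and are not taken from the codebase.

```python
import subprocess
import sys
import time
from typing import Callable, List


def run_with_retry(cmd: List[str],
                   max_attempts: int = 3,
                   initial_backoff: float = 1.0,
                   sleep: Callable[[float], None] = time.sleep
                   ) -> subprocess.CompletedProcess:
    """Run `cmd`, retrying on a non-zero exit code with exponential backoff.

    Hypothetical helper: every non-zero exit is assumed to be retryable.
    """
    backoff = initial_backoff
    for attempt in range(1, max_attempts + 1):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            return proc
        if attempt < max_attempts:
            # Wait before the next attempt, doubling the delay each time.
            sleep(backoff)
            backoff *= 2
    raise RuntimeError(
        f"Command failed after {max_attempts} attempts "
        f"(exit code {proc.returncode}): {proc.stderr.strip()}")


# Demo with a command that always succeeds. For the rsync case, pass
# something like ["rsync", "-az", "<local_path>", "<user>@<host>:<path>"].
result = run_with_retry([sys.executable, "-c", "print('ok')"])
print(result.stdout.strip())
```

Retrying on every non-zero exit keeps the sketch short; a real implementation would probably retry only on exit codes that indicate network errors, so that permanent failures surface immediately.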
Currently, our code sets the spot job status of PENDING jobs to STARTING before setting it to CANCELLED, which can be a...
The `cryptography` package used in `authentication.py` does not support `openssl>3.0`. It seems a newly created conda environment can have `openssl>3.0` by default, which causes the following error for sky...
A pilot user requested more simultaneous spot jobs than the current limit of 16, which comes from our setting of 0.5 CPU per job on the spot controller. We can start...