Zhanghao Wu
It would be great if we could let the admin deploy the on-prem cluster with this! When I was trying to launch my on-prem server with a newly launched AWS...
With #911, we added a per-cluster status lock for every cluster status cache update. Maybe we can use that lock to get rid of the race condition here?
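As a rough illustration of the idea, here is a minimal sketch of a per-cluster lock that serializes status updates across processes. It is not the lock added in #911: the names `LOCK_DIR` and `cluster_status_lock` are hypothetical, and it assumes a POSIX system where `fcntl.flock` is available.

```python
import contextlib
import fcntl
import os
import tempfile

# Hypothetical location for lock files; one file per cluster.
LOCK_DIR = os.path.join(tempfile.gettempdir(), 'cluster_status_locks')


@contextlib.contextmanager
def cluster_status_lock(cluster_name):
    """Hold an exclusive lock for one cluster while its status is updated."""
    os.makedirs(LOCK_DIR, exist_ok=True)
    lock_path = os.path.join(LOCK_DIR, f'{cluster_name}.lock')
    with open(lock_path, 'w') as lock_file:
        # Blocks until no other holder has the lock for this cluster.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Any code path that reads and then rewrites a cluster's cached status would wrap that sequence in `with cluster_status_lock(name):`, so two such paths cannot interleave for the same cluster while different clusters stay independent.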
For the list of strings, maybe we can take a look at how Ray implements it in `DockerCommandRunner`: https://github.com/ray-project/ray/blob/92781c603e4fe02af986d879a007b9e905d9c65a/python/ray/autoscaler/_private/updater.py#L460-L479 and https://github.com/ray-project/ray/blob/92781c603e4fe02af986d879a007b9e905d9c65a/python/ray/autoscaler/_private/command_runner.py#L625
> Seeing the following issue. The controller is managing a running spot job. I ran `sky autostop --all -i 1`. Then, the controller has been stopped, despite the job still...
> UX comments for `sky spot status`:
>
> 1. SUBMITTED and STARTED become the same very soon (e.g., `7 hrs ago`). One way is to show detailed timestamps (2022-05-05...
> See #640 for a possible solution? The solution may not be sufficient, as we cannot change the `availability_zones` field to the zone where the old cluster was on (otherwise,...
> @Michaelvll Just to clarify the behavior of Ray, when we `ray up` a stopped VM in `us-east-1b` with
>
> ```
> availability_zone: us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f
> ```
>
> and...
One solution is to update the instance's `ray-launch-hash` tag to the hash of a config that contains only the `AvailabilityZone` the launched instance actually belongs to, after the instance is...
It seems `RequestLimitExceeded` also happens on GCP, as mentioned in #586. We may need to find a way to retry it.
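One common way to retry a rate-limit error is exponential backoff with jitter. The sketch below is only an illustration: the `RequestLimitExceeded` exception class and the `retry_with_backoff` helper are hypothetical stand-ins, not the actual provider exception or any existing code in the project.

```python
import random
import time


class RequestLimitExceeded(Exception):
    """Hypothetical stand-in for the cloud provider's rate-limit error."""


def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on RequestLimitExceeded with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RequestLimitExceeded:
            if attempt == max_attempts - 1:
                # Out of attempts: surface the error to the caller.
                raise
            # Double the wait on each attempt and add jitter so that
            # concurrent callers do not all retry at the same moment.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            sleep(delay)
```

The `sleep` parameter is injectable so the helper can be tested without actually waiting.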
Another problem: when `~/.ssh/` has the wrong permissions, `sky launch` gets stuck in `launching` for a very long time. We may need to fail early for that...
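A fail-early check could look roughly like the sketch below. It assumes that "wrong permissions" means the directory is accessible by group or others (that is, anything looser than `0700`); the comment above does not say which permission was wrong, so treat that as an assumption. The helper name `ssh_dir_permission_problem` is hypothetical.

```python
import os
import stat


def ssh_dir_permission_problem(path='~/.ssh'):
    """Return an error message if `path` is open to group/others, else None."""
    path = os.path.expanduser(path)
    if not os.path.isdir(path):
        # Nothing to check; a missing directory is a separate problem.
        return None
    mode = stat.S_IMODE(os.stat(path).st_mode)
    # Assumption: any group or other permission bit counts as "wrong".
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        return (f'{path} has mode {oct(mode)}; expected 0o700 '
                f'(try: chmod 700 {path})')
    return None
```

Running this before provisioning and aborting with the returned message would surface the problem immediately instead of after a long wait in `launching`.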