Zhanghao Wu comments

Results: 315 comments of Zhanghao Wu

[Demo] Sky local mode using onprem + docker

It would be great if we could let the admin deploy the on-prem cluster with this! When I was trying to launch my on-prem server with a newly launched AWS...

Cache head node ip address

With #911, we added a per-cluster status lock for any cluster status cache update. Maybe we can use that lock to get rid of the race condition here?

Problems in using LocalDockerBackend for debugging setup

For the list of strings, maybe we can take a look at how Ray implements it in `DockerCommandRunner`. https://github.com/ray-project/ray/blob/92781c603e4fe02af986d879a007b9e905d9c65a/python/ray/autoscaler/_private/updater.py#L460-L479 https://github.com/ray-project/ray/blob/92781c603e4fe02af986d879a007b9e905d9c65a/python/ray/autoscaler/_private/command_runner.py#L625

[Managed Spot] Features required in the managed spot

> Seeing the following issue. The controller is managing a running spot job. I ran `sky autostop --all -i 1`. Then, the controller has been stopped, despite the job still...

[Managed Spot] Features required in the managed spot

> UX comments for `sky spot status`:
>
> 1. SUBMITTED and STARTED become the same very soon (e.g., `7 hrs ago`). One way is to show detailed timestamps (2022-05-05...

sky start may unexpectedly launch a new VM

> See #640 for a possible solution? The solution may not be sufficient, as we cannot change the `availability_zones` field to the zone where the old cluster was on (otherwise,...

sky start may unexpectedly launch a new VM

> @Michaelvll Just to clarify the behavior of Ray, when we `ray up` a stopped VM in `us-east-1b` with
>
> ```
> availability_zone: us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1e,us-east-1f
> ```
>
> and...

sky start may unexpectedly launch a new VM

One solution is to update the instance's `ray-launch-hash` tag to that of a config that only has the `AvailabilityZone` the launched instance actually belongs to, after the instance is...

[100 jobs] Abnormal failover leading to duplicate instances

It seems that `RequestLimitExceeded` also happens on GCP, as mentioned in #586. We may need to find a way to retry it.
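One common way to retry rate-limit errors is exponential backoff with jitter. The sketch below is only an illustration under assumptions: the `RequestLimitExceeded` class is a local stand-in rather than a real provider exception, and `retry_with_backoff` and its parameters are hypothetical names.

```python
import random
import time


class RequestLimitExceeded(Exception):
    """Stand-in for a provider rate-limit error (hypothetical class)."""


def retry_with_backoff(fn, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on RequestLimitExceeded with growing delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RequestLimitExceeded:
            # Give up and re-raise once the last attempt has failed.
            if attempt == max_attempts - 1:
                raise
            # Delay doubles each attempt; jitter spreads out concurrent retries.
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            sleep(delay)
```

The jitter matters when many jobs are launched at once, since otherwise they would all retry at the same moments and hit the limit again.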

Failed to ssh into a cluster launched by SkyPilot

Another problem: when `~/.ssh/` has the wrong permissions, `sky launch` gets stuck in `launching` for a very long time. We may need to fail early for that...
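A fail-early check could look roughly like the sketch below. It assumes that "wrong permissions" means the directory is accessible by group or others, which the comment does not specify; `check_ssh_dir_permissions` is a hypothetical helper, not an existing SkyPilot function.

```python
import os
import stat


def check_ssh_dir_permissions(path="~/.ssh"):
    """Raise PermissionError if the directory grants any group/other access."""
    ssh_dir = os.path.expanduser(path)
    # Keep only the permission bits of the directory's mode.
    mode = stat.S_IMODE(os.stat(ssh_dir).st_mode)
    # Assumption: any group or other bit counts as a wrong permission.
    if mode & (stat.S_IRWXG | stat.S_IRWXO):
        raise PermissionError(
            f"{ssh_dir} has mode {oct(mode)}; "
            "expected no group/other access (e.g. 0o700)."
        )
```

Running this before provisioning would surface the problem immediately instead of after a long wait in `launching`.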