Wei-Lin Chiang
Bug report from Daniel: `nvidia-smi` doesn't work on `sky gpunode --cloud gcp`. ``` NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA...
This PR enables TPU Pod usage. To switch from a single TPU to a TPU Pod, the user only needs to change `accelerators: tpu-v2-8` to `accelerators: tpu-v2-32`. `sky launch` and `sky exec` will...
We hit this issue during the onboarding session with Wilson. The cause is that the SSH config added by SkyPilot was overwritten by a global rule set by the user by...
Mentioned by @Michaelvll in https://github.com/skypilot-org/skypilot/pull/1014#discussion_r940537044: the scenario below may trigger unexpected behavior of `sky start`. 1. User runs `sky launch gpunode` and gets a VM in `us-west-1a` 2. User `sky stop`...
This PR aims to fix https://github.com/sky-proj/sky/issues/849 by patching the Ray autoscaler. Details are explained in the issue. We think a simple approach is to patch Ray's autoscaler by adding `self.provider.non_terminated_nodes({})` before https://github.com/ray-project/ray/blob/6d978ab10ec65da1018790f8605b5b8946e838e5/python/ray/autoscaler/_private/updater.py#L272....
As discussed in https://github.com/sky-proj/sky/issues/700, T4 seems to be a better choice as the default GPU node. Please comment below if you have any additional thoughts. Copied from the discussion. >...
This is to avoid issues like https://github.com/sky-proj/sky/issues/879 in the future. We didn't catch it because our tests use Python 3.6, and with that version the latest protobuf won't be installed.
Feedback from Daniel: "only three AWS regions are supported." We may consider supporting more regions to increase Sky's availability. For example, [P3 instance](https://aws.amazon.com/ec2/instance-types/p3/) is available in 14 regions including some...
Kevin's question: I asked for V100 but Sky kept spending minutes on regions where I don't have quota. Is there any way to specify regions for Sky to prioritize? (my...
I've seen multiple times that the Ray autoscaler wastes 15+ minutes during spot recovery trying to SSH into a dead VM on GCP. This adds a significant delay to spot recovery (from...