Wei-Lin Chiang

Results 111 comments of Wei-Lin Chiang

Closing this issue as TPU VM is now supported. https://skypilot.readthedocs.io/en/latest/reference/tpu.html

Logs from the conda test:
```
Collecting protobuf>=3.15.3
  Downloading protobuf-4.21.1-cp37-abi3-manylinux2014_x86_64.whl (407 kB)
...
Collecting protobuf>=3.15.3
  Downloading protobuf-3.20.1-cp37-cp37m-manylinux_2_5_x86_64.manylinux1_x86_64.whl (1.0 MB)
```
Looks like two packages require protobuf and for some reason,...

Ah, now our protobuf fix is merged. I'm not sure how to test it with an older master commit. Does anyone know how to do it? ``` Collecting protobuf

@Michaelvll yeah, we can do that. But I guess the first few SSH attempts would typically always fail (e.g., the AWS instance below, which I tried just now) so that wouldn’t...

We can reduce `READY_CHECK_INTERVAL` if we think `self.provider.non_terminated_nodes({})` is taking time. But I guess at most we can only save 1-2s out of 1-2mins?
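A rough sketch of why the saving is bounded, assuming the readiness check polls on a fixed interval starting at time zero. The function name `detection_delay` and the numbers used below (a 5-second interval, a node ready after 91 seconds) are made up for illustration and are not taken from the Ray source.

```python
import math


def detection_delay(ready_at_s: float, check_interval_s: float) -> float:
    """Seconds between the node becoming ready and the next poll noticing it.

    Assumes polls happen at 0, check_interval_s, 2 * check_interval_s, ...
    (a simplification; real polling also spends time inside each check).
    """
    polls_needed = math.ceil(ready_at_s / check_interval_s)
    return polls_needed * check_interval_s - ready_at_s


# Hypothetical numbers: the node becomes ready 91 s after launch.
slow = detection_delay(91, 5)  # poll every 5 s
fast = detection_delay(91, 1)  # poll every 1 s
print(f"saved {slow - fast} s out of roughly 91 s")
```

Under these assumptions the delay is always smaller than one polling interval, so shrinking the interval can recover at most a few seconds of a wait that is dominated by provisioning time.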

> Sorry, my question was not clear enough. I was not asking about `ray up` for the first time, but `ray up` on the existing cluster. For example, `sky launch`...

I've tested the patch and it works well. Before, every spot TPU preemption took 15 mins to recover; with this patch it takes 5 mins. I'll update the code line numbers after https://github.com/skypilot-org/skypilot/pull/1133...

> Should we update "By default, use 1 K80..." here https://sky-proj-sky.readthedocs-hosted.com/en/latest/reference/interactive-nodes.html#interactive-nodes ?

Fixed. Thanks for catching this!

> A minor wording thing:
>
> > The cheapest AWS(g4dn.xlarge, {'T4': 1})...

> I've always used `p` instances for my workloads (even before I used sky), and now I have to additionally request quotas for `g` instances too to get `gpunode` to...

This is important. Will work on this during bug squash tomorrow.