Seung Jin comments

Results: 36 comments by Seung Jin

How to pack Skypilot jobs and clusters onto GPU nodes with Kubernetes?

Another case to consider: in a multi-instance cluster scenario, we want different pods belonging to the same cluster to be scheduled on different nodes if possible.
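One way to express "different nodes if possible" in Kubernetes is a soft (preferred) pod anti-affinity rule. The sketch below only builds the `affinity` fragment of a pod spec as a plain dict. The function name, the label key `example.com/cluster-name`, and the cluster name are hypothetical placeholders, not SkyPilot's actual labels or code.

```python
def soft_anti_affinity(label_key, cluster_name, weight=100):
    """Build a pod-spec `affinity` fragment that prefers, but does not
    require, spreading pods with the same cluster label across nodes."""
    return {
        "podAntiAffinity": {
            # "preferred" makes this a soft rule: the scheduler may still
            # co-locate pods when no other node fits.
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,  # Kubernetes accepts 1 to 100
                    "podAffinityTerm": {
                        "labelSelector": {
                            "matchExpressions": [
                                {
                                    "key": label_key,
                                    "operator": "In",
                                    "values": [cluster_name],
                                }
                            ]
                        },
                        # One topology domain per node.
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ]
        }
    }


# Hypothetical label key and cluster name, for illustration only.
affinity = soft_anti_affinity("example.com/cluster-name", "my-cluster")
```

A hard rule (`requiredDuringSchedulingIgnoredDuringExecution`) would instead leave pods pending when there are fewer nodes than pods, which does not match the "if possible" wording above.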

Fix flaky of `test_multi_echo` -- change sshd config to support large number of jobs

Looks pretty good to me! I'll approve once the tests are passing (which they should after a rebase) and the use of `reload` instead of `restart` is considered and decided...

[k8s] Force terminate misbehaving pods

For personal reference, how `--grace-period=0 --force` translates into the Python API: https://github.com/kubernetes-client/python/issues/508

[k8s] Force terminate misbehaving pods

It _seems_ like a nonzero `grace_period` might not act like `--force` based on [here](https://github.com/kubernetes-client/python/issues/508#issuecomment-1695759777) and [here](https://github.com/kubernetes/kubectl/blob/826006cdb947f80a679ff1eb3cb53f183a6a9bf2/pkg/cmd/delete/delete.go#L285-L286) - is there a reason 10 seconds was chosen as the grace period?
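As a rough illustration of the point above, here is a minimal sketch of an immediate delete through the official Python client. It assumes, following the linked issue, that `grace_period_seconds=0` is the closest client-side analogue of `--grace-period=0 --force`. The helper name, pod name, and namespace are hypothetical, and the actual API call is left in a comment because it needs a live cluster.

```python
def force_delete_kwargs(pod_name, namespace):
    """Build keyword arguments for an immediate pod deletion.

    A zero grace period is used on purpose: per the discussion linked
    above, a nonzero value may not behave like `kubectl delete --force`.
    """
    return {
        "name": pod_name,
        "namespace": namespace,
        "grace_period_seconds": 0,
    }


# Hypothetical pod name and namespace.
kwargs = force_delete_kwargs("my-pod", "default")

# With the `kubernetes` package installed and a kubeconfig available
# (not executed here):
#   from kubernetes import client, config
#   config.load_kube_config()
#   client.CoreV1Api().delete_namespaced_pod(**kwargs)
```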

[k8s][gcp] Accept non-k8s TPU names

Yes! I've assigned you this issue, feel free to give it a go.

[k8s] idea: allow an accelerator to map to multiple label values

/smoke-test --kubernetes

[k8s] idea: allow an accelerator to map to multiple label values

> nice, like it. do we have any test coverage specifically on the labels/nodeselector code?

No on unit tests, because the code paths here do need a k8s cluster to interact...

[UX][k8s] show-gpus for all allowed contexts

Re: UI, I do agree on having a table showing aggregated GPU availability across all clusters. I actually think such a table should be at the top, because the current UI...

[UX][k8s] show-gpus for all allowed contexts

I actually think that if a node doesn't contain any GPU, then it shouldn't show up in the table.

[storage] storage upload failing due to "Argument list too long"

Tried reproducing this with:
```
file_mounts:
  /cloudflare:
    name:
    source: ~/yamls
    store: r2
    mode: MOUNT
```
This command actually errors out for me with `upload failed: ../../yamls/cloudflare.yaml to s3:///cloudflare.yaml An error...