Zhanghao Wu
A cloud logging service is a great idea! For distributed jobs, we can save the aggregated driver log on the head node to the logging service.
This is also related to #3013
It might work with `-p`?
@Maknee did we test this PR with the NCCL test? There seems to be no NCCL-test-related file in this PR.
Hi @wemoveon2, thank you for your interest! It would be awesome if you could help add support for vast.ai. The following are some clouds we recently added: Cudo: https://github.com/skypilot-org/skypilot/pull/2975 Fluidstack:...
Added in #4365! Many thanks to @kristopolous for the great work! Closing this issue now.
Thanks for requesting this feature @zaptrem! I re-opened the issue.
I am trying the latest master, which includes this PR. The UX for a single k8s with GPUs seems a bit weird to me: 1. Should we avoid the `\n\n`...
This PR also seems to cause a backward compatibility issue, because the return value of the request from the API server has changed. ``` _get_kubernetes_realtime_gpu_table gpu_availability = models.RealtimeGpuAvailability( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ TypeError: RealtimeGpuAvailability.__new__() missing...
More UX feedback: it seems we currently show:
```
Total table
Context 1 GPUs
Context 1 Nodes
Context 2 GPUs
Context 2 Nodes
...
```
For SkyPilot users, GPUs are...