Zhanghao Wu

Results 315 comments of Zhanghao Wu

A cloud logging service is a great idea! For distributed jobs, we can save the aggregated driver log on the head node to the logging service.

@Maknee did we test this PR with the NCCL test? There seems to be no NCCL-test-related file in this PR.

Hi @wemoveon2, thank you for your interest! It would be awesome if you could help add support for vast.ai. The following are some clouds we recently added: Cudo: https://github.com/skypilot-org/skypilot/pull/2975 Fluidstack:...

Added in #4365! Many thanks to @kristopolous for the great work! Closing this issue now.

Thanks for requesting this feature @zaptrem! I re-opened the issue.

I am trying the latest master with this PR in. The UX for a single k8s with GPUs seems a bit weird to me: 1. Should we avoid the `\n\n`...

This PR also seems to cause a backward compatibility issue, because the return value of the request from the API server has changed. ``` _get_kubernetes_realtime_gpu_table gpu_availability = models.RealtimeGpuAvailability( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ TypeError: RealtimeGpuAvailability.__new__() missing...

More UX feedback: it seems we currently show: ``` Total table Context 1 GPUs Context 1 Nodes Context 2 GPUs Context 2 Nodes ... ``` For SkyPilot users, GPUs are...