Thor Wu

Results 169 comments of Thor Wu

@haiker2011 Thanks for your report. So the current logic is not robust, in that it is not aware of why the job was restarted (due to a manual operation or a job failure), right?

@ldd91 May I kindly ask whether there is any progress on this issue?

@sunyulin728 Hi, I think I understand your scenario now, and I agree that the requirement is reasonable. Unfortunately, the `binpack` plugin only considers the...

Sounds interesting! But it may be more complex than the given design. For example, network delay varies between different nodes, and it also varies over time for the same node. Maybe considering...

Thanks for your report and debugging. The analysis is helpful, and we will fix it as soon as possible.

Requesting more input on how much memory should be treated as one block (the default is 1 MB) so that the value is suitable for all specified GPU cards.

> 100MB per block may work fine. Inference services usually cost hundreds to thousands of MB of memory (training services usually cost much more than this scale), so we actually do not care...
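To make the trade-off concrete, here is a minimal sketch (not the project's actual code; `blocks_needed` is a hypothetical helper) showing how requests get rounded up to whole blocks, which is why block size matters: a larger block simplifies accounting but wastes up to one block per request.

```python
def blocks_needed(request_mb: int, block_mb: int) -> int:
    """Whole blocks needed to cover a memory request (ceiling division)."""
    return -(-request_mb // block_mb)

# A 1500 MB request needs 1500 blocks at 1 MB granularity,
# but only 15 blocks at 100 MB granularity.
print(blocks_needed(1500, 1))    # 1500
print(blocks_needed(1500, 100))  # 15
# Rounding up can waste up to block_mb - 1 MB per request:
print(blocks_needed(1501, 100))  # 16 (99 MB of the last block unused)
```

Since inference workloads typically request hundreds to thousands of MB, the waste from a 100 MB block stays small relative to the request size, which is the argument for a coarser default.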

> Is this issue resolved at present?

Not yet. We are considering a graceful way to make the fix without modifying the gRPC directly.

> Any update for this issue?

Not yet. I'm sorry, I have been developing another feature recently. Will fix it ASAP.