Wang Zhang

Results 35 comments of Wang Zhang

Could you get the job from the APIServer via `kubectl` instead of the source file you posted? There might be some legacy jobs or a mutating webhook that makes the job in the cluster...
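For reference, a minimal sketch of fetching the live object rather than reading the local file. The resource kind `mpijob`, the job name `my-job`, and the namespace `training` are placeholders and not taken from the thread; the helper names are hypothetical too.

```python
import subprocess


def kubectl_get_cmd(kind, name, namespace="default"):
    """Build the kubectl command that prints the live object as YAML."""
    return ["kubectl", "get", kind, name, "-n", namespace, "-o", "yaml"]


def fetch_live_manifest(kind, name, namespace="default"):
    """Run kubectl and return the manifest as stored by the API server.

    Requires kubectl on PATH and a working kubeconfig.
    """
    result = subprocess.run(
        kubectl_get_cmd(kind, name, namespace),
        check=True,           # raise if kubectl exits non-zero
        capture_output=True,  # collect stdout/stderr instead of printing
        text=True,            # decode output as str
    )
    return result.stdout


# Example (placeholder names):
# print(fetch_live_manifest("mpijob", "my-job", "training"))
```

Comparing this output with the posted source file should show any fields that were added or changed inside the cluster.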

I'm afraid I cannot link this mpijob to the error message at the top unless there are other mpijobs.

So looking forward to this feature!

Hi @sjeaugey, would you mind offering a checklist to clear the path to use NVLink for 2 containers within the same node? This is kind of critical for us...

> @ctuluhu @troycheng @unclepeddy one way to mitigate the problem is to use environment flag `TF_FORCE_GPU_ALLOW_GROWTH=true` when you launch your model server. It'll grab minimum required GPU memory at startup...
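A minimal sketch of setting that flag from a Python launcher, assuming the model server is started as a child process. The helper name `growth_env` is hypothetical, and the commented launch command is only an illustration, not the setup from the thread.

```python
import os


def growth_env(base=None):
    """Return a copy of the environment with TF_FORCE_GPU_ALLOW_GROWTH=true."""
    env = dict(os.environ if base is None else base)
    env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return env


# Illustrative launch (assumed command line, adjust to your deployment):
# import subprocess
# subprocess.run(["tensorflow_model_server", "--port=8500"], env=growth_env())
```

The variable has to be in the server process's environment before TensorFlow initializes the GPU, which is why it is passed at launch rather than set afterwards.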