qifengz
How can I pass horovodrun parameters such as --host-discovery-script and --min-np when using the mpirun command?
@gaocegege I want to use it in an MPIJob. I tried this: `mpirun --allow-run-as-root -np 1 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl...
I ran into the same issue.
@asahalyft I think the key part of the error log is "the server could not find the requested resource (put mpijobs.kubeflow.org ...". It means subresources are missing in the CRD...
I hit this issue too. I added some debug logs, shown below: I0122 04:41:31.480076 13774 tree.go:119] Update device information I0122 04:41:31.486222 13774 tree.go:135] node 0, pid: [], memory: 0,...
After I removed these four lines, it works normally! [https://github.com/tkestack/gpu-manager/blob/808ff8c29a361f04499ff62242cd56e4f93089f6/pkg/services/allocator/nvidia/allocator.go#L452-L455](https://github.com/tkestack/gpu-manager/blob/808ff8c29a361f04499ff62242cd56e4f93089f6/pkg/services/allocator/nvidia/allocator.go#L452-L455)
@fighterhit @HeroBcat Got it, that's helpful!
@mYmNeo I also have this issue. As shown above, “tencent.com/vcuda-memory: 32” means 8192MiB, but 16154MiB is actually used. So is it not possible to limit GPU memory alone?
Update: when I set tencent.com/vcuda-core: 99, it limits GPU memory as expected. Why not 100?
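For anyone checking the numbers in the vcuda-memory comment above, here is a minimal sketch of the arithmetic. It assumes one tencent.com/vcuda-memory unit corresponds to 256 MiB, which is what "32 means 8192MiB" implies. The helper name is hypothetical and is not part of gpu-manager.

```python
# Assumption: one tencent.com/vcuda-memory unit is 256 MiB,
# inferred from "32 means 8192MiB" in the comment above.
MIB_PER_VCUDA_MEMORY_UNIT = 256


def vcuda_memory_to_mib(units: int) -> int:
    """Convert a vcuda-memory request to MiB (hypothetical helper)."""
    return units * MIB_PER_VCUDA_MEMORY_UNIT


requested_mib = vcuda_memory_to_mib(32)  # limit implied by the request
observed_mib = 16154                     # usage reported in the comment above

# True means the container used more than the requested limit.
limit_exceeded = observed_mib > requested_mib
print(requested_mib, observed_mib, limit_exceeded)
```

With the reported values, the observed usage is roughly twice the requested limit, which is why the limit appears not to be enforced in that configuration.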