qifengz

14 comments by qifengz

How to pass horovodrun parameters such as --host-discovery-script and --min-np when using the mpirun command?

@gaocegege I want to use it in an MPIJob. I tried it like this: `mpirun --allow-run-as-root -np 1 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl...

@asahalyft I think the key part of the error log is "the server could not find the requested resource (put mpijobs.kubeflow.org ...". It means the CRD is missing subresources...

I ran into this issue too. I added some debug logs, shown below: I0122 04:41:31.480076 13774 tree.go:119] Update device information I0122 04:41:31.486222 13774 tree.go:135] node 0, pid: [], memory: 0,...

After I removed these four lines, it works normally! [https://github.com/tkestack/gpu-manager/blob/808ff8c29a361f04499ff62242cd56e4f93089f6/pkg/services/allocator/nvidia/allocator.go#L452-L455](https://github.com/tkestack/gpu-manager/blob/808ff8c29a361f04499ff62242cd56e4f93089f6/pkg/services/allocator/nvidia/allocator.go#L452-L455)

@mYmNeo I also have this issue. As shown above, `tencent.com/vcuda-memory: 32` means 8192 MiB, but 16154 MiB is actually in use. So is it not possible to limit GPU memory alone?

Update: when I set `tencent.com/vcuda-core: 99`, GPU memory is limited as expected. Why does it not work with 100?
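
For anyone checking the numbers in the vcuda-memory comment, here is a minimal sketch of the arithmetic. It assumes one `tencent.com/vcuda-memory` unit corresponds to 256 MiB, which is inferred from the 32 -> 8192 MiB figure quoted in that comment. The function names are hypothetical helpers, not part of gpu-manager.

```python
# Assumption: one tencent.com/vcuda-memory unit equals 256 MiB
# (inferred from the comment: 32 units -> 8192 MiB).
MIB_PER_VCUDA_MEMORY_UNIT = 256


def vcuda_memory_to_mib(units: int) -> int:
    """Convert a vcuda-memory request (in units) to MiB."""
    return units * MIB_PER_VCUDA_MEMORY_UNIT


def exceeds_limit(units: int, used_mib: int) -> bool:
    """Return True if the observed usage is above the requested limit."""
    return used_mib > vcuda_memory_to_mib(units)


if __name__ == "__main__":
    requested_units = 32   # value from the comment
    observed_mib = 16154   # usage reported in the comment
    print(vcuda_memory_to_mib(requested_units))
    print(exceeds_limit(requested_units, observed_mib))
```

With the values from the comment, the observed usage (16154 MiB) is roughly double the requested limit, which is why the memory limit appears not to be enforced in that configuration.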