Anindya Saha
Sure, @terrytangyuan. The source code referred to in my test YAML posted above is from the examples directory of the Horovod project itself: https://github.com/horovod/horovod/blob/master/examples/tensorflow2/tensorflow2_keras_mnist.py P.S. I observe this...
@carmark @terrytangyuan @gaocegege As I read through the controller code, I understood from https://github.com/kubeflow/mpi-operator/blob/75f424a802dafb3662bc5c76b8f3c3cb60127fac/pkg/controllers/v1/mpi_job_controller.go#L471 that this is where the syncing logic is written to kill the worker pods when the MPIJob...
Also, for context/completeness, I am installing only the mpi operator component of Kubeflow and not the entire Kubeflow installation. I applied the mpi controller as follows: ``` kustomize build manifests-v1.1.0-branch/mpi-job/mpi-operator/overlays/application/...
@carmark @terrytangyuan @gaocegege Following up on this thread to check if you could 👀 and help in investigating the issue.
Thanks @qifengz. @carmark @chongchuanbing Weird. I have moved away from Kubeflow kustomize and am applying the operator YAML directly now. The feedback on this issue has been extremely slow, so I...
Hi @Minyus, were you able to run PyTorch or PyTorch Lightning using the MPI operator with Horovod? Could you please share the YAML that you applied?
@tgaddair Hi Travis, would you have some thoughts/insights on this?
Yes @tgaddair. When I specify only the GPU, e.g.
```
resources:
  limits:
    nvidia.com/gpu: 1
```
I am definitely getting the GPUs, and training takes only `3secs per epoch`. I have...
@tgaddair sure, let me do that experiment and come back.
Hi @tgaddair, I conducted two experiments. Here are the detailed experimental results: **TL;DR:** In both cases GPUs are detected. I have tested the same tf2 script in both cases. However,...