Anindya Saha

Lyft Inc. San Francisco, CA

Results: 16 comments of Anindya Saha

MPIJob Pods shows status RUNNING despite MPIJob Completed

Sure, @terrytangyuan. The source code referred to in my test YAML posted above is from the examples directory of the horovod project itself: https://github.com/horovod/horovod/blob/master/examples/tensorflow2/tensorflow2_keras_mnist.py P.S. I observe this...

MPIJob Pods shows status RUNNING despite MPIJob Completed

@carmark @terrytangyuan @gaocegege As I read through the controller code, I understood from https://github.com/kubeflow/mpi-operator/blob/75f424a802dafb3662bc5c76b8f3c3cb60127fac/pkg/controllers/v1/mpi_job_controller.go#L471 that this is where the syncing logic is written to kill the worker pods when the MPIJob...

MPIJob Pods shows status RUNNING despite MPIJob Completed

Also, for context/completeness, I am installing only the MPI operator component of Kubeflow, not the full Kubeflow installation. I applied the MPI controller as follows: ``` kustomize build manifests-v1.1.0-branch/mpi-job/mpi-operator/overlays/application/...

MPIJob Pods shows status RUNNING despite MPIJob Completed

@carmark @terrytangyuan @gaocegege Following up on this thread to check if you could take a 👀 and help investigate the issue.

MPIJob Pods shows status RUNNING despite MPIJob Completed

Thanks @qifengz. @carmark @chongchuanbing Weird. I have moved away from Kubeflow kustomize and am applying the operator YAML directly now. The feedback on this issue has been extremely slow, so I...

Request: Add a PyTorch example

Hi @Minyus, were you able to run PyTorch or PyTorch Lightning using the MPI operator with Horovod? Could you please share the YAML that you applied?

TF2 Jobs latches on to CPUs if both CPU and GPU are provided in the container resource requests/limit section

@tgaddair Hi Travis, would you have some thoughts/insights on this?

TF2 Jobs latches on to CPUs if both CPU and GPU are provided in the container resource requests/limit section

Yes @tgaddair. When I specify only the GPU, e.g.
```
resources:
  limits:
    nvidia.com/gpu: 1
```
I am definitely getting the GPUs, and training takes only `3secs per epoch`. I have...

TF2 Jobs latches on to CPUs if both CPU and GPU are provided in the container resource requests/limit section

@tgaddair sure, let me do that experiment and come back.

TF2 Jobs latches on to CPUs if both CPU and GPU are provided in the container resource requests/limit section

Hi @tgaddair, I conducted two experiments. Here are the detailed experimental results: **TL;DR:** In both cases GPUs are detected. I have tested the same TF2 script in both cases. However,...
