Rong Ou

41 comments by Rong Ou

I really don't know much about PMIx. If you are interested, you can try to prototype a solution. Right now we start the worker pods and sleep; the launcher then...

If you write your checkpoints and event files to a shared location (NFS, S3, GCS, etc.), you can just point TensorBoard to it. There is TensorBoard support in Pipelines: https://www.kubeflow.org/docs/pipelines/sdk/output-viewer/#tensorboard,...

If you look at https://github.com/kubeflow/mpi-operator/blob/master/deploy/2-rbac.yaml, it shows all the permissions you need.

lgtm
On Wed, Mar 9, 2022 at 1:22 PM Thea Lamkin wrote:
> The best place to submit is as a PR to the following directory:
> https://github.com/kubeflow/community/tree/master/proposals
> ...

You need to set `cleanPodPolicy` to clean up the worker pods. See https://github.com/kubeflow/mpi-operator/blob/master/examples/v1alpha2/tensorflow-benchmarks.yaml#L7.
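For illustration, here is a rough Python sketch of where `cleanPodPolicy` might sit in an MPIJob manifest. It assumes the field lives directly under `spec`, as in the linked v1alpha2 example, and that the accepted values are `None`, `Running`, and `All`; check this against the operator version you run. The helper `make_mpijob`, the job name, and the image are hypothetical placeholders, not part of mpi-operator.

```python
import json

# Values assumed to be accepted by cleanPodPolicy (verify for your operator version).
ALLOWED_POLICIES = {"None", "Running", "All"}


def make_mpijob(name, image, workers, clean_pod_policy="Running"):
    """Build a minimal MPIJob manifest as a plain dict (hypothetical helper)."""
    if clean_pod_policy not in ALLOWED_POLICIES:
        raise ValueError(f"unsupported cleanPodPolicy: {clean_pod_policy!r}")

    def replica(count):
        # One container per pod; real jobs usually also need a command, resources, etc.
        return {
            "replicas": count,
            "template": {"spec": {"containers": [{"name": name, "image": image}]}},
        }

    return {
        "apiVersion": "kubeflow.org/v1alpha2",
        "kind": "MPIJob",
        "metadata": {"name": name},
        "spec": {
            "slotsPerWorker": 1,
            # Controls which pods are cleaned up once the job finishes.
            "cleanPodPolicy": clean_pod_policy,
            "mpiReplicaSpecs": {
                "Launcher": replica(1),
                "Worker": replica(workers),
            },
        },
    }


# Placeholder name and image, for illustration only.
manifest = make_mpijob("example-job", "example.com/hypothetical/image:latest", workers=2)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON as well as YAML.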

FWIW I think it's a good idea. The current solution has always been a "temporary" one; eventually it'd be nice to implement a "native" solution (e.g. through PMIx #12).

That's strange. Did you do anything with the cluster during those 3 hours?

Do you have the logs from `mpi-operator`?

@omesser you are welcome to send us a pull request. :)

The launcher's behavior is similar to a batch job. It looks like the log issue is with the batch job itself. If you run a standalone batch job, you'd run into the...