mpi-operator kubeflow.org/v1 didn't create pods when running examples

Open heyfey opened this issue 3 years ago • 7 comments

Hi, I'm trying the examples with kubeflow.org/v1:

kubectl create -f examples/v1/tensorflow-benchmarks.yaml

kubectl create -f examples/horovod/tensorflow_mnist.py

Both commands created an MPIJob, but no pods were created, and the jobs just hang:

(base) heyfey@gpu3:~/mpi-operator$ kubectl get mpijob
NAME                    AGE
tensorflow-benchmarks   12m
tensorflow-mnist        61m
(base) heyfey@gpu3:~/mpi-operator$ kubectl get pods
No resources found in default namespace.

However, when I tried kubeflow.org/v1alpha2 with

kubectl create -f examples/v1alpha2/tensorflow-benchmarks.yaml

everything worked as expected.

Am I doing something wrong? Thanks

Mar 20 '21 16:03 heyfey

Could you do a kubectl describe on your mpijobs?

Mar 20 '21 19:03 terrytangyuan

(base) heyfey@gpu3:~/mpi-operator$ kubectl describe mpijob tensorflow-benchmarks
Name:         tensorflow-benchmarks
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         MPIJob
Metadata:
  Creation Timestamp:  2021-03-20T15:37:12Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:cleanPodPolicy:
        f:mpiReplicaSpecs:
          .:
          f:Launcher:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
          f:Worker:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
        f:slotsPerWorker:
    Manager:         kubectl-create
    Operation:       Update
    Time:            2021-03-20T15:37:12Z
  Resource Version:  2919585
  UID:               cae4e071-d8ed-42ee-ad46-03f7967ebd08
Spec:
  Clean Pod Policy:  Running
  Mpi Replica Specs:
    Launcher:
      Replicas:  1
      Template:
        Spec:
          Containers:
            Command:
              mpirun
              --allow-run-as-root
              -np
              2
              -bind-to
              none
              -map-by
              slot
              -x
              NCCL_DEBUG=INFO
              -x
              LD_LIBRARY_PATH
              -x
              PATH
              -mca
              pml
              ob1
              -mca
              btl
              ^openib
              python
              scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              --model=resnet101
              --batch_size=64
              --variable_update=horovod
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks
    Worker:
      Replicas:  2
      Template:
        Spec:
          Containers:
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks
            Resources:
              Limits:
                nvidia.com/gpu:  1
  Slots Per Worker:              1
Events:                          <none>

(base) heyfey@gpu3:~/mpi-operator$ kubectl describe mpijob tensorflow-mnist
Name:         tensorflow-mnist
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         MPIJob
Metadata:
  Creation Timestamp:  2021-03-20T14:48:34Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:cleanPodPolicy:
        f:mpiReplicaSpecs:
          .:
          f:Launcher:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
          f:Worker:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
        f:slotsPerWorker:
    Manager:         kubectl-create
    Operation:       Update
    Time:            2021-03-20T14:48:34Z
  Resource Version:  2914822
  UID:               c645f41b-d9d7-45f8-84ba-e5c6a98b6241
Spec:
  Clean Pod Policy:  Running
  Mpi Replica Specs:
    Launcher:
      Replicas:  1
      Template:
        Spec:
          Containers:
            Args:
              -np
              2
              --allow-run-as-root
              -bind-to
              none
              -map-by
              slot
              -x
              LD_LIBRARY_PATH
              -x
              PATH
              -mca
              pml
              ob1
              -mca
              btl
              ^openib
              python
              /examples/tensorflow_mnist.py
            Command:
              mpirun
            Image:  docker.io/kubeflow/mpi-horovod-mnist
            Name:   mpi-launcher
            Resources:
              Limits:
                Cpu:     1
                Memory:  2Gi
    Worker:
      Replicas:  2
      Template:
        Spec:
          Containers:
            Image:  docker.io/kubeflow/mpi-horovod-mnist
            Name:   mpi-worker
            Resources:
              Limits:
                Cpu:     2
                Memory:  4Gi
  Slots Per Worker:      1
Events:                  <none>

Mar 21 '21 02:03 heyfey

I have run into this issue occasionally before. @heyfey, would you mind providing the log of the MPI controller? If you need further help, please reach me via [email protected]

Mar 21 '21 03:03 zw0610
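
For readers following along, the controller log requested above could be fetched roughly as follows. This is a minimal sketch and was not posted in the thread: it assumes the operator runs as a Deployment named mpi-operator in the mpi-operator namespace. The function name operator_logs and the OPERATOR_NAMESPACE / OPERATOR_DEPLOYMENT variables are hypothetical helpers introduced only for illustration.

```shell
# Print the most recent log lines of the operator's controller.
# Assumption: the operator is a Deployment "mpi-operator" in namespace
# "mpi-operator"; override via the (hypothetical) environment variables below.
operator_logs() {
  kubectl logs -n "${OPERATOR_NAMESPACE:-mpi-operator}" \
    "deployment/${OPERATOR_DEPLOYMENT:-mpi-operator}" --tail="${1:-200}"
}
```

Calling `operator_logs 50` would print the last 50 lines. If the Deployment has a different name or namespace in your installation, set the two variables accordingly.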

It seems the deploy file for v1 sets the monitored namespace to mpi-operator: https://github.com/kubeflow/mpi-operator/blob/master/deploy/v1/mpi-operator.yaml#L198

While this configuration looks fine, it is not consistent with the examples, which are created in the default namespace.

@terrytangyuan do you think we should remove the --namespace arguments?

Mar 21 '21 05:03 zw0610
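
One way to check for the namespace mismatch described above is to read the --namespace value from the running operator and compare it with the namespace of the MPIJob. The sketch below is not from the thread. It assumes the operator is a Deployment named mpi-operator in the mpi-operator namespace, that the operator is the first container in the pod, and that the flag appears either as `--namespace <value>` or `--namespace=<value>`. The function name watched_namespace is hypothetical.

```shell
# Print the value of the operator's --namespace argument, if any.
# Assumptions: Deployment "mpi-operator" in namespace "mpi-operator",
# with the operator as the first container in the pod template.
watched_namespace() {
  kubectl get deployment mpi-operator -n mpi-operator \
    -o jsonpath='{range .spec.template.spec.containers[0].args[*]}{@}{"\n"}{end}' |
    awk '
      # Form 1: --namespace=<value> in a single argument.
      /^--namespace=/ && !found { sub(/^--namespace=/, ""); print; found = 1 }
      # Form 2: --namespace followed by <value> as the next argument.
      prev == "--namespace" && !found { print; found = 1 }
      { prev = $0 }'
}
```

An empty result means the flag was not found in the arguments. Otherwise, compare the printed value with the NAMESPACE column of `kubectl get mpijob --all-namespaces`; in the report above, the jobs were in default while the deploy file pointed at mpi-operator.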

There is also an arg for the lock namespace, so we may want to modify both of them. When users install the MPI operator, they should be aware of these args and modify them if needed, though. Perhaps we need to add some notes to the installation documentation and more informative logging to the controller (instead of removing the args)?

Mar 21 '21 13:03 terrytangyuan

Yes, a note specifying how to configure the mpi-operator deployment would be much better.

Mar 22 '21 02:03 zw0610

I ran into this today. ^_^ In addition to adding more informative logging and a note, I think the namespace also needs to be added to examples/v1/tensorflow-benchmarks.yaml. Otherwise everyone has to modify the YAML.

Mar 31 '21 13:03 denkensk

I ran into this issue today. I have tried two methods: 1) running kubectl apply -f examples/v1/tensorflow-benchmarks.yaml -n mpi-operator, and 2) adding "namespace: mpi-operator" to the YAML file. Neither works. Please give me more suggestions. Thanks.

Mar 22 '23 03:03 inspurasc

v1 is out of support in this repo now.

Please upgrade to v2beta1 or open an issue in kubeflow/training-operator

/close

Mar 22 '23 13:03 alculquicondor

@alculquicondor: Closing this issue.

Mar 22 '23 13:03 google-oss-prow[bot]
