kubeflow.org/v1 didn't create pods when running examples

Open heyfey opened this issue 3 years ago • 7 comments

Hi, I'm trying examples with kubeflow.org/v1

kubectl create -f examples/v1/tensorflow-benchmarks.yaml
kubectl create -f examples/horovod/tensorflow-mnist.yaml

Both commands created an MPIJob, but no pods were created and the jobs just hang.

(base) heyfey@gpu3:~/mpi-operator$ kubectl get mpijob
NAME                    AGE
tensorflow-benchmarks   12m
tensorflow-mnist        61m
(base) heyfey@gpu3:~/mpi-operator$ kubectl get pods
No resources found in default namespace.

However, when I tried kubeflow.org/v1alpha2 with

kubectl create -f examples/v1alpha2/tensorflow-benchmarks.yaml

everything worked as expected.

Am I doing something wrong? Thanks

heyfey avatar Mar 20 '21 16:03 heyfey

Could you do a kubectl describe on your mpijobs?

terrytangyuan avatar Mar 20 '21 19:03 terrytangyuan

(base) heyfey@gpu3:~/mpi-operator$ kubectl describe mpijob tensorflow-benchmarks
Name:         tensorflow-benchmarks
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         MPIJob
Metadata:
  Creation Timestamp:  2021-03-20T15:37:12Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:cleanPodPolicy:
        f:mpiReplicaSpecs:
          .:
          f:Launcher:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
          f:Worker:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
        f:slotsPerWorker:
    Manager:         kubectl-create
    Operation:       Update
    Time:            2021-03-20T15:37:12Z
  Resource Version:  2919585
  UID:               cae4e071-d8ed-42ee-ad46-03f7967ebd08
Spec:
  Clean Pod Policy:  Running
  Mpi Replica Specs:
    Launcher:
      Replicas:  1
      Template:
        Spec:
          Containers:
            Command:
              mpirun
              --allow-run-as-root
              -np
              2
              -bind-to
              none
              -map-by
              slot
              -x
              NCCL_DEBUG=INFO
              -x
              LD_LIBRARY_PATH
              -x
              PATH
              -mca
              pml
              ob1
              -mca
              btl
              ^openib
              python
              scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              --model=resnet101
              --batch_size=64
              --variable_update=horovod
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks
    Worker:
      Replicas:  2
      Template:
        Spec:
          Containers:
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks
            Resources:
              Limits:
                nvidia.com/gpu:  1
  Slots Per Worker:              1
Events:                          <none>
(base) heyfey@gpu3:~/mpi-operator$ kubectl describe mpijob tensorflow-mnist
Name:         tensorflow-mnist
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         MPIJob
Metadata:
  Creation Timestamp:  2021-03-20T14:48:34Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:cleanPodPolicy:
        f:mpiReplicaSpecs:
          .:
          f:Launcher:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
          f:Worker:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
        f:slotsPerWorker:
    Manager:         kubectl-create
    Operation:       Update
    Time:            2021-03-20T14:48:34Z
  Resource Version:  2914822
  UID:               c645f41b-d9d7-45f8-84ba-e5c6a98b6241
Spec:
  Clean Pod Policy:  Running
  Mpi Replica Specs:
    Launcher:
      Replicas:  1
      Template:
        Spec:
          Containers:
            Args:
              -np
              2
              --allow-run-as-root
              -bind-to
              none
              -map-by
              slot
              -x
              LD_LIBRARY_PATH
              -x
              PATH
              -mca
              pml
              ob1
              -mca
              btl
              ^openib
              python
              /examples/tensorflow_mnist.py
            Command:
              mpirun
            Image:  docker.io/kubeflow/mpi-horovod-mnist
            Name:   mpi-launcher
            Resources:
              Limits:
                Cpu:     1
                Memory:  2Gi
    Worker:
      Replicas:  2
      Template:
        Spec:
          Containers:
            Image:  docker.io/kubeflow/mpi-horovod-mnist
            Name:   mpi-worker
            Resources:
              Limits:
                Cpu:     2
                Memory:  4Gi
  Slots Per Worker:      1
Events:                  <none>

heyfey avatar Mar 21 '21 02:03 heyfey

I have run into this issue occasionally before. @heyfey Would you mind providing the log of the mpi controller? If you need further help, please reach me via [email protected]
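A minimal sketch of how the controller log could be collected. The Deployment name and namespace are passed as arguments; using `mpi-operator` for both is an assumption about how the operator was installed, not something confirmed in this thread. The helper name `controller_logs` is hypothetical.

```shell
# Print the most recent log lines of the operator's controller.
# $1: Deployment name (assumed "mpi-operator")
# $2: namespace the operator runs in (assumed "mpi-operator")
# $3: optional number of lines to show, defaults to 200
controller_logs() {
  kubectl logs "deployment/$1" -n "$2" --tail="${3:-200}"
}

# Usage against a live cluster:
#   controller_logs mpi-operator mpi-operator
```

If the Deployment has a different name, `kubectl get deployments --all-namespaces` lists the candidates.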

zw0610 avatar Mar 21 '21 03:03 zw0610

It seems the deploy file for v1 sets the monitored namespace to mpi-operator: https://github.com/kubeflow/mpi-operator/blob/master/deploy/v1/mpi-operator.yaml#L198

While this configuration looks fine by itself, it is not consistent with the examples, which are created in the default namespace. The controller only watches the mpi-operator namespace, so it never sees those jobs.
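One way to confirm the mismatch is to print the arguments the operator container was started with and compare the `--namespace` value with the namespace of the MPIJob. This is only a sketch: the helper name `operator_args` is hypothetical, and it assumes the operator runs as a Deployment whose first container is the controller.

```shell
# Print the args of the first container in the operator Deployment.
# $1: Deployment name (assumed "mpi-operator")
# $2: namespace the Deployment lives in (assumed "mpi-operator")
operator_args() {
  kubectl get deployment "$1" -n "$2" \
    -o jsonpath='{.spec.template.spec.containers[0].args}'
}

# Usage against a live cluster:
#   operator_args mpi-operator mpi-operator
#   kubectl get mpijob --all-namespaces   # shows where the jobs actually are
```

If the printed namespace differs from the one shown by `kubectl get mpijob --all-namespaces`, the controller is not watching the namespace the job was created in.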

@terrytangyuan do you think we should remove the --namespace arguments?

zw0610 avatar Mar 21 '21 05:03 zw0610

There is also an arg for the lock namespace, so we may want to modify both of them. When users install the MPI operator, they should be aware of these args and modify them if needed, though. Perhaps we should add some notes to the installation documentation and more informative logging to the controller (instead of removing the args)?

terrytangyuan avatar Mar 21 '21 13:03 terrytangyuan

Yes, a note explaining how to configure the mpi-operator deployment would be much better.

zw0610 avatar Mar 22 '21 02:03 zw0610

I ran into this today. ^_^ In addition to adding more informative logging and a note, I think the namespace should also be added to examples/v1/tensorflow-benchmarks.yaml. Otherwise everyone has to modify the yaml.
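Until the example carries a namespace, a possible workaround is to pass the watched namespace on the command line. The sketch below assumes the operator watches `mpi-operator`, as the deploy file discussed above suggests; the helper name `create_in_namespace` is hypothetical.

```shell
# Create a manifest in a given namespace, then list the MPIJobs and pods there.
# $1: path to the manifest
# $2: namespace the operator watches (assumed "mpi-operator")
create_in_namespace() {
  kubectl create -f "$1" -n "$2" && kubectl get mpijob,pods -n "$2"
}

# Usage against a live cluster:
#   create_in_namespace examples/v1/tensorflow-benchmarks.yaml mpi-operator
```

Pods may take a moment to appear, so the listing right after creation can still be empty. If nothing shows up after a while, the controller log is the next place to look.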

denkensk avatar Mar 31 '21 13:03 denkensk

I ran into this issue today. I have tried two methods: 1) running kubectl apply -f examples/v1/tensorflow-benchmarks.yaml -n mpi-operator, and 2) adding "namespace: mpi-operaor " to the yaml file. Neither works. Please give me more suggestions. Thanks.

inspurasc avatar Mar 22 '23 03:03 inspurasc

v1 is out of support in this repo now.

Please upgrade to v2beta1 or open an issue in kubeflow/training-operator

/close

alculquicondor avatar Mar 22 '23 13:03 alculquicondor

@alculquicondor: Closing this issue.

google-oss-prow[bot] avatar Mar 22 '23 13:03 google-oss-prow[bot]