mpi-operator
kubeflow.org/v1 didn't create pods when running examples
Hi, I'm trying the examples with kubeflow.org/v1:
kubectl create -f examples/v1/tensorflow-benchmarks.yaml
kubectl create -f examples/horovod/tensorflow_mnist.py
Both commands created an MPIJob, but no pods were created and the jobs just hang:
(base) heyfey@gpu3:~/mpi-operator$ kubectl get mpijob
NAME                    AGE
tensorflow-benchmarks   12m
tensorflow-mnist        61m
(base) heyfey@gpu3:~/mpi-operator$ kubectl get pods
No resources found in default namespace.
Meanwhile, when I tried kubeflow.org/v1alpha2 with
kubectl create -f examples/v1alpha2/tensorflow-benchmarks.yaml
everything worked as expected.
Am I doing something wrong? Thanks!
Could you do a kubectl describe on your mpijobs?
(base) heyfey@gpu3:~/mpi-operator$ kubectl describe mpijob tensorflow-benchmarks
Name:         tensorflow-benchmarks
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         MPIJob
Metadata:
  Creation Timestamp:  2021-03-20T15:37:12Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:cleanPodPolicy:
        f:mpiReplicaSpecs:
          .:
          f:Launcher:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
          f:Worker:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
        f:slotsPerWorker:
    Manager:         kubectl-create
    Operation:       Update
    Time:            2021-03-20T15:37:12Z
  Resource Version:  2919585
  UID:               cae4e071-d8ed-42ee-ad46-03f7967ebd08
Spec:
  Clean Pod Policy:  Running
  Mpi Replica Specs:
    Launcher:
      Replicas:  1
      Template:
        Spec:
          Containers:
            Command:
              mpirun
              --allow-run-as-root
              -np
              2
              -bind-to
              none
              -map-by
              slot
              -x
              NCCL_DEBUG=INFO
              -x
              LD_LIBRARY_PATH
              -x
              PATH
              -mca
              pml
              ob1
              -mca
              btl
              ^openib
              python
              scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              --model=resnet101
              --batch_size=64
              --variable_update=horovod
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks
    Worker:
      Replicas:  2
      Template:
        Spec:
          Containers:
            Image:  mpioperator/tensorflow-benchmarks:latest
            Name:   tensorflow-benchmarks
            Resources:
              Limits:
                nvidia.com/gpu:  1
  Slots Per Worker:  1
Events:  <none>
(base) heyfey@gpu3:~/mpi-operator$ kubectl describe mpijob tensorflow-mnist
Name:         tensorflow-mnist
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         MPIJob
Metadata:
  Creation Timestamp:  2021-03-20T14:48:34Z
  Generation:          1
  Managed Fields:
    API Version:  kubeflow.org/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:cleanPodPolicy:
        f:mpiReplicaSpecs:
          .:
          f:Launcher:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
          f:Worker:
            .:
            f:replicas:
            f:template:
              .:
              f:spec:
                .:
                f:containers:
        f:slotsPerWorker:
    Manager:         kubectl-create
    Operation:       Update
    Time:            2021-03-20T14:48:34Z
  Resource Version:  2914822
  UID:               c645f41b-d9d7-45f8-84ba-e5c6a98b6241
Spec:
  Clean Pod Policy:  Running
  Mpi Replica Specs:
    Launcher:
      Replicas:  1
      Template:
        Spec:
          Containers:
            Args:
              -np
              2
              --allow-run-as-root
              -bind-to
              none
              -map-by
              slot
              -x
              LD_LIBRARY_PATH
              -x
              PATH
              -mca
              pml
              ob1
              -mca
              btl
              ^openib
              python
              /examples/tensorflow_mnist.py
            Command:
              mpirun
            Image:  docker.io/kubeflow/mpi-horovod-mnist
            Name:   mpi-launcher
            Resources:
              Limits:
                Cpu:     1
                Memory:  2Gi
    Worker:
      Replicas:  2
      Template:
        Spec:
          Containers:
            Image:  docker.io/kubeflow/mpi-horovod-mnist
            Name:   mpi-worker
            Resources:
              Limits:
                Cpu:     2
                Memory:  4Gi
  Slots Per Worker:  1
Events:  <none>
This issue has come up for me occasionally before. @heyfey would you mind providing the log of the MPI controller? If you need further help, please reach me via [email protected]
It seems the deploy file for v1 specifies the monitored namespace to be mpi-operator: https://github.com/kubeflow/mpi-operator/blob/master/deploy/v1/mpi-operator.yaml#L198
While that configuration itself looks fine, it is inconsistent with the examples, which create resources in the default namespace.
@terrytangyuan do you think we should remove the --namespace argument?
There is also an arg for the lock namespace, so we may want to modify both of them. Users installing the MPI operator should be aware of these args and modify them if needed, though. Perhaps we should add a note to the installation documentation and more informative logging in the controller (instead of removing the args)?
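For reference, here is a sketch of what those args might look like on the controller container (the exact flag spellings and Deployment layout here are my assumption; check deploy/v1/mpi-operator.yaml for the real names):

```yaml
# Hypothetical excerpt of the controller container spec in deploy/v1/mpi-operator.yaml.
# -namespace tells the controller which namespace to watch for MPIJobs;
# the lock-namespace arg controls where the leader-election lock is kept.
containers:
  - name: mpi-operator
    args:
      - -namespace
      - mpi-operator       # change this to watch MPIJobs in another namespace
      - -lock-namespace
      - mpi-operator
```

With args like these, MPIJobs created in the default namespace are simply never seen by the controller, which matches the symptom above.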
Yes, a note explaining how to configure the mpi-operator deployment would be much better.
I ran into this today. ^_^
In addition to more informative logging and a note, I think adding the namespace to examples/v1/tensorflow-benchmarks.yaml is also needed; otherwise everyone has to modify the YAML themselves.
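Something like this in the example's metadata would make the job land in the namespace the controller watches by default (assuming the operator is deployed with its stock args; the added line is my suggestion, not what the example currently contains):

```yaml
# Hypothetical addition to examples/v1/tensorflow-benchmarks.yaml
apiVersion: kubeflow.org/v1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
  namespace: mpi-operator   # match the controller's monitored namespace
```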
I ran into this issue today. I tried two methods: 1) kubectl apply -f examples/v1/tensorflow-benchmarks.yaml -n mpi-operator; 2) adding "namespace: mpi-operator" to the YAML file. Neither worked. Please give me more suggestions. Thanks.
v1 is out of support in this repo now.
Please upgrade to v2beta1 or open an issue in kubeflow/training-operator
/close
@alculquicondor: Closing this issue.
In response to this:
v1 is out of support in this repo now.
Please upgrade to v2beta1 or open an issue in kubeflow/training-operator
/close
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.