
MPI-Operator run example failed

[Open] q443048756 opened this issue 2 years ago · 8 comments

I set up mpi-operator v0.4.0

and tried to deploy the example: mpi-operator-0.4.0/examples/v2beta1/tensorflow-benchmarks/tensorflow-benchmarks.yaml

My Kubernetes cluster has three nodes, each with an RTX 3060 graphics card,

but it does not seem to run correctly:

1. With the default configuration, I don't see any pods starting, so it appears to have failed:

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "2"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            - --model=resnet101
            - --batch_size=64
            - --variable_update=horovod
    Worker:
      replicas: 2
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            resources:
              limits:
                nvidia.com/gpu: 1
```

2. With replicas: 1, the pod starts normally; I suspect the job cannot use the GPUs across nodes:

```yaml
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: tensorflow-benchmarks
spec:
  slotsPerWorker: 1
  runPolicy:
    cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            command:
            - mpirun
            - --allow-run-as-root
            - -np
            - "1"
            - -bind-to
            - none
            - -map-by
            - slot
            - -x
            - NCCL_DEBUG=INFO
            - -x
            - LD_LIBRARY_PATH
            - -x
            - PATH
            - -mca
            - pml
            - ob1
            - -mca
            - btl
            - ^openib
            - python
            - scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
            - --model=resnet101
            - --batch_size=64
            - --variable_update=horovod
    Worker:
      replicas: 1
      template:
        spec:
          containers:
          - image: mpioperator/tensorflow-benchmarks:latest
            name: tensorflow-benchmarks
            resources:
              limits:
                nvidia.com/gpu: 1
```
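For the first case, where no pods appear at all, the pod phases and namespace events usually explain why (for example, unschedulable workers or insufficient nvidia.com/gpu). Below is a minimal diagnostic sketch using the official kubernetes Python client; the default namespace and the training.kubeflow.org/job-name label are assumptions that may differ in other setups.

```python
# Minimal diagnostic sketch (not part of the example): list the MPIJob's pods and
# related events. Assumes the `kubernetes` client package, the `default` namespace,
# and the training.kubeflow.org/job-name label used by recent mpi-operator releases.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

namespace = "default"                  # assumption: adjust to your namespace
job_name = "tensorflow-benchmarks"
selector = f"training.kubeflow.org/job-name={job_name}"  # assumption: label key may vary

pods = v1.list_namespaced_pod(namespace, label_selector=selector)
if not pods.items:
    print("No pods created for the MPIJob; check the mpi-operator controller logs.")
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
    for cond in pod.status.conditions or []:
        if cond.message:               # e.g. an "Insufficient nvidia.com/gpu" scheduling message
            print("  ", cond.type, cond.reason, cond.message)

# Namespace events often explain scheduling or webhook failures for the job's pods.
for ev in v1.list_namespaced_event(namespace).items:
    if job_name in (ev.involved_object.name or ""):
        print(ev.last_timestamp, ev.reason, ev.message)
```

Running kubectl describe on the launcher and worker pods shows the same conditions and events.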

3. After the pods start (with replicas: 1), the launcher reports this error:

```
2023-10-25 09:53:08.464568: E tensorflow/c/c_api.cc:2184] Internal: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid
Traceback (most recent call last):
  File "scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 73, in <module>
    app.run(main)  # Raises error on invalid flags, unlike tf.app.run()
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 61, in main
    params = benchmark_cnn.setup(params)
  File "/tensorflow/benchmarks/scripts/tf_cnn_benchmarks/benchmark_cnn.py", line 3538, in setup
    with tf.Session(config=create_config_proto(params)) as sess:
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py", line 1586, in __init__
    super(Session, self).__init__(target, graph, config=config)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow/python/client/session.py", line 701, in __init__
    self._session = tf_session.TF_NewSessionRef(self._graph._c_graph, opts)
tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:0 failed. Status: device kernel image is invalid

Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[12892,1],0]
  Exit code: 1
```
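The "device kernel image is invalid" status above is the classic symptom of a TensorFlow/CUDA build that ships no kernels for the GPU's compute capability (the RTX 3060 is compute capability 8.6, sm_86). A minimal check, assuming the TF 1.x API used by the mpioperator/tensorflow-benchmarks image, is to run something like this inside a worker container, with no Horovod or MPI involved:

```python
# Minimal sketch: check whether this TensorFlow build can see and initialize the GPU
# at all. Assumes the TF 1.x API shipped in the mpioperator/tensorflow-benchmarks image.
import tensorflow as tf
from tensorflow.python.client import device_lib

print("TF version:", tf.__version__)
print("Built with CUDA:", tf.test.is_built_with_cuda())

# If only the CPU shows up here, or listing devices fails with the same
# "device kernel image is invalid" status, the image's CUDA/TF build lacks
# kernels for this GPU's architecture (sm_86 for the RTX 3060), and rebuilding
# the image on a newer CUDA base (as suggested later in this thread) is the fix.
for d in device_lib.list_local_devices():
    print(d.name, d.device_type, d.physical_device_desc)
```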

q443048756 · Oct 25 '23 10:10

Does the example without GPUs work fine?

https://github.com/kubeflow/mpi-operator/blob/master/examples/v2beta1/pi/pi.yaml

tenzen-y · Oct 25 '23 13:10

Alternatively, did you install the nvidia drivers on the nodes?

alculquicondor · Oct 25 '23 14:10

Does the example without GPUs work fine?

https://github.com/kubeflow/mpi-operator/blob/master/examples/v2beta1/pi/pi.yaml

At the beginning, I used other methods to test GPU availability. The first point I mentioned is that the job cannot use GPUs across multiple nodes: with replicas: 2, the pods cannot start; with replicas: 1, the pod starts and runs normally.

https://github.com/kubeflow/mpi-operator/blob/master/examples/v2beta1/pi/pi.yaml: this example runs successfully.

q443048756 · Oct 26 '23 00:10

Alternatively, did you install the nvidia drivers on the nodes?

I installed the drivers and have tested them; the GPU works normally. The first point I mentioned is that the job cannot use GPUs across multiple nodes: with replicas: 2, the pods cannot start; with replicas: 1, the pod starts and runs normally.
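To separate "the worker pods cannot be scheduled" from "the processes cannot cooperate across nodes", the benchmark script can be swapped for a tiny Horovod check in the launcher's mpirun command. This is only a sketch assuming the same image and its Horovod/TF 1.x APIs; the file name hello_horovod.py is hypothetical:

```python
# hello_horovod.py (hypothetical file name): minimal cross-node sanity check.
# Run it with the same mpirun flags as the benchmark, replacing the
# tf_cnn_benchmarks.py arguments with this script.
import horovod.tensorflow as hvd
import tensorflow as tf

hvd.init()

# Standard Horovod TF 1.x pattern: pin each process to its node-local GPU.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

with tf.Session(config=config) as sess:
    # A one-element allreduce: if every rank prints the world size as the sum,
    # MPI/NCCL communication across nodes works and the problem lies elsewhere
    # (e.g. the CUDA build inside the image).
    total = sess.run(hvd.allreduce(tf.constant(1.0), average=False))
    print("rank", hvd.rank(), "of", hvd.size(), "allreduce sum:", total)
```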

q443048756 · Oct 26 '23 00:10

Uhm... interesting. Although it sounds like a networking problem that you need to work out with your provider. I don't think it's related to mpi-operator.

alculquicondor · Oct 26 '23 14:10

@q443048756 It's likely that the CUDA build in the default image mpioperator/tensorflow-benchmarks:latest is not compatible with your local environment. I suggest you find the right base image here, https://hub.docker.com/r/nvidia/cuda, and then build your own.

kuizhiqing · Oct 30 '23 15:10

@q443048756 It's likely that the CUDA build in the default image mpioperator/tensorflow-benchmarks:latest is not compatible with your local environment. I suggest you find the right base image here, https://hub.docker.com/r/nvidia/cuda, and then build your own.

It sounds reasonable. Thank you for the help.

tenzen-y · Oct 30 '23 17:10

I'm hitting the same failure; maybe it's time to update the example.

wang-mask · Jan 09 '24 02:01