pytorch-operator
MPI distributed training job fails on the master node with the message "MPI process group does not support multi-GPU collectives" but succeeds on the worker nodes
I am trying to deploy distributed MNIST training on EKS using the MPI backend. However, the master node fails with the message "MPI process group does not support multi-GPU collectives".
Here is the detailed output from the master node:
Using CUDA
Using distributed PyTorch with mpi backend
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz
9920512it [00:00, 28346258.66it/s]
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
32768it [00:00, 787800.88it/s]
Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
1654784it [00:00, 11476008.77it/s]
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
8192it [00:00, 337022.08it/s]
Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
Processing...
Done!
Traceback (most recent call last):
File "/var/mnist.py", line 150, in <module>
main()
File "/var/mnist.py", line 143, in main
train(args, model, device, train_loader, optimizer, epoch, writer)
File "/var/mnist.py", line 42, in train
loss.backward()
File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: MPI process group does not support multi-GPU collectives
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[38209,1],0]
Exit code: 1
--------------------------------------------------------------------------
Meanwhile, the worker nodes run smoothly and finish the job quickly.
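For context, the "multi-GPU collectives" error is typically raised by the MPI process group when a single process hands it tensors living on more than one GPU; with the mpi backend each rank is expected to drive exactly one device. Below is a minimal sketch of that single-GPU-per-rank pattern (the Net class and the DistributedDataParallel wrapping are assumptions about the training script, not the exact contents of /var/mnist.py):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="mpi")   # rank and world size come from mpirun

# Pin this rank to exactly one GPU so no collective ever spans devices.
if torch.cuda.is_available():
    device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
else:
    device = torch.device("cpu")

model = Net().to(device)                 # Net: the MNIST model class (assumed name)
# A single-entry device_ids keeps DDP on one device per process; older DDP
# versions replicate the model across all visible GPUs when it is omitted.
model = DDP(model, device_ids=[device] if device.type == "cuda" else None)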
Here are the corresponding steps I am using to deploy the job:
#!/bin/bash
ks init pytorch-dist
cd pytorch-dist/
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
ks pkg install kubeflow/pytorch-job
ks generate pytorch-operator pytorch-operator
# Apply the component
ks apply default -c pytorch-operator
# Apply the distributed training job
kubectl apply -f /tmp/pytorch_multi_node_training.yaml
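To inspect the run, the job and its replica pods can be checked with kubectl; the pod names below assume the operator's usual <job-name>-<replica>-<index> naming and may differ in your cluster:

# Check overall job status and per-replica pod logs
kubectl get pytorchjobs
kubectl describe pytorchjob kubeflow-pytorch-gpu-dist-job
kubectl logs kubeflow-pytorch-gpu-dist-job-master-0
kubectl logs kubeflow-pytorch-gpu-dist-job-worker-0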
And the corresponding pytorch_multi_node_training.yaml looks like this:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: kubeflow-pytorch-gpu-dist-job
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - args:
                - --backend
                - mpi
                - --epochs
                - '10'
              image: ***
              name: pytorch
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - args:
                - --backend
                - mpi
                - --epochs
                - '10'
              image: ***
              name: pytorch
              resources:
                limits:
                  nvidia.com/gpu: 1
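Note that only the Worker spec requests a GPU; for comparison, a Master spec that also sets a GPU limit would look like the sketch below. Whether the master actually needs one depends on the training script, so treat this as an illustration rather than a confirmed fix:

    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - args:
                - --backend
                - mpi
                - --epochs
                - '10'
              image: ***
              name: pytorch
              resources:
                limits:
                  nvidia.com/gpu: 1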
The image I am using is similar to the Dockerfile template available in this repo.
I built the image with PyTorch compiled from source, and OpenMPI is already installed in the image. The mpi backend is correctly enabled, as verified by this sanity check in Python:
In [1]: torch.distributed.is_mpi_available()
Out[1]: True
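is_mpi_available() only confirms that PyTorch was compiled with MPI support. A quick runtime check also verifies that the process group initializes and a basic collective works; this is a sketch, check_mpi.py is a hypothetical file name, and it would be run inside the same image with mpirun -np 2 python check_mpi.py:

# check_mpi.py -- hypothetical file; run with: mpirun -np 2 python check_mpi.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="mpi")   # rank/world size come from the MPI runtime
t = torch.ones(1)                        # CPU tensor; avoids needing CUDA-aware MPI
dist.all_reduce(t)                       # sums the tensor across ranks
print(f"rank {dist.get_rank()} of {dist.get_world_size()}: {t.item()}")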
@YYStreet This issue seems to be outside the scope of pytorch-operator. Can you post it in the PyTorch forums?
/area engprod
/kind feature
/priority p2