pytorch-operator
MPI distributed training job fails on the master node with the message "MPI process group does not support multi-GPU collectives" but succeeds on the worker nodes
I am trying to deploy distributed MNIST training on EKS using the MPI backend. However, the master node fails with the message "MPI process group does not support multi-GPU collectives".
Here is the detailed output from the master node:
Using CUDA
Using distributed PyTorch with mpi backend
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz to ../data/MNIST/raw/train-images-idx3-ubyte.gz
9920512it [00:00, 28346258.66it/s]
Extracting ../data/MNIST/raw/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz to ../data/MNIST/raw/train-labels-idx1-ubyte.gz
32768it [00:00, 787800.88it/s]
Extracting ../data/MNIST/raw/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz to ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
Extracting ../data/MNIST/raw/t10k-images-idx3-ubyte.gz
1654784it [00:00, 11476008.77it/s]
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz to ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
8192it [00:00, 337022.08it/s]
Extracting ../data/MNIST/raw/t10k-labels-idx1-ubyte.gz
Processing...
Done!
Traceback (most recent call last):
File "/var/mnist.py", line 150, in <module>
main()
File "/var/mnist.py", line 143, in main
train(args, model, device, train_loader, optimizer, epoch, writer)
File "/var/mnist.py", line 42, in train
loss.backward()
File "/opt/conda/lib/python3.6/site-packages/torch/tensor.py", line 107, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph)
File "/opt/conda/lib/python3.6/site-packages/torch/autograd/__init__.py", line 93, in backward
allow_unreachable=True) # allow_unreachable flag
RuntimeError: MPI process group does not support multi-GPU collectives
--------------------------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun.real detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
Process name: [[38209,1],0]
Exit code: 1
--------------------------------------------------------------------------
Meanwhile, the worker nodes run smoothly and finish the job quickly.
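For context, the "multi-GPU collectives" error is typically raised by the MPI process group when a single process hands it tensors living on more than one GPU; with the mpi backend each rank is expected to drive exactly one device. Below is a minimal sketch of that single-GPU-per-rank pattern (the Net class and the DistributedDataParallel wrapping are assumptions about the training script, not the exact contents of /var/mnist.py):

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="mpi")   # rank and world size come from mpirun

# Pin this rank to exactly one GPU so no collective ever spans devices.
if torch.cuda.is_available():
    device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
else:
    device = torch.device("cpu")

model = Net().to(device)                 # Net: the MNIST model class (assumed name)
# A single-entry device_ids keeps DDP on one device per process; older DDP
# versions replicate the model across all visible GPUs when it is omitted.
model = DDP(model, device_ids=[device] if device.type == "cuda" else None)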
Here are the corresponding steps I am using to deploy the job:
#!/bin/bash
ks init pytorch-dist
cd pytorch-dist/
ks registry add kubeflow github.com/kubeflow/kubeflow/tree/master/kubeflow
ks pkg install kubeflow/pytorch-job
ks generate pytorch-operator pytorch-operator
# Apply the component
ks apply default -c pytorch-operator
# Apply the distributed training job
kubectl apply -f /tmp/pytorch_multi_node_training.yaml
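To inspect the run, the job and its replica pods can be checked with kubectl; the pod names below assume the operator's usual <job-name>-<replica>-<index> naming and may differ in your cluster:

# Check overall job status and per-replica pod logs
kubectl get pytorchjobs
kubectl describe pytorchjob kubeflow-pytorch-gpu-dist-job
kubectl logs kubeflow-pytorch-gpu-dist-job-master-0
kubectl logs kubeflow-pytorch-gpu-dist-job-worker-0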
And the corresponding pytorch_multi_node_training.yaml looks like this:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: kubeflow-pytorch-gpu-dist-job
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - args:
                - --backend
                - mpi
                - --epochs
                - '10'
              image: ***
              name: pytorch
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - args:
                - --backend
                - mpi
                - --epochs
                - '10'
              image: ***
              name: pytorch
              resources:
                limits:
                  nvidia.com/gpu: 1
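Note that only the Worker spec requests a GPU; for comparison, a Master spec that also sets a GPU limit would look like the sketch below. Whether the master actually needs one depends on the training script, so treat this as an illustration rather than a confirmed fix:

    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - args:
                - --backend
                - mpi
                - --epochs
                - '10'
              image: ***
              name: pytorch
              resources:
                limits:
                  nvidia.com/gpu: 1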
The image I am using is similar to the Dockerfile template available in this repo.
I built the image with PyTorch compiled from source, and OpenMPI is already installed in the image. The mpi backend is correctly enabled, as verified by this sanity check in Python:
In [1]: torch.distributed.is_mpi_available()
Out[1]: True
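is_mpi_available() only confirms that PyTorch was compiled with MPI support. A quick runtime check also verifies that the process group initializes and a basic collective works; this is a sketch, check_mpi.py is a hypothetical file name, and it would be run inside the same image with mpirun -np 2 python check_mpi.py:

# check_mpi.py -- hypothetical file; run with: mpirun -np 2 python check_mpi.py
import torch
import torch.distributed as dist

dist.init_process_group(backend="mpi")   # rank/world size come from the MPI runtime
t = torch.ones(1)                        # CPU tensor; avoids needing CUDA-aware MPI
dist.all_reduce(t)                       # sums the tensor across ranks
print(f"rank {dist.get_rank()} of {dist.get_world_size()}: {t.item()}")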
@YYStreet This issue seems to be outside the scope of pytorch-operator. Can you post it in the PyTorch forums?
/area engprod
/kind feature
/priority p2