pytorch-operator
Multi-GPU in a single pod
Hi Team,
I am trying to run a Kubernetes pod with multiple GPUs attached to the same pod, but I can't find any resources on how to do this. Everything I find assumes 1 pod = 1 GPU, which is not what I want. I want to be able to spin up 2 pods with 4 GPUs each (8 GPUs total), or other combinations.
This seems to have been asked before in #219 and #331, but there are no solid answers there.
The YAML file I have based my testing on is from this tutorial: https://towardsdatascience.com/pytorch-distributed-on-kubernetes-71ed8b50a7ee
I have changed part of it to request 2 GPUs in 1 pod:
Worker:
  replicas: 1
  restartPolicy: OnFailure
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      volumes:
        - name: pv-k8s-storage
          persistentVolumeClaim:
            claimName: pvc-k8s-storage
      containers:
        - name: pytorch
          command: ["/bin/sh"]
          args: ["-c", "/usr/bin/python3 -m pip install --upgrade pip; pip install tensorboardX pandas scikit-learn; python3 ranzrc.py --epochs 5 --ba$
          image: pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime
          resources:
            requests:
              nvidia.com/gpu: 2
            limits:
              nvidia.com/gpu: 2
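With the resources block above, both GPUs should at least be visible inside the container. A throwaway snippet (not part of the job itself, just something to run inside the pod) to confirm that:

# sanity check: run inside the worker container
import torch

print("visible GPUs:", torch.cuda.device_count())  # expect 2 with the spec above
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))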
I am seeing behaviour similar to #219: when I spin this up, only 1 GPU gets used by the test code, even though I told it to use 2.
Any assistance or a pointer in the right direction would be great. Thanks!
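For context, my understanding is that requesting nvidia.com/gpu: 2 only makes both devices visible to the container; the training script itself still has to drive them, e.g. by spawning one process per local GPU with DistributedDataParallel. A minimal sketch of what I mean (placeholder model and script name, not my actual ranzrc.py):

# ddp_single_pod.py -- hypothetical sketch: one DDP process per GPU in one pod
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # All processes live in the same pod, so a localhost rendezvous works
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 1).cuda(rank)  # placeholder model
    model = DDP(model, device_ids=[rank])

    # ... training loop over a DistributedSampler-backed DataLoader ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # 2 with the pod spec above
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

Alternatively, wrapping the model in torch.nn.DataParallel would use both devices from a single process, though DDP is the generally recommended approach.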
Maybe you can have a look at what I did in this issue: https://github.com/kubeflow/pytorch-operator/issues/354#issue-999999536.
Best wishes.
This repository will be deprecated soon; please open an issue at github.com/kubeflow/training-operator instead.