pytorch-operator
Multi-GPU in a single pod
Hi Team,
I am trying to run a Kubernetes pod with multiple GPUs attached to the same pod, but I can't find any resources on how to do this. Everything I find assumes 1 pod = 1 GPU, which is not what I want. I want to be able to spin up 2 pods with 4 GPUs each (8 GPUs total), or other combinations.
This seems to have been asked before in #219 and #331, but there are no solid answers there.
The YAML file I have based my testing on is from this tutorial: https://towardsdatascience.com/pytorch-distributed-on-kubernetes-71ed8b50a7ee
I have changed part of it to request 2 GPUs in 1 pod:
Worker:
  replicas: 1
  restartPolicy: OnFailure
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
    spec:
      volumes:
        - name: pv-k8s-storage
          persistentVolumeClaim:
            claimName: pvc-k8s-storage
      containers:
        - name: pytorch
          command: ["/bin/sh"]
          args: ["-c", "/usr/bin/python3 -m pip install --upgrade pip; pip install tensorboardX pandas scikit-learn; python3 ranzrc.py --epochs 5 --ba$
          image: pytorch/pytorch:1.10.0-cuda11.3-cudnn8-runtime
          resources:
            requests:
              nvidia.com/gpu: 2
            limits:
              nvidia.com/gpu: 2
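With the resources block above, both GPUs should at least be visible inside the container. A throwaway snippet (not part of the job itself, just something to run inside the pod) to confirm that:

# sanity check: run inside the worker container
import torch

print("visible GPUs:", torch.cuda.device_count())  # expect 2 with the spec above
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))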
I am seeing behaviour similar to #219: when I spin this up, only 1 GPU gets used by the test code, even though I told it to use 2.
Any assistance or a pointer in the right direction would be great. Thanks!
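For context, my understanding is that requesting nvidia.com/gpu: 2 only makes both devices visible to the container; the training script itself still has to drive them, e.g. by spawning one process per local GPU with DistributedDataParallel. A minimal sketch of what I mean (placeholder model and script name, not my actual ranzrc.py):

# ddp_single_pod.py -- hypothetical sketch: one DDP process per GPU in one pod
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP

def worker(rank, world_size):
    # All processes live in the same pod, so a localhost rendezvous works
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(10, 1).cuda(rank)  # placeholder model
    model = DDP(model, device_ids=[rank])

    # ... training loop over a DistributedSampler-backed DataLoader ...

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # 2 with the pod spec above
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

Alternatively, wrapping the model in torch.nn.DataParallel would use both devices from a single process, though DDP is the generally recommended approach.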
Maybe you can have a look at what I did in this issue: https://github.com/kubeflow/pytorch-operator/issues/354#issue-999999536.
Best wishes.
This repository will be deprecated soon; please open an issue at github.com/kubeflow/training-operator instead.