How to run the PyTorch MNIST DDP example
I have Kubeflow deployed, but I run into a problem when running the official MNIST example. How should I solve it? The YAML for the PyTorchJob is as follows:
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-mnist-ddp-gpu
  namespace: kubeflow-user-example-com
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - image: gcr.io/kubeflow-examples/pytorch-mnist-ddp-gpu
              name: pytorch
              resources:
                limits:
                  cpu: '1'
                  memory: 4Gi
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /mnt/kubeflow-gcfs
                  name: kubeflow-gcfs
          volumes:
            - name: kubeflow-gcfs
              persistentVolumeClaim:
                claimName: kubeflow-gcfs
                readOnly: false
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - image: gcr.io/kubeflow-examples/pytorch-mnist-ddp-gpu
              name: pytorch
              resources:
                limits:
                  cpu: '1'
                  memory: 4Gi
                  nvidia.com/gpu: 1
              volumeMounts:
                - mountPath: /mnt/kubeflow-gcfs
                  name: kubeflow-gcfs
          volumes:
            - name: kubeflow-gcfs
              persistentVolumeClaim:
                claimName: kubeflow-gcfs
                readOnly: false
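
For reference, here is a rough sketch of what the manifest above asks the cluster for. It is not part of Kubeflow: `summarize_pytorchjob` and `replica_spec` are hypothetical helpers, and the manifest is rebuilt by hand as a plain Python dict instead of being parsed from the YAML file. It counts the pods and GPUs requested and checks that every volume mount refers to a declared volume. Pods generally cannot be scheduled until the requested GPUs and the PVC are available, so these may be worth checking first.

```python
# Hypothetical helpers for illustration only; not a Kubeflow API.

def summarize_pytorchjob(job):
    """Return (total_pods, total_gpus, problems) for a PyTorchJob dict."""
    total_pods = 0
    total_gpus = 0
    problems = []
    specs = job.get("spec", {}).get("pytorchReplicaSpecs", {})
    for role, replica in specs.items():
        replicas = replica.get("replicas", 1)
        pod_spec = replica.get("template", {}).get("spec", {})
        # Names of the volumes declared at pod level for this role.
        declared = {v["name"] for v in pod_spec.get("volumes", [])}
        total_pods += replicas
        for container in pod_spec.get("containers", []):
            limits = container.get("resources", {}).get("limits", {})
            total_gpus += replicas * int(limits.get("nvidia.com/gpu", 0))
            # Every mount must point at a volume declared in the same pod spec.
            for mount in container.get("volumeMounts", []):
                if mount["name"] not in declared:
                    problems.append(
                        f"{role}: mount '{mount['name']}' has no matching volume"
                    )
    return total_pods, total_gpus, problems


def replica_spec(replicas):
    """Build one replica spec matching the manifest in this post."""
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {"spec": {
            "containers": [{
                "image": "gcr.io/kubeflow-examples/pytorch-mnist-ddp-gpu",
                "name": "pytorch",
                "resources": {"limits": {
                    "cpu": "1", "memory": "4Gi", "nvidia.com/gpu": 1,
                }},
                "volumeMounts": [{
                    "mountPath": "/mnt/kubeflow-gcfs",
                    "name": "kubeflow-gcfs",
                }],
            }],
            "volumes": [{
                "name": "kubeflow-gcfs",
                "persistentVolumeClaim": {
                    "claimName": "kubeflow-gcfs", "readOnly": False,
                },
            }],
        }},
    }


job = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "PyTorchJob",
    "metadata": {
        "name": "pytorch-mnist-ddp-gpu",
        "namespace": "kubeflow-user-example-com",
    },
    "spec": {"pytorchReplicaSpecs": {
        "Master": replica_spec(1),
        "Worker": replica_spec(2),
    }},
}

pods, gpus, problems = summarize_pytorchjob(job)
print(pods, gpus, problems)  # prints: 3 3 []
```

So the job needs three pods with one GPU each, and all three mount the PVC `kubeflow-gcfs`, which has to exist in the `kubeflow-user-example-com` namespace.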