pytorch-operator issues

unable to build image for ppc64le

Hi Team, I have a binary of pytorch-operator, but gcloud is not supported on the ppc64le platform. How can I build an image for pytorch-operator on ppc64le in this scenario?

gajanankulkarni-18

PytorchJob DDP training will stop if I delete a worker pod

2

Hi, everyone. I want to test the fault tolerance of PytorchJob. I started a PytorchJob with 1 master and 3 workers. `$ kubectl get pods -o wide` NAME READY...

Shuai-Xie

Multi-gpu in a single pod

2

Hi Team, I am trying to run a Kubernetes Pod with multiple GPUs in the same pod. I can't seem to find any resources for how to do this. All...

wallarug

run https://github.com/kubeflow/pytorch-operator/blob/master/sdk/python/test/test_e2e.py failed

1

Traceback (most recent call last):
  File "/home/sxl/PythonDev/k8s_client/pytorch_main.py", line 69, in <module>
    pytorchjob_client.create(pytorchjob)
  File "/home/sxl/.local/lib/python3.8/site-packages/kubeflow/pytorchjob/api/py_torch_job_client.py", line 65, in create
    outputs = self.custom_api.create_namespaced_custom_object(
  File "/home/sxl/.local/lib/python3.8/site-packages/kubernetes/client/api/custom_objects_api.py", line 225, in create_namespaced_custom_object
    return self.create_namespaced_custom_object_with_http_info(group, version, namespace,...

sxl1993

Right way to use pytorch-operator for multi-node multi-gpu setup

13

Hi! Suppose my cluster has 2 nodes with 2 GPUs each. What is the best practice for using all 4 GPUs: 1. To spawn 4 pods with 1...

lainisourgod

kind/question

area/engprod

priority/p2

whether multi-gpu-per-pod setup be supported in PytorchJob

1

If there are 2 GPUs per node, how should the Worker spec be set in the PytorchJob: 1 replica with 2 GPUs per pod, or 2 replicas with only 1 GPU per pod? I've...

tingweiwu

service label mismatches selector, which result in inconsistency

3

Service label is `app: pytorch-operator`, while selector is `name: pytorch-operator`. Deployment spec label and selector are both `name: pytorch-operator`. ![image](https://user-images.githubusercontent.com/18288851/140506943-00a4c917-2d89-4145-8e4c-e34dd548077d.png) In such a case, both the service and deployment have...

konnase

kind/bug

The training hangs after reloading one of master/worker pods

5

Hello! I'm setting up training with PyTorchJobs. I have a problem: if one of the pods (master or worker, it doesn't matter) reloads, the whole process hangs. The reason for reloading...

dmitsf

kind/question

area/engprod

GCP preemptible instances

4

Just wondering how this operator handles being run on preemptible GCP instances, and where I can find more documentation on the subject. Thanks.

Nintorac

Distributed mnist is unexpectedly slow

7

I ran the `mnist` example with 2 workers on a 2-node Kubernetes cluster running on 2 VMs, and expected it to be faster compared with the 1-worker case. However, the time actually increased,...

panchul

kind/bug
