PyTorch on Kubernetes

Results 63 pytorch-operator issues

Hi Team, I have a binary of pytorch-operator, but gcloud is not supported on the ppc64le platform. How can I build the pytorch-operator image on ppc64le in this scenario?

Hi, everyone. I want to test the failure tolerance of PytorchJob. I started a PytorchJob with 1 master and 3 workers. ```shell $ kubectl get pods -o wide NAME READY...

Hi Team, I am trying to run a Kubernetes Pod with multiple GPUs in the same pod. I can't seem to find any resources for how to do this. All...

Traceback (most recent call last):
  File "/home/sxl/PythonDev/k8s_client/pytorch_main.py", line 69, in
    pytorchjob_client.create(pytorchjob)
  File "/home/sxl/.local/lib/python3.8/site-packages/kubeflow/pytorchjob/api/py_torch_job_client.py", line 65, in create
    outputs = self.custom_api.create_namespaced_custom_object(
  File "/home/sxl/.local/lib/python3.8/site-packages/kubernetes/client/api/custom_objects_api.py", line 225, in create_namespaced_custom_object
    return self.create_namespaced_custom_object_with_http_info(group, version, namespace,...

Hi! Suppose in my cluster I have 2 nodes with 2 gpus each. What is the better practice for using all 4 gpus: 1. To spawn 4 pods with 1...

kind/question
area/engprod
priority/p2

If there are 2 GPUs per node, how should I set the Worker spec in the PytorchJob: 1 replica with 2 GPUs per pod, or 2 replicas with only 1 GPU per pod? I've...
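The two layouts asked about above can be written out as plain PyTorchJob manifests to compare them. This is only a sketch under assumptions: the helper names (`replica_spec`, `pytorchjob`, `total_gpus`, `total_pods`), the job names, and the image `example.com/train:latest` are hypothetical, and GPUs are assumed to be exposed as `nvidia.com/gpu` by the NVIDIA device plugin. It shows how each layout is expressed and does not say which one trains faster.

```python
GPU_RESOURCE = "nvidia.com/gpu"  # assumes the NVIDIA device plugin is installed


def replica_spec(replicas, gpus_per_pod, image):
    """Build one entry of spec.pytorchReplicaSpecs as a plain dict."""
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "pytorch",
                        "image": image,  # hypothetical image name
                        "resources": {"limits": {GPU_RESOURCE: gpus_per_pod}},
                    }
                ]
            }
        },
    }


def pytorchjob(name, workers, gpus_per_pod, image="example.com/train:latest"):
    """Build a PyTorchJob manifest with one Master and `workers` Worker pods."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica_spec(1, gpus_per_pod, image),
                "Worker": replica_spec(workers, gpus_per_pod, image),
            }
        },
    }


def total_pods(job):
    """Count the pods requested across all replica types."""
    return sum(s["replicas"] for s in job["spec"]["pytorchReplicaSpecs"].values())


def total_gpus(job):
    """Sum the GPU limits requested across all replica types."""
    total = 0
    for spec in job["spec"]["pytorchReplicaSpecs"].values():
        container = spec["template"]["spec"]["containers"][0]
        total += spec["replicas"] * container["resources"]["limits"][GPU_RESOURCE]
    return total


# Two nodes with 2 GPUs each, counting the Master as a training pod:
# layout A uses 2 pods with 2 GPUs each, layout B uses 4 pods with 1 GPU each.
layout_a = pytorchjob("two-gpu-pods", workers=1, gpus_per_pod=2)
layout_b = pytorchjob("one-gpu-pods", workers=3, gpus_per_pod=1)

print(total_pods(layout_a), total_gpus(layout_a))
print(total_pods(layout_b), total_gpus(layout_b))
```

Both manifests request the same number of GPUs; they differ in how many pods the scheduler has to place and in whether the training script must drive more than one GPU inside a single pod.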

Service label is `app: pytorch-operator`, while selector is `name: pytorch-operator`. Deployment spec label and selector are both `name: pytorch-operator`. ![image](https://user-images.githubusercontent.com/18288851/140506943-00a4c917-2d89-4145-8e4c-e34dd548077d.png) In such a case, both the service and deployment have...

kind/bug
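To make the label and selector relationship in the report above concrete, here is a minimal sketch of equality-based selector matching. The function name `selects` and the dict layout are illustrative assumptions, not code from the operator; the label values are the ones quoted in the issue. A Service routes to Pods whose labels contain every key/value pair in `spec.selector`, and the Service's own `metadata.labels` take no part in that match.

```python
def selects(selector, pod_labels):
    """Return True if every selector key/value pair appears in pod_labels.

    Simplification: an empty selector returns False here, since a Service
    without a selector does not pick up Pods automatically.
    """
    if not selector:
        return False
    return all(pod_labels.get(key) == value for key, value in selector.items())


# Values taken from the issue text; the surrounding structure is illustrative.
service = {
    "metadata": {"labels": {"app": "pytorch-operator"}},
    "spec": {"selector": {"name": "pytorch-operator"}},
}
deployment_pod_labels = {"name": "pytorch-operator"}

# The Service selector matches the Pod labels set by the Deployment.
print(selects(service["spec"]["selector"], deployment_pod_labels))

# The Service's own labels would not match those Pods if used as a selector.
print(selects(service["metadata"]["labels"], deployment_pod_labels))
```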

Hello! I'm setting up training with PyTorchJobs and have a problem: if one of the pods (master or worker, it doesn't matter) reloads, the whole process hangs. The reason for reloading...

kind/question
area/engprod

Just wondering how this operator handles being run on preemptible GCP instances, and where I can find more documentation on the subject. Thanks

I ran the `mnist` example with 2 workers on a 2-node Kubernetes cluster running on 2 VMs, and expected it to be faster than the 1-worker case. However, the time actually increased,...

kind/bug