pytorch-operator
PyTorch on Kubernetes
Hi Team, I have a binary of pytorch-operator, but gcloud is not supported on the ppc64le platform. How can I build the pytorch-operator image on ppc64le in this scenario?
Hi, everyone. I want to test the failure tolerance of PytorchJob. I started a PytorchJob with 1 master and 3 workers. ```shell $ kubectl get pods -o wide NAME READY...
Hi Team, I am trying to run a Kubernetes Pod with multiple GPUs in the same pod. I can't seem to find any resources for how to do this. All...
Traceback (most recent call last): File "/home/sxl/PythonDev/k8s_client/pytorch_main.py", line 69, in <module> pytorchjob_client.create(pytorchjob) File "/home/sxl/.local/lib/python3.8/site-packages/kubeflow/pytorchjob/api/py_torch_job_client.py", line 65, in create outputs = self.custom_api.create_namespaced_custom_object( File "/home/sxl/.local/lib/python3.8/site-packages/kubernetes/client/api/custom_objects_api.py", line 225, in create_namespaced_custom_object return self.create_namespaced_custom_object_with_http_info(group, version, namespace,...
Hi! Suppose my cluster has 2 nodes with 2 GPUs each. What is the better practice for using all 4 GPUs: 1. To spawn 4 pods with 1...
If there are 2 GPUs per node, how should the Worker spec in the PyTorchJob be set: 1 replica with 2 GPUs per pod, or 2 replicas with only 1 GPU per pod? I've...
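For reference, the two layouts this question contrasts can be written out as a rough Python sketch of a `Worker` replica spec. The helper names and the image `example.com/train:latest` are hypothetical placeholders, and the Master replica is left out; the sketch only shows that both layouts request the same number of GPUs and differ in how they are spread across pods.

```python
def worker_replica_spec(replicas, gpus_per_pod, image="example.com/train:latest"):
    """Build a Worker entry for spec.pytorchReplicaSpecs (image is a placeholder)."""
    return {
        "replicas": replicas,
        "restartPolicy": "OnFailure",
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "pytorch",
                        "image": image,
                        # GPUs are requested per container through resource limits.
                        "resources": {"limits": {"nvidia.com/gpu": gpus_per_pod}},
                    }
                ]
            }
        },
    }


def total_gpus(spec):
    """Total GPUs requested by all replicas of one replica spec."""
    containers = spec["template"]["spec"]["containers"]
    per_pod = sum(c["resources"]["limits"].get("nvidia.com/gpu", 0) for c in containers)
    return spec["replicas"] * per_pod


one_pod_two_gpus = worker_replica_spec(replicas=1, gpus_per_pod=2)
two_pods_one_gpu = worker_replica_spec(replicas=2, gpus_per_pod=1)
```

With the first layout the training script itself has to drive both GPUs inside the pod; with the second, each pod sees a single GPU. Which one performs better is not something this sketch settles.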
Service label is `app: pytorch-operator`, while selector is `name: pytorch-operator`. Deployment spec label and selector are both `name: pytorch-operator`.  In such a case, both the service and deployment have...
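As a small illustration of the labels described in this report: a Service's equality-based selector is compared against Pod labels, and the Service's own metadata labels are not consulted. The sketch below mimics that matching rule in plain Python; `selector_matches` is a hypothetical helper, not a Kubernetes client call, and the dictionaries simply restate the labels quoted in the report.

```python
def selector_matches(selector, labels):
    """True if every key/value pair in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


# Labels as quoted in the report.
service_labels = {"app": "pytorch-operator"}        # the Service's own metadata label
service_selector = {"name": "pytorch-operator"}     # what the Service selects on
pod_template_labels = {"name": "pytorch-operator"}  # labels on the Deployment's pods

# The selector is evaluated against the pod labels, so this is the check that matters.
pods_selected = selector_matches(service_selector, pod_template_labels)

# The Service's own label would not satisfy its selector, but it is never compared.
service_label_matches = selector_matches(service_selector, service_labels)
```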
Hello! I'm setting up training with PyTorchJobs. I have a problem: if one of the pods (master or worker, it doesn't matter) reloads, the whole process hangs. The reason for reloading...
Just wondering how this operator handles being run on preemptible GCP instances, and where I can find more documentation on the subject. Thanks.
I ran the `mnist` example with 2 workers on a 2-node Kubernetes cluster running on 2 VMs, and expected it to be faster than the 1-worker case. However, the time actually increased,...