pytorch-operator
PyTorch on Kubernetes
The [current CRD](https://github.com/kubeflow/pytorch-operator/blob/a502590d8d340186604e695c55b4cc6cea5cee0d/manifests/base/crd.yaml#L1) uses the `apiextensions.k8s.io/v1beta1` API, which is [deprecated](https://kubernetes.io/docs/reference/using-api/deprecation-guide/#customresourcedefinition-v122) and is no longer served as of Kubernetes v1.22.
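To illustrate the check described in this issue, here is a minimal Python sketch that classifies a manifest by its CRD API version. The function name `crd_api_status` and its return strings are hypothetical, and the manifest is assumed to be already parsed into a dict; only the two `apiVersion` strings come from the Kubernetes deprecation guide.

```python
# API group/version strings for CustomResourceDefinition objects.
DEPRECATED_CRD_API = "apiextensions.k8s.io/v1beta1"  # not served from Kubernetes v1.22
SUPPORTED_CRD_API = "apiextensions.k8s.io/v1"


def crd_api_status(manifest: dict) -> str:
    """Classify a parsed manifest (hypothetical helper, not part of the operator).

    Returns "not-a-crd", "deprecated", "ok", or "unknown".
    """
    if manifest.get("kind") != "CustomResourceDefinition":
        return "not-a-crd"
    api_version = manifest.get("apiVersion")
    if api_version == DEPRECATED_CRD_API:
        return "deprecated"
    if api_version == SUPPORTED_CRD_API:
        return "ok"
    return "unknown"


if __name__ == "__main__":
    # Illustrative manifest only; the metadata name is an assumption.
    example = {
        "apiVersion": "apiextensions.k8s.io/v1beta1",
        "kind": "CustomResourceDefinition",
        "metadata": {"name": "example.kubeflow.org"},
    }
    print(crd_api_status(example))
```

Note that moving a CRD to `apiextensions.k8s.io/v1` usually takes more than changing the version string, since the v1 API requires a structural schema for each served version.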
Hello, dear developers. I found a problem when using PyTorchJob. ## Problem I noticed that PyTorchJob replica pods don't obey the scheduling rules set in the node affinity. All the...
Hi Team, I am trying to use the PyTorch Operator to spawn distributed PyTorch jobs. I see that the image mentioned in https://github.com/kubeflow/pytorch-operator/blob/6293efc19503078953acf04df03a1204fd265e35/manifests/kustomization.yaml#L13 is `809251082950.dkr.ecr.us-west-2.amazonaws.com/pytorch-operator`. However, that repo is not...
I get a 403. If this approach is supported, how should I set up the config file? Thanks. `ptc.get(namespace='kubeflow')` Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/kubeflow/pytorchjob/api/py_torch_job_client.py", line 134, in get pytorchjob...
The QoS class of the worker pods created by the operator is Burstable because of the resource config here: https://github.com/kubeflow/pytorch-operator/blob/4aeb6503162465766476519339d3285f75ffe03e/pkg/common/config/config.go#L9-L20 This is a vital issue because only pods in the Guaranteed class can...
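To make the Burstable versus Guaranteed distinction concrete, here is a simplified Python sketch of how Kubernetes assigns a QoS class from container resources. The function `qos_class` and the sample values are hypothetical and are not taken from the operator's config. The sketch compares quantities as plain strings, so it would treat `"1"` and `"1000m"` as different even though Kubernetes considers them equal.

```python
def qos_class(containers):
    """Approximate the pod QoS class.

    `containers` is a list of dicts shaped like a container's `resources`
    field, e.g. {"limits": {"cpu": "1"}, "requests": {"cpu": "1"}}.
    """
    any_set = False      # does any container set a cpu/memory request or limit?
    guaranteed = True    # does every container have requests == limits for both?
    for res in containers:
        limits = res.get("limits") or {}
        # Kubernetes defaults a missing request to the limit for that resource.
        requests = {**limits, **(res.get("requests") or {})}
        for name in ("cpu", "memory"):
            if name in limits or name in requests:
                any_set = True
            if name not in limits or requests.get(name) != limits[name]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    worker = {"limits": {"cpu": "4", "memory": "8Gi"},
              "requests": {"cpu": "4", "memory": "8Gi"}}
    # Hypothetical extra container whose requests are lower than its limits.
    extra = {"limits": {"cpu": "100m", "memory": "20Mi"},
             "requests": {"cpu": "50m", "memory": "10Mi"}}
    print(qos_class([worker]))
    print(qos_class([worker, extra]))
```

The second call shows the pattern the issue describes: one container whose requests differ from its limits is enough to move the whole pod out of the Guaranteed class.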
During distributed training with a PyTorchJob, the 'Worker' sometimes cannot find the 'Master' and fails with the message below. ``` Traceback (most recent call last): File "/workspace/src/bert/benchmark.py", line 2248, in main() File "/workspace/src/bert/benchmark.py",...
## Overview Referring to the [Distributed MNIST example](https://github.com/kubeflow/pytorch-operator/tree/master/examples/mnist), I am running into an issue where the worker pods return "call to connect returned Connection refused" repeatedly before crashing with an...
When running the following YAML,
```
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: my-pytorchjob
  namespace: my-namespace
spec:
  activeDeadlineSeconds: -1
  cleanPodPolicy: Running
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject:...
```
I'm completely new to Kubeflow, but its main advantage from my perspective is the use of pipelines to set up production-ready, end-to-end machine learning workflows. I'm...
Inspired by https://github.com/kubeflow/pipelines/issues/4682, I created a script that generates a config file for Dependabot so that it knows which directories to scan. It will scan the repository for files...
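The script itself is not shown in this excerpt. As a rough idea of what such a generator could look like, here is a hedged Python sketch. The file-to-ecosystem mapping, the skipped directories, the weekly interval, and both function names are assumptions rather than the author's choices; the output follows the Dependabot version 2 config layout (`version`, `updates`, `package-ecosystem`, `directory`, `schedule.interval`).

```python
import os

# Assumed mapping from manifest file name to Dependabot package ecosystem.
ECOSYSTEMS = {
    "requirements.txt": "pip",
    "package.json": "npm",
    "go.mod": "gomod",
    "Dockerfile": "docker",
}

# Directories that are assumed not worth scanning.
SKIP_DIRS = {".git", "node_modules", "vendor"}


def find_update_targets(root):
    """Walk `root` and return sorted (ecosystem, directory) pairs."""
    targets = set()
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        rel = os.path.relpath(dirpath, root)
        directory = "/" if rel == "." else "/" + rel.replace(os.sep, "/")
        for name in filenames:
            if name in ECOSYSTEMS:
                targets.add((ECOSYSTEMS[name], directory))
    return sorted(targets)


def render_dependabot_config(targets, interval="weekly"):
    """Render the targets as dependabot.yml text (version 2 format)."""
    lines = ["version: 2", "updates:"]
    for ecosystem, directory in targets:
        lines += [
            f'  - package-ecosystem: "{ecosystem}"',
            f'    directory: "{directory}"',
            "    schedule:",
            f'      interval: "{interval}"',
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_dependabot_config(find_update_targets(".")))
```

The rendered text would normally be written to `.github/dependabot.yml` in the repository.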