pytorch-operator

PyTorch on Kubernetes

Results 63 pytorch-operator issues

The [current CRD](https://github.com/kubeflow/pytorch-operator/blob/a502590d8d340186604e695c55b4cc6cea5cee0d/manifests/base/crd.yaml#L1) uses the `apiextensions.k8s.io/v1beta1` API, which is [deprecated](https://kubernetes.io/docs/reference/using-api/deprecation-guide/#customresourcedefinition-v122) and no longer served as of Kubernetes v1.22.

Hello, dear developers. I found a problem when using PyTorchJob. ## Problem I noticed that **PyTorchJob replica pods don't obey the scheduling rules set in the node affinity. All the...

Hi Team, I am trying to use the PyTorch Operator to spawn distributed PyTorch jobs. I see that the image mentioned in https://github.com/kubeflow/pytorch-operator/blob/6293efc19503078953acf04df03a1204fd265e35/manifests/kustomization.yaml#L13 is `809251082950.dkr.ecr.us-west-2.amazonaws.com/pytorch-operator`. However, that repo is not...

kind/bug

I get a 403. If this usage is supported, how should I set up the config file? Thanks. ptc.get(namespace='kubeflow') Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/kubeflow/pytorchjob/api/py_torch_job_client.py", line 134, in get pytorchjob...

The QoS class of the worker pods created by the operator is Burstable because of the resource config here: https://github.com/kubeflow/pytorch-operator/blob/4aeb6503162465766476519339d3285f75ffe03e/pkg/common/config/config.go#L9-L20 This is a vital issue because only pods in the Guaranteed class can...
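For context, Kubernetes derives a pod's QoS class from the CPU and memory requests and limits of its containers. The sketch below approximates that rule; it is not the kubelet's implementation. The function name `qos_class` and the dict layout are hypothetical, and quantities are compared as raw strings, so `"1"` and `"1000m"` would be treated as different values.

```python
from typing import Dict, List

# Hypothetical layout: {"requests": {"cpu": "1"}, "limits": {"cpu": "1"}}
Resources = Dict[str, Dict[str, str]]


def qos_class(containers: List[Resources]) -> str:
    """Approximate the Kubernetes QoS class for a pod's containers."""
    any_set = False      # does any container set a cpu/memory request or limit?
    guaranteed = True    # does every container meet the Guaranteed criteria?
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            req = requests.get(resource)
            lim = limits.get(resource)
            if req is not None or lim is not None:
                any_set = True
            # Guaranteed needs a limit for both resources, and any explicit
            # request must equal it (an omitted request defaults to the limit).
            if lim is None or (req is not None and req != lim):
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

Under this rule, a container that sets requests without matching limits for both CPU and memory makes the whole pod Burstable, which is consistent with the behaviour the issue describes.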

During distributed training with a PyTorchJob, the 'Worker' sometimes cannot find the 'Master' and fails with the message below. ``` Traceback (most recent call last): File "/workspace/src/bert/benchmark.py", line 2248, in main() File "/workspace/src/bert/benchmark.py",...

kind/feature

## Overview Referring to the [Distributed MNIST example](https://github.com/kubeflow/pytorch-operator/tree/master/examples/mnist), I am running into an issue where the worker pods return "call to connect returned Connection refused" repeatedly before crashing with an...
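A "Connection refused" error generally means nothing is listening on the target port yet. One possible workaround, which the issue itself does not confirm, is to have each worker wait until the master's port accepts TCP connections before initializing the process group. The sketch below shows such a probe; `wait_for_master` is a hypothetical helper, and reading the address from `MASTER_ADDR` and `MASTER_PORT` is an assumption about how the job is configured.

```python
import socket
import time


def wait_for_master(host: str, port: int,
                    timeout: float = 300.0, interval: float = 2.0) -> bool:
    """Return True once host:port accepts a TCP connection, or False
    if it stays unreachable until the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Open and immediately close a connection as a reachability probe.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


# Possible usage in a worker's entry point (assumed environment variables):
#   import os
#   ok = wait_for_master(os.environ["MASTER_ADDR"], int(os.environ["MASTER_PORT"]))
#   if not ok:
#       raise RuntimeError("master did not become reachable in time")
```

A probe like this only hides a startup race; if the master never becomes reachable, the cause is more likely DNS, network policy, or a sidecar, and needs to be investigated separately.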

When running the following yaml, ``` apiVersion: kubeflow.org/v1 kind: PyTorchJob metadata: name: my-pytorchjob namespace: my-namespace spec: activeDeadlineSeconds: -1 cleanPodPolicy: Running pytorchReplicaSpecs: Master: replicas: 1 restartPolicy: OnFailure template: metadata: annotations: sidecar.istio.io/inject:...

I'm completely new to Kubeflow, but its main advantage from my perspective is the use of pipelines to set up production-ready, end-to-end machine learning workflows. I'm...

kind/question
area/engprod
priority/p2

Inspired by https://github.com/kubeflow/pipelines/issues/4682, I created a script that generates a config file for Dependabot so that it knows which directories to scan. It will scan the repository for files...

size/L
do-not-merge/hold