pytorch-operator issues

Upgrade to v1 CRDs

1

The [current CRD](https://github.com/kubeflow/pytorch-operator/blob/a502590d8d340186604e695c55b4cc6cea5cee0d/manifests/base/crd.yaml#L1) uses the `apiextensions.k8s.io/v1beta1` API, which is [deprecated](https://kubernetes.io/docs/reference/using-api/deprecation-guide/#customresourcedefinition-v122) and no longer served as of Kubernetes v1.22.

mcristina422
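As a rough illustration of the issue above, the following sketch flags a manifest that still declares the removed CRD API version. It works on an already-parsed manifest (a plain dict); the function name `crd_needs_upgrade` and the sample manifest are hypothetical and are not part of pytorch-operator.

```python
# API group/version for CRDs that Kubernetes stopped serving in v1.22.
REMOVED_CRD_API = "apiextensions.k8s.io/v1beta1"


def crd_needs_upgrade(manifest: dict) -> bool:
    """Return True if the parsed manifest is a CRD on the removed v1beta1 API."""
    return (
        manifest.get("kind") == "CustomResourceDefinition"
        and manifest.get("apiVersion") == REMOVED_CRD_API
    )


# Hypothetical, heavily trimmed manifest used only for illustration.
sample = {
    "apiVersion": "apiextensions.k8s.io/v1beta1",
    "kind": "CustomResourceDefinition",
    "metadata": {"name": "pytorchjobs.kubeflow.org"},
}
print(crd_needs_upgrade(sample))
```

A check like this only detects the old API version; moving to `apiextensions.k8s.io/v1` also requires a structural schema for each served version, which is not shown here.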

PytorchJob replicas have different node affinity behaviors compared with Deployment

4

Hello, dear developers. I found a problem when using PyTorchJob. Problem: I notice that PyTorchJob replica pods don't obey the scheduling rules set in the node affinity. All the...

Shuai-Xie

Unable to spawn PyTorchJob due to image alpine dependency of pytorch-operator

4

Hi Team, I am trying to use the PyTorch Operator to spawn distributed PyTorch jobs. I see that the image mentioned in https://github.com/kubeflow/pytorch-operator/blob/6293efc19503078953acf04df03a1204fd265e35/manifests/kustomization.yaml#L13 is `809251082950.dkr.ecr.us-west-2.amazonaws.com/pytorch-operator`. However, that repo is not...

asahalyft

kind/bug

Can I use PyTorchJobClient inside a pod of the cluster?

1

I get a 403. If it can be used this way, how should I set up the config file? Thanks. `ptc.get(namespace='kubeflow')` Traceback (most recent call last): File "/usr/local/lib/python3.7/site-packages/kubeflow/pytorchjob/api/py_torch_job_client.py", line 134, in get pytorchjob...

omlomloml

Worker template should be configurable.

1

The QoS class of the worker pod created by the operator is Burstable because of the resource config here: https://github.com/kubeflow/pytorch-operator/blob/4aeb6503162465766476519339d3285f75ffe03e/pkg/common/config/config.go#L9-L20 This is a vital issue because only pods with the Guaranteed class can...

MartinForReal
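To make the Burstable versus Guaranteed distinction in the issue above concrete, here is a minimal sketch of how Kubernetes assigns a QoS class from container CPU and memory settings. It is not the operator's code: the function name `qos_class` and the sample resource values are hypothetical, and quantities are compared as plain strings, so equivalent spellings such as `1` and `1000m` are treated as different.

```python
def qos_class(containers):
    """Classify a pod's QoS class from a list of container `resources` dicts.

    Simplified model of the Kubernetes rules:
    - BestEffort: no container sets any CPU or memory request or limit.
    - Guaranteed: every container sets CPU and memory limits, and each
      request equals its limit (a missing request defaults to the limit).
    - Burstable: everything else.
    """
    any_set = False
    guaranteed = True
    for resources in containers:
        requests = resources.get("requests") or {}
        limits = resources.get("limits") or {}
        for name in ("cpu", "memory"):
            if name in requests or name in limits:
                any_set = True
            limit = limits.get(name)
            # A missing request is defaulted to the limit by Kubernetes.
            request = requests.get(name, limit)
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


# Hypothetical pod: the main container is fully specified, but a second
# container sets requests without matching limits.
pod = [
    {"requests": {"cpu": "1", "memory": "1Gi"}, "limits": {"cpu": "1", "memory": "1Gi"}},
    {"requests": {"cpu": "100m", "memory": "64Mi"}},
]
print(qos_class(pod))
```

Under these rules a single container whose requests and limits differ is enough to move the whole pod out of the Guaranteed class, which is why a fixed, non-configurable resource block can matter.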

'host not found' error occurs during PyTorch distributed learning

1

During PyTorchJob distributed learning, the 'Worker' sometimes cannot find the 'Master' and fails with the message below. ``` Traceback (most recent call last): File "/workspace/src/bert/benchmark.py", line 2248, in main() File "/workspace/src/bert/benchmark.py",...

JGoo1

kind/feature

NCCL "Connection Refused" for Worker Pods

1

## Overview Referring to the [Distributed MNIST example](https://github.com/kubeflow/pytorch-operator/tree/master/examples/mnist), I am running into an issue where the worker pods return "call to connect returned Connection refused" repeatedly before crashing with an...

twolffpiggott

Operator has invalid memory address error on specific pytorchjob spec

1

When running the following yaml, ``` apiVersion: kubeflow.org/v1 kind: PyTorchJob metadata: name: my-pytorchjob namespace: my-namespace spec: activeDeadlineSeconds: -1 cleanPodPolicy: Running pytorchReplicaSpecs: Master: replicas: 1 restartPolicy: OnFailure template: metadata: annotations: sidecar.istio.io/inject:...

ca-scribner

Integration into kubeflow pipeline

7

I'm completely new to Kubeflow, but its main advantage from my perspective is the use of pipelines to set up production-ready, end-to-end machine learning workflows. I'm...

miguelvr

kind/question

area/engprod

priority/p2

add dependabot config script

4

Inspired by https://github.com/kubeflow/pipelines/issues/4682, I created a script that generates a config file for Dependabot so that it knows which directories to scan. It will scan the repository for files...

davidspek

size/L

do-not-merge/hold
