pytorch-operator
PyTorch on Kubernetes
Hi, I'm trying to start distributed training with a Kubeflow PyTorchJob. However, `dist.init_process_group(backend="nccl", init_method='tcp://'+args.master_addr+':'+args.master_port, world_size=args.world_size, rank=args.rank)` doesn't work. If I use os.environ["MASTER_PORT"] as the args.master_addr, it says > --...
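As a point of comparison, here is a minimal sketch of environment-variable based initialization. It assumes the operator injects `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK` into each pod's environment (as pytorch-operator does), which lets `init_method="env://"` read them directly instead of hand-building a `tcp://` URL; the `init_distributed` helper name is illustrative, not part of the project.

```python
import os
import torch.distributed as dist

def init_distributed(backend="nccl"):
    # env:// reads MASTER_ADDR and MASTER_PORT from the environment;
    # world_size and rank are passed explicitly here, also from env vars
    # that the operator is assumed to set on every replica.
    dist.init_process_group(
        backend=backend,
        init_method="env://",
        world_size=int(os.environ["WORLD_SIZE"]),
        rank=int(os.environ["RANK"]),
    )
```

For a single-pod smoke test without GPUs, the same helper can be called with `backend="gloo"` after exporting the four variables for a world of size 1.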
I added kube-batch priority support to pytorch-operator. I changed the code of tf-operator and kube-batch in the vendor dir because the code there is not the latest.
The v1beta folder has been renamed to v1, so the path needs updating too. New command: `kubectl create -f ./v1/pytorch_job_mnist_gloo.yaml` Inconsistent documentation
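For orientation, a minimal PyTorchJob manifest under the renamed v1 API might look like the sketch below. This is an assumed outline, not the contents of `./v1/pytorch_job_mnist_gloo.yaml`; in particular the image name and args are placeholders.

```yaml
# Sketch of a v1 PyTorchJob; image and args are illustrative placeholders.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-mnist-gloo
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: example.registry/pytorch-dist-mnist:latest  # placeholder
              args: ["--backend", "gloo"]
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: example.registry/pytorch-dist-mnist:latest  # placeholder
              args: ["--backend", "gloo"]
```

The container named `pytorch` in the Master and Worker replica specs is where the operator injects the distributed-training environment variables.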
While trying these examples, I found that the format of the image addresses was not correct, so I updated them.
… provided PyTorch Docker image provided in PR #255 Updated the GPU-compatible Docker building process with the Kubeflow-provided PyTorch Docker image
Docs changes should not trigger presubmit jobs. This helps improve development efficiency and reduces testing infra cost.
[TorchElastic](https://pytorch.org/elastic/) enables distributed PyTorch training jobs to be executed in a fault tolerant and elastic manner. Use cases: - Fault Tolerance: jobs that run on infrastructure where nodes get replaced...
We should activate the Travis check in presubmit and postsubmit. Also, goveralls should report status to coveralls (see comment: https://github.com/kubeflow/pytorch-operator/pull/293#issuecomment-666045781) and we should migrate to `travis-ci.com` (see comment: https://github.com/kubeflow/pytorch-operator/pull/293#issuecomment-674088221). /cc @johnugeorge...
https://travis-ci.org/github/kubeflow/pytorch-operator/builds Our unit tests are failing but we did not notice. It seems that Travis CI does not show its status on the PR page.
Hello, I am running Kubernetes v1.15.7 and Kubeflow 0.7.0 on a 6-worker-node on-prem cluster; each node has 2 GPUs. The provided mnist.py works fine when running under the...