
PyTorch on Kubernetes

Results 63 pytorch-operator issues

Hi, I'm trying to start distributed training with a Kubeflow PyTorchJob. However, `dist.init_process_group(backend="nccl", init_method='tcp://'+args.master_addr+':'+args.master_port, world_size=args.world_size, rank=args.rank)` doesn't work. If I use os.environ["MASTER_PORT"] as the args.master_addr, it says > --...
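The error message in this report is truncated, so the cause cannot be determined from the listing. As a minimal sketch of the call it describes, the snippet below reads the rendezvous settings from the `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK` environment variables (assumed here to be set on each replica) and passes them to `dist.init_process_group`. The helper names `read_dist_config`, `build_init_method`, and `init_distributed` are hypothetical. Two things worth checking in a setup like this are that the address comes from `MASTER_ADDR` rather than `MASTER_PORT`, and that `world_size` and `rank` are integers.

```python
import os


def read_dist_config(env=None):
    """Collect rendezvous settings from environment variables.

    Assumption: MASTER_ADDR, MASTER_PORT, WORLD_SIZE and RANK are set
    in the container environment of every replica.
    """
    env = os.environ if env is None else env
    return {
        "master_addr": env["MASTER_ADDR"],
        "master_port": env["MASTER_PORT"],
        # init_process_group expects integers for these two values.
        "world_size": int(env["WORLD_SIZE"]),
        "rank": int(env["RANK"]),
    }


def build_init_method(cfg):
    """Build the tcp:// rendezvous URL from address and port."""
    return "tcp://{}:{}".format(cfg["master_addr"], cfg["master_port"])


def init_distributed(backend="nccl", env=None):
    """Initialise the default process group (hypothetical helper)."""
    # Imported lazily so the helpers above work without PyTorch installed.
    import torch.distributed as dist

    cfg = read_dist_config(env)
    dist.init_process_group(
        backend=backend,
        init_method=build_init_method(cfg),
        world_size=cfg["world_size"],
        rank=cfg["rank"],
    )
    return cfg
```

Calling `init_distributed()` at the start of the training script would then replace the hand-built argument string. Passing `backend="gloo"` is an option on nodes without GPUs.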

I added kube-batch priority support to pytorch-operator. I also changed the tf-operator and kube-batch code in the vendor directory, because the vendored code was not the latest.

size/M
ok-to-test

The v1beta folder has been renamed to v1, so the path needs to be updated too. New command: `kubectl create -f ./v1/pytorch_job_mnist_gloo.yaml`. Fixes inconsistent documentation.

size/XS
approved
lgtm
ok-to-test

While trying these examples, I found that the format of the image addresses was incorrect, so I updated them.

size/S
ok-to-test

… provided pytorch Docker Image provided in PR #255. Updated the GPU-compatible Docker build process to use the Kubeflow-provided PyTorch Docker image.

size/XS
needs-ok-to-test

Docs changes should not trigger presubmit jobs. This would help improve development efficiency and reduce testing infrastructure cost.

area/engprod
kind/feature
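The decision this request describes can be sketched as a check over the list of changed files: run presubmit only if at least one changed path is not documentation. The suffix and directory sets below, and the function names `is_docs_path` and `should_run_presubmit`, are hypothetical illustrations rather than the project's actual CI configuration.

```python
from pathlib import PurePosixPath

# Hypothetical definition of "documentation" for this sketch.
DOC_SUFFIXES = {".md", ".rst", ".txt"}
DOC_DIRS = {"docs"}


def is_docs_path(path):
    """Return True if the path looks like a documentation file."""
    p = PurePosixPath(path)
    in_docs_dir = len(p.parts) > 1 and p.parts[0] in DOC_DIRS
    return p.suffix.lower() in DOC_SUFFIXES or in_docs_dir


def should_run_presubmit(changed_paths):
    """Run presubmit unless every changed path is documentation."""
    if not changed_paths:
        # Be conservative when the change list is unknown or empty.
        return True
    return any(not is_docs_path(p) for p in changed_paths)
```

A change touching only `README.md` would be skipped under these assumptions, while a change that also touches a Go source file would still run the jobs.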

[TorchElastic](https://pytorch.org/elastic/) enables distributed PyTorch training jobs to be executed in a fault tolerant and elastic manner. Use cases: - Fault Tolerance: jobs that run on infrastructure where nodes get replaced...

kind/feature

We should activate Travis check in presubmit and postsubmit. Also goveralls should report status for coveralls (See comment: https://github.com/kubeflow/pytorch-operator/pull/293#issuecomment-666045781) and we should migrate to `travis-ci.com` (See comment: https://github.com/kubeflow/pytorch-operator/pull/293#issuecomment-674088221). /cc @johnugeorge...

priority/p1
kind/feature

https://travis-ci.org/github/kubeflow/pytorch-operator/builds Our unit tests are failing, but we did not notice. It seems that Travis CI does not show its status on the PR page.

priority/p0
area/engprod
kind/bug

Hello, I am running Kubernetes v1.15.7 and Kubeflow 0.70 on an on-prem cluster with 6 worker nodes. Each node has 2 GPUs. The provided mnist.py works fine when running under the...

kind/bug