pytorch-operator
PyTorch on Kubernetes
Hi, I'm trying to start distributed training with a Kubeflow PyTorchJob. However, `dist.init_process_group(backend="nccl", init_method='tcp://'+args.master_addr+':'+args.master_port, world_size=args.world_size, rank=args.rank)` doesn't work. If I use os.environ["MASTER_PORT"] as the args.master_addr, it says > --...
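As a point of comparison, here is a minimal sketch of environment-variable based initialization. It assumes the operator injects `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK` into each pod's environment (as pytorch-operator does), which lets `init_method="env://"` read them directly instead of hand-building a `tcp://` URL; the `init_distributed` helper name is illustrative, not part of the project.

```python
import os
import torch.distributed as dist

def init_distributed(backend="nccl"):
    # env:// reads MASTER_ADDR and MASTER_PORT from the environment;
    # world_size and rank are passed explicitly here, also from env vars
    # that the operator is assumed to set on every replica.
    dist.init_process_group(
        backend=backend,
        init_method="env://",
        world_size=int(os.environ["WORLD_SIZE"]),
        rank=int(os.environ["RANK"]),
    )
```

For a single-pod smoke test without GPUs, the same helper can be called with `backend="gloo"` after exporting the four variables for a world of size 1.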
I added kube-batch priority support to pytorch-operator. I changed the code of tf-operator and kube-batch in the vendor dir because the code there is not the latest.
The v1beta folder has been renamed to v1, so the path needs updating too. New command: `kubectl create -f ./v1/pytorch_job_mnist_gloo.yaml` Inconsistent documentation
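For orientation, a minimal PyTorchJob manifest under the renamed v1 API might look like the sketch below. This is an assumed outline, not the contents of `./v1/pytorch_job_mnist_gloo.yaml`; in particular the image name and args are placeholders.

```yaml
# Sketch of a v1 PyTorchJob; image and args are illustrative placeholders.
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-dist-mnist-gloo
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: example.registry/pytorch-dist-mnist:latest  # placeholder
              args: ["--backend", "gloo"]
    Worker:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: example.registry/pytorch-dist-mnist:latest  # placeholder
              args: ["--backend", "gloo"]
```

The container named `pytorch` in the Master and Worker replica specs is where the operator injects the distributed-training environment variables.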
While trying these examples, I found that the format of the image addresses was not correct, so I updated them.
… provided PyTorch Docker image provided in PR #255 Updated the GPU-compatible Docker building process with the Kubeflow-provided PyTorch Docker image
Docs changes should not trigger presubmit jobs. This helps improve development efficiency and reduces testing infra cost.
[TorchElastic](https://pytorch.org/elastic/) enables distributed PyTorch training jobs to be executed in a fault tolerant and elastic manner. Use cases: - Fault Tolerance: jobs that run on infrastructure where nodes get replaced...
We should activate the Travis check in presubmit and postsubmit. Also, goveralls should report status to coveralls (see comment: https://github.com/kubeflow/pytorch-operator/pull/293#issuecomment-666045781) and we should migrate to `travis-ci.com` (see comment: https://github.com/kubeflow/pytorch-operator/pull/293#issuecomment-674088221). /cc @johnugeorge...
https://travis-ci.org/github/kubeflow/pytorch-operator/builds Our unit tests are failing but we did not notice. It seems that Travis CI does not show its status on the PR page.
Hello, I am running Kubernetes v1.15.7 and Kubeflow 0.7.0 on a 6-worker-node on-prem cluster; each node has 2 GPUs. The provided mnist.py works fine when running under the...