pytorch-operator
PyTorch on Kubernetes
Updated the Docker image from pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime to kubeflow/pytorch:1.0-cuda10.0-cudnn7-runtime, because pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime is not GPU compatible. This uses the Docker image from PR #255.
Added a PyTorch CUDA Docker image, because the image pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime does not include CUDA, so examples/mnist.py does not use the GPU. The issue is with the PyTorch image. The new docker image i...
https://github.com/kubeflow/pytorch-operator/blob/master/examples/ddp/mnist/gpu/v1alpha2/job_mnist_DDP_GPU.yaml When launching the MPI backend job examples above with `ENTRYPOINT ["mpirun", "-n", "4", "--allow-run-as-root", "python", "-u", "/opt/pytorch_dist_mnist/mnist_ddp_gpu.py"]` in the Dockerfile, I expected to do distributed training where it launched 1 process on...
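To see how `mpirun -n 4` actually placed the processes, each process can report its own rank. The sketch below assumes Open MPI, which exports the `OMPI_COMM_WORLD_*` variables to every launched process; `mpi_placement` is a hypothetical helper, not part of the example script.

```python
import os


def mpi_placement(env=None):
    """Read rank information that Open MPI's mpirun exports to each process.

    Falls back to a single-process layout when the variables are absent
    (for example when the script is started without mpirun).
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("OMPI_COMM_WORLD_RANK", 0)),
        "world_size": int(env.get("OMPI_COMM_WORLD_SIZE", 1)),
        "local_rank": int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", 0)),
    }


if __name__ == "__main__":
    # Printed once per process; four lines sharing one hostname would mean
    # all four ranks were started inside the same container.
    print(os.uname().nodename, mpi_placement())
```

If every pod prints ranks 0 through 3, each pod has started its own independent 4-process world rather than joining a shared one.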
Hi. Currently, when a new `PytorchJob` is created, the Kubernetes API only checks whether it matches the [schema defined in the `CRD`](https://github.com/kubeflow/pytorch-operator/blob/master/manifests/crd.yaml#L21), but the "real" validation is done in...
There is a test image naming conflict: both `pytorch-dist-mnist_test` and `pytorch-dist-mnist-test` are used. There is also a version conflict: both `1.0` and `v1.0` are used. We should unify the image naming.
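A small normalizer is one way to catch such mixed spellings. The canonical form used here (hyphens in the image name, a `v` prefix on numeric tags) is an assumption for illustration only; the issue does not say which spelling should win, and `normalize_image` is a hypothetical helper.

```python
def normalize_image(ref):
    """Normalize an image reference to hyphenated names and 'v'-prefixed tags.

    Only the last path component is touched, so registry hosts and
    ports (e.g. 'localhost:5000/...') are left alone.
    """
    prefix, slash, last = ref.rpartition("/")
    name, colon, tag = last.partition(":")
    name = name.replace("_", "-")
    # Assumed convention: numeric tags such as '1.0' become 'v1.0'.
    if tag and tag[0].isdigit():
        tag = "v" + tag
    return prefix + slash + name + colon + tag


if __name__ == "__main__":
    for ref in ("pytorch-dist-mnist_test:1.0", "pytorch-dist-mnist-test:v1.0"):
        print(ref, "->", normalize_image(ref))
```

Both spellings from the issue map to the same reference, so a check that compares normalized names would flag the duplicates.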
I am trying to deploy distributed MNIST training on EKS with the MPI backend. However, it seems the master node does not work, failing with the message "MPI process group does not...
## Requirements
### Configuration and deployment
Description | Category | Status | Issue
-- | -- | -- | --
Kustomize package | Required | Done |
Application CR |...
Related: #214
The PyTorch Docker image pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime used in the examples/mnist Dockerfile cannot use the GPU for the mnist.py example; CUDA availability is always reported as False. This is because CUDA is not installed properly. kubectl...
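A quick way to confirm this inside a container is to ask PyTorch directly. The sketch below is a minimal check, not taken from the issue; `cuda_report` is a hypothetical helper, and it uses only `torch.cuda.is_available()` and `torch.version.cuda`.

```python
def cuda_report():
    """Summarize whether this Python environment can see CUDA through PyTorch."""
    try:
        import torch
    except ImportError:
        # PyTorch itself is missing, so CUDA cannot be used either.
        return {"torch_installed": False, "cuda_available": False, "cuda_build": None}
    return {
        "torch_installed": True,
        # False when no usable GPU/driver is visible or the build is CPU-only.
        "cuda_available": bool(torch.cuda.is_available()),
        # CUDA version the wheel was built against; None for CPU-only builds.
        "cuda_build": torch.version.cuda,
    }


if __name__ == "__main__":
    print(cuda_report())
```

A `cuda_build` of `None` points at the image itself, while a non-empty `cuda_build` with `cuda_available` set to `False` more likely points at the node's driver or the pod's GPU resource request.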
Failed to set up kubeflow in the CI test.
```
level=error msg="validating registry URL: validating GitHub registry URL: \"https://github.com/kubeflow/kubeflow/tree/master/kubeflow/registry.yaml\" actual 404; expected 200"
```
That's caused by kubeflow/kubeflow#4484, the ksonnet registry...