pytorch-operator

PyTorch on Kubernetes

Results 63 pytorch-operator issues

Updated the Docker image from pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime to kubeflow/pytorch:1.0-cuda10.0-cudnn7-runtime, because pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime is not GPU compatible. Hence using the Docker image from PR #255.

size/XS
needs-ok-to-test

Added a PyTorch CUDA Docker image, because the image pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime does not include CUDA, so examples/mnist.py is not using the GPU. The issue is with the PyTorch image. The new docker image i...

size/M
needs-ok-to-test

https://github.com/kubeflow/pytorch-operator/blob/master/examples/ddp/mnist/gpu/v1alpha2/job_mnist_DDP_GPU.yaml When launching the MPI backend job example above with `ENTRYPOINT ["mpirun", "-n", "4", "--allow-run-as-root", "python", "-u", "/opt/pytorch_dist_mnist/mnist_ddp_gpu.py"]` in the Dockerfile, I expected to do distributed training where it launched 1 process on...

priority/p2
area/operator
kind/feature

Hi. Currently, when a new `PytorchJob` is created, the Kubernetes API only checks whether it matches the [schema defined in the `CRD`](https://github.com/kubeflow/pytorch-operator/blob/master/manifests/crd.yaml#L21), but the "real" validation is done in...

priority/p2
kind/feature
area/0.7.0

There is a test image naming conflict: both `pytorch-dist-mnist_test` and `pytorch-dist-mnist-test` are used. There is also a version conflict: both `1.0` and `v1.0` are used. We should unify image naming.
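One way to catch this kind of drift is to normalize image references before comparing them. The following is only a sketch: `normalize_image_ref` is a hypothetical helper, and the target convention (hyphens in the image name, a `v` prefix on numeric tags) is an assumption, since the issue does not say which spelling should win.

```python
import re


def normalize_image_ref(ref: str) -> str:
    """Normalize an image reference to one assumed convention.

    Assumed convention (not taken from the issue): hyphens instead of
    underscores in the image name, and a leading "v" on numeric tags.
    """
    name, sep, tag = ref.rpartition(":")
    # No colon, or the colon belongs to a registry port such as "host:5000/img".
    if not sep or "/" in tag:
        name, tag = ref, ""

    # Only rewrite the last path component; leave any registry/org prefix alone.
    repo, slash, image = name.rpartition("/")
    image = image.replace("_", "-")
    name = f"{repo}{slash}{image}"

    # Prefix purely numeric tags ("1.0") with "v"; leave other tags untouched.
    if tag and re.fullmatch(r"\d+(\.\d+)*", tag):
        tag = "v" + tag

    return f"{name}:{tag}" if tag else name


if __name__ == "__main__":
    for ref in ("pytorch-dist-mnist_test:1.0", "pytorch-dist-mnist-test:v1.0"):
        print(ref, "->", normalize_image_ref(ref))
```

Under these assumptions, both spellings mentioned in the issue map to the same reference, so a CI check could flag any reference that changes when normalized.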

area/engprod
priority/p2
kind/bug

I am trying to deploy distributed MNIST training on EKS with the MPI backend. However, it seems the master node does not work, failing with the message "MPI process group does not...

area/engprod
priority/p2
kind/feature

## Requirements

### Configuration and deployment

Description | Category | Status | Issue
-- | -- | -- | --
Kustomize package | Required | Done |
Application CR |...

area/engprod
priority/p2
kind/feature

Related: #214

area/engprod
priority/p2
kind/feature

The PyTorch Docker image pytorch/pytorch:1.0-cuda10.0-cudnn7-runtime used in the examples/mnist Dockerfile cannot use the GPU for the mnist.py example; it always reports CUDA as False. This is because CUDA is not installed properly kubectl...
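A quick way to reproduce the symptom inside a container is to ask PyTorch directly whether it can see CUDA. This is a minimal sketch, not the check used in mnist.py; the helper name `cuda_available` is made up for illustration, and the import is guarded so the script also runs where PyTorch is absent.

```python
def cuda_available() -> bool:
    """Report whether PyTorch can use CUDA in the current environment."""
    try:
        import torch  # imported lazily so the script runs without PyTorch
    except ImportError:
        # Assumption for this sketch: no PyTorch is treated as "no CUDA".
        return False
    return bool(torch.cuda.is_available())


if __name__ == "__main__":
    print("CUDA available:", cuda_available())
```

Running this in the image under test and seeing `False` on a GPU node would match the behavior described in the issue.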

Failed to set kubeflow in CI test.

```
level=error msg="validating registry URL: validating GitHub registry URL: \"https://github.com/kubeflow/kubeflow/tree/master/kubeflow/registry.yaml\" actual 404; expected 200"
```

That's caused by kubeflow/kubeflow#4484, the ksonnet registry...