pytorch-operator
PyTorch on Kubernetes
kubeflow/common has released a stable version, 0.3.1, so we can migrate to the kubeflow/common implementation. The change will be similar to https://github.com/kubeflow/tf-operator/pull/1171. It would be better to resolve dependencies,...
https://github.com/kubeflow/pytorch-operator/blob/master/manifests/deployment.yaml#L22 I think we can remove it, since I do not think we use it. /cc @johnugeorge WDYT?
Why not set a large timeout in `torch.distributed.init_process_group(backend, init_method=None, timeout=datetime.timedelta(0, 1800), world_size=-1, rank=-1, store=None, group_name='')`? What is the purpose of adding this?
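To make the question concrete, here is a minimal sketch of passing a larger timeout through the `timeout` parameter shown in the signature above. The helper name `build_init_kwargs` and the 120-minute value are hypothetical, chosen only for illustration; the actual `init_process_group` call is left as a comment because it needs PyTorch and a configured rendezvous.

```python
import datetime

# Default from the quoted signature: timedelta(days=0, seconds=1800), i.e. 30 minutes.
DEFAULT_TIMEOUT = datetime.timedelta(0, 1800)


def build_init_kwargs(backend, timeout_minutes):
    """Build keyword arguments for torch.distributed.init_process_group.

    Hypothetical helper: it only assembles the arguments, it does not
    initialize anything.
    """
    return {
        "backend": backend,
        "timeout": datetime.timedelta(minutes=timeout_minutes),
    }


kwargs = build_init_kwargs("gloo", 120)
print(DEFAULT_TIMEOUT)    # 0:30:00
print(kwargs["timeout"])  # 2:00:00

# With PyTorch installed and the rendezvous environment variables set,
# the call would look like this:
#   import torch.distributed as dist
#   dist.init_process_group(**kwargs)
```

Whether a larger timeout alone covers the case this issue is about depends on why the workers wait in the first place, which the preview above does not say.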
TensorFlow and PyTorch use branches rather than tags for dependency management. Since we may make some breaking changes in the repo, I would suggest cutting a release and tags...
Currently, the backoff retries in all replicas are controlled separately. Replicas are either restarted directly by the pod controller (`restartPolicy` is propagated from `ReplicaSpec` to `PodTemplateSpec`) https://github.com/kubeflow/pytorch-operator/blob/037cd1b18eb77f657f2a4bc8a8334f2a06324b57/pkg/controller.v1/pytorch/pod.go#L283-L289 or restarted in the pytorchjob...
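The propagation mentioned above can be sketched roughly as follows. This is a Python illustration, not the operator's Go code: the function name and the dict layout are hypothetical, and the special case for an operator-level `ExitCode` policy (mapped to `Never` so that the operator, not the kubelet, decides on restarts) is an assumption that the issue text does not confirm.

```python
def propagate_restart_policy(replica_spec, pod_template_spec):
    """Copy the replica-level restart policy onto the pod template.

    Hypothetical sketch. Assumption: "ExitCode" is an operator-level policy
    that Kubernetes pods do not understand, so the pod gets "Never" and the
    operator handles restarts itself.
    """
    policy = replica_spec.get("restartPolicy", "")
    if policy == "ExitCode":
        pod_template_spec["spec"]["restartPolicy"] = "Never"
    else:
        pod_template_spec["spec"]["restartPolicy"] = policy
    return pod_template_spec


# Each replica type carries its own policy, so retries are decided per replica.
master = propagate_restart_policy({"restartPolicy": "OnFailure"}, {"spec": {}})
worker = propagate_restart_policy({"restartPolicy": "ExitCode"}, {"spec": {}})
print(master["spec"]["restartPolicy"])  # OnFailure
print(worker["spec"]["restartPolicy"])  # Never
```

Because each pod template gets its own policy, nothing in this path tracks a job-wide retry count, which is the gap the issue describes.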
I set up Kubeflow v0.6.0 on Microk8s v1.17. After executing `kubectl create -f pytorch_job_mnist_gloo.yaml` from the example, I can see that the PyTorchJob was created, but no events are happening on it and no...
The example directory changed from v1beta1 to v1. The README.md file still uses the old path and needs to be updated to v1:
```bash
kubectl create -f ./v1beta1/pytorch_job_mnist_gloo.yaml
```
`examples/smoke-dist/README.md` documents `pytorch_job_sendrecv.yaml`, but that file does not exist in `examples/smoke-dist`.
Taking [this fix](https://github.com/kubeflow/common/issues/54) into account, the PyTorch operator should now [depend on](https://github.com/kubeflow/pytorch-operator/blob/ec39dce0f98136ed89668c14130347616b463da5/pkg/apis/pytorch/v1/types.go#L18) `github.com/kubeflow/common/pkg/apis/common/v1` instead of `github.com/kubeflow/common/job_controller/api/v1`.
https://github.com/kubeflow/pytorch-operator/blob/047cf0f41e68e030158f532017a226c18827a660/pkg/controller.v1/pytorch/job.go#L160 We simply ignore the running policy for now.