pytorch-operator

PyTorch on Kubernetes

Results 63 pytorch-operator issues

I started using the python sdk with the intent of making it into a kubeflow pipelines launcher, but noticed some mismatch between the pytorchjob sdk and kubernetes. Little stuff like:...

When the master is finished running, the worker is still initializing. ![1629019887(1)](https://user-images.githubusercontent.com/51263129/129474058-22d7190f-e8e6-4ff3-a623-ed5d1dbcd7cd.jpg) worker log: Error from server (BadRequest): container "pytorch" in pod "xxx-jxosi-worker-0" is waiting to start: PodInitializing What is the...

kind/bug

Signed-off-by: bert.li. pytorch-operator creates a podgroup when it reconciles a tfjob, but the queue of the created podgroup is always "default". I want to create the podgroup with a queue, as mpi-operator does, so 1. create pytorchjob...

size/M
needs-ok-to-test

I am trying to figure out why Bare Metal (BM) and PytorchJob (PJ) give different training results in https://github.com/kubeflow/pytorch-operator/issues/354#issue-999999536. I have now found that PytorchJob v1.8.0 and 1.9.0 have different training...

Dear developers, I have run into a new problem. I compared the DDP training process of PytorchJob (PJ) and Bare Metal (BM) and got different training results. ## Experiment settings - Two...

Hello, dear developers. I have a question about using PytorchJob. Can PytorchJob skip or cancel the init container?

The E2E test is down. The reason is straightforward: the server reports a 503 error. I did some checking and noticed that this has been tracked in the torch community. As the patch is...

volcano (from v0.4) changed the PodGroup CRD APIGroup to `scheduling.volcano.sh`, but when I create a pytorchjob with gang-scheduling, it creates a podgroup whose APIGroup is `scheduling.incubator.k8s.io`

Ref https://github.com/pytorch/elastic/issues/117