training-operator
KEP-2170: Create PyTorch multi-node distributed training runtime
Related: https://github.com/kubeflow/training-operator/issues/2170
We should create a `ClusterTrainingRuntime` for PyTorch multi-node distributed training.
/area runtime
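For reference, a minimal sketch of what such a runtime manifest might look like, assuming the `ClusterTrainingRuntime` API proposed in KEP-2170. The API group, version, and field names below (`mlPolicy`, `numProcPerNode`, the JobSet-style `replicatedJobs` template) are illustrative guesses based on the proposal and may differ from the final implementation:

```yaml
# Sketch only: field names and API version are assumptions, not the final API.
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed
spec:
  mlPolicy:
    numNodes: 2                # number of training nodes
    torch:
      numProcPerNode: auto     # e.g. one process per GPU on each node
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            spec:
              template:
                spec:
                  containers:
                    - name: trainer
                      image: pytorch/pytorch:latest  # placeholder image
```

A user-facing TrainJob would then reference this cluster-scoped runtime by name (e.g. via a `runtimeRef` field), so individual users never have to spell out the multi-node topology themselves.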
I'm learning training-operator v1 and would like to work on this issue. Please give me some suggestions.
/assign