KEP-2170: Create PyTorch multi-node distributed training runtime

Open andreyvelich opened this issue 1 year ago • 1 comments

Related: https://github.com/kubeflow/training-operator/issues/2170

We should create a ClusterTrainingRuntime for PyTorch multi-node distributed training.

/area runtime

andreyvelich, Aug 14 '24 15:08

I'm learning training-operator v1 and would like to work on this issue. Please give me some suggestions.

yang20150702, Aug 28 '24 08:08

/assign

deepanker13, Oct 30 '24 07:10