training-operator
KEP-2170: Create PyTorch multi-node distributed training runtime
Related: https://github.com/kubeflow/training-operator/issues/2170
We should create a `ClusterTrainingRuntime` for PyTorch multi-node distributed training.
/area runtime
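For reference, a minimal sketch of what such a runtime manifest might look like, assuming the `ClusterTrainingRuntime` API proposed in KEP-2170. The API group, version, and field names below (`mlPolicy`, `numProcPerNode`, the JobSet-style `replicatedJobs` template) are illustrative guesses based on the proposal and may differ from the final implementation:

```yaml
# Sketch only: field names and API version are assumptions, not the final API.
apiVersion: kubeflow.org/v2alpha1
kind: ClusterTrainingRuntime
metadata:
  name: torch-distributed
spec:
  mlPolicy:
    numNodes: 2                # number of training nodes
    torch:
      numProcPerNode: auto     # e.g. one process per GPU on each node
  template:
    spec:
      replicatedJobs:
        - name: node
          template:
            spec:
              template:
                spec:
                  containers:
                    - name: trainer
                      image: pytorch/pytorch:latest  # placeholder image
```

A user-facing TrainJob would then reference this cluster-scoped runtime by name (e.g. via a `runtimeRef` field), so individual users never have to spell out the multi-node topology themselves.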
I'm learning training-operator v1 and would like to work on this issue. Please give me some suggestions.
/assign