plural-artifacts icon indicating copy to clipboard operation
plural-artifacts copied to clipboard

Implement PyTorch controller for multi node GPU scaling

Open jaystary opened this issue 3 years ago • 1 comments

Use Case

We want to utilize multi node GPU scaling for PyTorch for a benchmark / potential larger scale model training

Ideas of Implementation

Implement KF Training Operator which as an added benefit should unlock all the relevant frameworks as well. https://github.com/kubeflow/training-operator


Message from the maintainers:

Excited about this feature? Give it a :thumbsup:. We factor engagement into prioritization.

jaystary avatar Feb 10 '22 22:02 jaystary

The PyTorch operator resource can already be used in the current Kubeflow deployment, so there's no need to pause development while we update the deployment to the training operator.

davidspek avatar Feb 14 '22 13:02 davidspek