plural-artifacts
plural-artifacts copied to clipboard
Implement PyTorch controller for multi node GPU scaling
Use Case
We want to utilize multi node GPU scaling for PyTorch for a benchmark / potential larger scale model training
Ideas of Implementation
Implement KF Training Operator which as an added benefit should unlock all the relevant frameworks as well. https://github.com/kubeflow/training-operator
Message from the maintainers:
Excited about this feature? Give it a :thumbsup:. We factor engagement into prioritization.
The PyTorch operator resource can already be used in the current Kubeflow deployment, so there's no need to pause development while we update the deployment to the training operator.