
PVC creation as part of PyTorch job spec

Open deepanker13 opened this issue 2 years ago • 8 comments

To avoid granting unnecessary permissions to the Kubeflow user and to keep the PyTorchJob API general, we need to introduce API changes so that the following can be set in a PyTorchJob:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorchjob-yaml
  namespace: notebooks-test
spec:
  pytorchReplicaSpecs:
     .....
  storageSpec:
     storageClassName: my-sc
     resources:
        requests:
           storage: 8Gi

So the parameter looks like this:

type PyTorchJobSpec struct {
  StorageSpec *corev1.PersistentVolumeClaimSpec `json:"storageSpec,omitempty"`
}
deepanker13 avatar Dec 21 '23 18:12 deepanker13

I don't think we should introduce a bespoke API for this, since such changes will increase maintenance costs. Instead, I would suggest reusing the K8s core API's volumeClaimTemplates, as StatefulSet does:

https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#volume-claim-templates
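For illustration, a PyTorchJob reusing the StatefulSet-style field might look like the sketch below. The `volumeClaimTemplates` field on PyTorchJob is hypothetical (it does not exist in the v1 API); the nested structure mirrors the StatefulSet convention, where each template is a PVC name plus a standard `PersistentVolumeClaimSpec`:

```yaml
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorchjob-yaml
spec:
  pytorchReplicaSpecs:
    # ...
  volumeClaimTemplates:          # hypothetical field, mirroring StatefulSet
  - metadata:
      name: dataset
    spec:
      storageClassName: my-sc
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 8Gi
```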

cc: @andreyvelich @johnugeorge

tenzen-y avatar Dec 21 '23 21:12 tenzen-y

@tenzen-y So your suggestion is to use volumeClaimTemplates in the PyTorchJob .spec to orchestrate PVCs as part of our Training Operator controller loop, as discussed here: https://github.com/kubeflow/training-operator/pull/1962#discussion_r1428168453 ? Do we have a use case where we need to create more than one PVC for a Distributed Training Job?

andreyvelich avatar Jan 04 '24 16:01 andreyvelich

So your suggestion is to use volumeClaimTemplates in the PyTorchJob .spec to orchestrate PVCs as part of our Training Operator controller loop, as discussed here: https://github.com/kubeflow/training-operator/pull/1962#discussion_r1428168453 ?

Yes, that's right.

Do we have a use case where we need to create more than one PVC for a Distributed Training Job?

@andreyvelich We can imagine that users want to create volumes backed by different storage classes. There is a case where we want to create large, slower volumes for downloading datasets and small, faster volumes for the uncompressed data.
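With StatefulSet-style volumeClaimTemplates, that two-tier setup could be expressed as two templates with different storage classes. This is a hypothetical sketch: the `volumeClaimTemplates` field on PyTorchJob does not exist yet, and the storage class names (`hdd-sc`, `nvme-sc`) and sizes are invented for illustration:

```yaml
volumeClaimTemplates:
- metadata:
    name: raw-dataset            # large, slower storage for downloading datasets
  spec:
    storageClassName: hdd-sc     # assumed storage class name
    accessModes: ["ReadWriteOnce"]
    resources:
      requests:
        storage: 500Gi
- metadata:
    name: scratch                # small, faster storage for uncompressed data
  spec:
    storageClassName: nvme-sc    # assumed storage class name
    accessModes: ["ReadWriteOnce"]
    resources:
      requests:
        storage: 50Gi
```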

tenzen-y avatar Jan 25 '24 18:01 tenzen-y

There is a case where we want to create large, slower volumes for downloading datasets and small, faster volumes for the uncompressed data.

@tenzen-y In that case, the user would typically run some pre-processing on the PyTorch Master Pod to prepare the uncompressed data for the workers, and then distribute that data using the small, faster storage?

andreyvelich avatar Jan 25 '24 20:01 andreyvelich

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Apr 25 '24 00:04 github-actions[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions[bot] avatar May 15 '24 00:05 github-actions[bot]

/reopen

deepanker13 avatar May 15 '24 05:05 deepanker13

@deepanker13: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

google-oss-prow[bot] avatar May 15 '24 05:05 google-oss-prow[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 13 '24 10:08 github-actions[bot]

This will be implemented in the Kubeflow V2 APIs as part of https://github.com/kubernetes-sigs/jobset/issues/572.

andreyvelich avatar Aug 13 '24 13:08 andreyvelich