
PVC creation as part of PyTorch job spec

Open deepanker13 opened this issue 2 years ago • 8 comments

To avoid granting unnecessary permissions to the Kubeflow user and to keep the PyTorchJob API general, we need to introduce API changes so that the following can be set in a PyTorchJob:

apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorchjob-yaml
  namespace: notebooks-test
spec:
  pytorchReplicaSpecs:
     .....
  storageSpec:
     storageClassName: my-sc
     resources:
        requests:
           storage: 8Gi

So the parameter looks like this:

type PyTorchJobSpec struct {
  StorageSpec *corev1.PersistentVolumeClaimSpec `json:"storageSpec,omitempty"`
}
deepanker13 avatar Dec 21 '23 18:12 deepanker13

I don't think we should introduce a bespoke API for this, since such changes will increase maintenance costs. Instead, I would suggest reusing the K8s core API's volumeClaimTemplates, as StatefulSet does:

https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#volume-claim-templates
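For illustration, a PyTorchJob reusing the StatefulSet-style field might look like the sketch below. The `volumeClaimTemplates` field on PyTorchJob is hypothetical (it does not exist in the v1 API); the nested structure mirrors the StatefulSet convention, where each template is a PVC name plus a standard `PersistentVolumeClaimSpec`:

```yaml
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorchjob-yaml
spec:
  pytorchReplicaSpecs:
    # ...
  volumeClaimTemplates:          # hypothetical field, mirroring StatefulSet
  - metadata:
      name: dataset
    spec:
      storageClassName: my-sc
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 8Gi
```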

cc: @andreyvelich @johnugeorge

tenzen-y avatar Dec 21 '23 21:12 tenzen-y

@tenzen-y So your suggestion is to use volumeClaimTemplates in the PyTorchJob .spec to orchestrate PVCs as part of our Training Operator controller loop, as discussed here: https://github.com/kubeflow/training-operator/pull/1962#discussion_r1428168453 ? Do we have a use case where we need to create more than one PVC for a Distributed Training Job?

andreyvelich avatar Jan 04 '24 16:01 andreyvelich

So your suggestion is to use volumeClaimTemplates in the PyTorchJob .spec to orchestrate PVCs as part of our Training Operator controller loop, as discussed here: https://github.com/kubeflow/training-operator/pull/1962#discussion_r1428168453 ?

Yes, that's right.

Do we have a use case where we need to create more than one PVC for a Distributed Training Job?

@andreyvelich We can imagine that users want to create volumes backed by different storage classes. There is a case where we want to create large, slower volumes for downloading datasets and small, faster volumes for the uncompressed data.
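With StatefulSet-style volumeClaimTemplates, that two-tier setup could be expressed as two templates with different storage classes. This is a hypothetical sketch: the `volumeClaimTemplates` field on PyTorchJob does not exist yet, and the storage class names (`hdd-sc`, `nvme-sc`) and sizes are invented for illustration:

```yaml
volumeClaimTemplates:
- metadata:
    name: raw-dataset            # large, slower storage for downloading datasets
  spec:
    storageClassName: hdd-sc     # assumed storage class name
    accessModes: ["ReadWriteOnce"]
    resources:
      requests:
        storage: 500Gi
- metadata:
    name: scratch                # small, faster storage for uncompressed data
  spec:
    storageClassName: nvme-sc    # assumed storage class name
    accessModes: ["ReadWriteOnce"]
    resources:
      requests:
        storage: 50Gi
```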

tenzen-y avatar Jan 25 '24 18:01 tenzen-y

There is a case where we want to create large, slower volumes for downloading datasets and small, faster volumes for the uncompressed data.

@tenzen-y In that case, the user would typically run some pre-processing on the PyTorch Master Pod to prepare the uncompressed data for the workers, and then distribute that data using the small, faster storage?

andreyvelich avatar Jan 25 '24 20:01 andreyvelich

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Apr 25 '24 00:04 github-actions[bot]

This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.

github-actions[bot] avatar May 15 '24 00:05 github-actions[bot]

/reopen

deepanker13 avatar May 15 '24 05:05 deepanker13

@deepanker13: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

google-oss-prow[bot] avatar May 15 '24 05:05 google-oss-prow[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Aug 13 '24 10:08 github-actions[bot]

This will be implemented in the Kubeflow V2 APIs as part of https://github.com/kubernetes-sigs/jobset/issues/572.

andreyvelich avatar Aug 13 '24 13:08 andreyvelich