PVC creation as part of PyTorch job spec
To avoid granting unnecessary permissions to the Kubeflow user, and to keep this general across the PyTorchJob API, we need to introduce API changes so that the following can be set in a PyTorchJob:
```yaml
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorchjob-yaml
  namespace: notebooks-test
spec:
  pytorchReplicaSpecs:
    .....
  storageSpec:
    storageClassName: my-sc
    resources:
      requests:
        storage: 8Gi
```
So the new field would look like this:

```go
type PyTorchJobSpec struct {
	// ... existing fields ...

	// StorageSpec describes the PVC to create for the job.
	StorageSpec *corev1.PersistentVolumeClaimSpec `json:"storageSpec,omitempty"`
}
```
I don't think we should introduce a bespoke API for this, since such changes would increase maintenance costs.
So I would suggest reusing the K8s core API's volumeClaimTemplates, as StatefulSet does:
https://kubernetes.io/docs/concepts/workloads/controllers/statefulset/#volume-claim-templates
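As a sketch of this suggestion (the `volumeClaimTemplates` field on PyTorchJob is hypothetical; its name and shape here simply mirror StatefulSet's API), a PyTorchJob could look like:

```yaml
# Hypothetical sketch: volumeClaimTemplates does not exist on PyTorchJob yet.
# The field name and structure are assumed, mirroring StatefulSet.
apiVersion: "kubeflow.org/v1"
kind: PyTorchJob
metadata:
  name: pytorchjob-yaml
  namespace: notebooks-test
spec:
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: my-sc
        resources:
          requests:
            storage: 8Gi
  pytorchReplicaSpecs:
    .....
```

The controller would create one PVC per template and mount it into the replica Pods, the same way the StatefulSet controller does.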
cc: @andreyvelich @johnugeorge
@tenzen-y So your suggestion is to use volumeClaimTemplates in the PyTorchJob .spec to orchestrate PVCs as part of our Training Operator controller loop, as discussed here: https://github.com/kubeflow/training-operator/pull/1962#discussion_r1428168453 ?
Do we have a use case where we need to create more than one PVC for our distributed training job?
> So your suggestion is to use volumeClaimTemplates in the PyTorchJob .spec to orchestrate PVCs as part of our Training Operator controller loop, as discussed here: https://github.com/kubeflow/training-operator/pull/1962#discussion_r1428168453 ?
Yes, that's right.
> Do we have a use case where we need to create more than one PVC for our distributed training job?
@andreyvelich We can imagine that users want to create volumes backed by different storage classes. For example, there is a case where we want a large, slower volume for downloading big datasets and a small, faster volume for holding the uncompressed data.
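As an illustrative sketch of that use case (the `volumeClaimTemplates` field on PyTorchJob is still hypothetical, and the storage class names are made up), two templates with different storage classes might look like:

```yaml
# Hypothetical fragment: two PVC templates with different storage classes.
spec:
  volumeClaimTemplates:
    - metadata:
        name: raw-datasets          # large, slower storage for dataset downloads
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: slow-hdd  # assumed storage class name
        resources:
          requests:
            storage: 500Gi
    - metadata:
        name: scratch               # small, faster storage for uncompressed data
      spec:
        accessModes: ["ReadWriteOnce"]
        storageClassName: fast-ssd  # assumed storage class name
        resources:
          requests:
            storage: 50Gi
```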
> For example, there is a case where we want a large, slower volume for downloading big datasets and a small, faster volume for holding the uncompressed data.
@tenzen-y In that case, the user basically does some pre-processing on the PyTorch master Pod to prepare the uncompressed data for the workers, and distributes this data using the small, faster storage?
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it has not had recent activity. Please comment "/reopen" to reopen it.
/reopen
@deepanker13: Reopened this issue.
In response to this:
> /reopen
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This will be implemented in Kubeflow V2 APIs as part of this: https://github.com/kubernetes-sigs/jobset/issues/572.