kubedl
[feature request] Define DataSet API to allow users to specify more options of data input for training
Currently, KubeDL workloads require users to write the PVC and volumeMounts config to consume the source dataset. This also implies that users must first put the dataset into a PV. This incurs heavy overhead -- especially for users who do not produce the dataset on the same infrastructure. For example, users who created the dataset locally might not know how to put the data into a PV in a cloud-managed k8s cluster.
To solve this problem and improve the user experience, we should define a DataSet API that lets users specify more data input options for training -- S3 buckets, NAS storage, HTTP file servers, etc. KubeDL controllers should handle PV creation and the other k8s internals needed to store and transfer the data under the hood.
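To make the proposal more concrete, here is a rough sketch of what the spec types might look like in Go. Nothing here exists in KubeDL today: `DataSetSpec`, `DataSetSource`, `S3Source`, `NASSource`, `HTTPSource`, and all field names are hypothetical and only meant as a starting point for discussion. The sketch assumes a dataset has exactly one source and a path where the data shows up inside the training containers.

```go
package main

import (
	"errors"
	"fmt"
)

// All types below are hypothetical; they are not part of the KubeDL API.

// S3Source points at objects in an S3 bucket.
type S3Source struct {
	Bucket string
	Prefix string
}

// NASSource points at a directory exported by a NAS server.
type NASSource struct {
	Server string
	Path   string
}

// HTTPSource points at a file served over HTTP.
type HTTPSource struct {
	URL string
}

// DataSetSource is a union of input locations; exactly one field should be set.
type DataSetSource struct {
	S3   *S3Source
	NAS  *NASSource
	HTTP *HTTPSource
}

// DataSetSpec describes where the data comes from and where the
// training containers expect to find it.
type DataSetSpec struct {
	Source    DataSetSource
	MountPath string
}

// Kind returns the name of the single configured source, or an error
// if zero or more than one source is set.
func (s DataSetSource) Kind() (string, error) {
	var kinds []string
	if s.S3 != nil {
		kinds = append(kinds, "s3")
	}
	if s.NAS != nil {
		kinds = append(kinds, "nas")
	}
	if s.HTTP != nil {
		kinds = append(kinds, "http")
	}
	switch len(kinds) {
	case 0:
		return "", errors.New("no data source specified")
	case 1:
		return kinds[0], nil
	default:
		return "", fmt.Errorf("exactly one data source allowed, got %v", kinds)
	}
}

// Validate checks the spec before a controller would act on it.
func (d DataSetSpec) Validate() error {
	if _, err := d.Source.Kind(); err != nil {
		return err
	}
	if d.MountPath == "" {
		return errors.New("mountPath must not be empty")
	}
	return nil
}
```

With something like this, a user would only declare, say, `DataSetSpec{Source: DataSetSource{S3: &S3Source{Bucket: "my-bucket"}}, MountPath: "/data"}`, and the controller could switch on `Kind()` to decide how to provision the PV/PVC and move the data. The one-of layout mirrors how Kubernetes models volume sources, but whether it is the right shape here is open for discussion.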
/assign @SimonCqk