clearml
[Feature Request] Tag external datasets
Sometimes datasets cannot be added to clearml because they need to stay on an external source. However, one can add a reference (e.g. a link) to a clearml dataset and use this clearml dataset as a proxy.
Request: Add a special tag to such datasets to show that they reference an external dataset and that immutability is not guaranteed.
Reference slack thread: https://clearml.slack.com/archives/CTK20V944/p1643204024027749
Hey @mctigger, just want to make sure I got everything from the Slack channel and your message above:
We basically want to add functionality to clearml-data where, instead of uploading files to the fileserver, we'll upload a file containing links to the files you'd like clearml-data to version. On the other side, when you call get_local_copy() on a dataset, clearml-data will look at the links and download the files they point to.
A few pointers:
- The interface should probably look like this: clearml-data add --link s3://path_to_file.py s3://path_to_folder/
- Supported storage backends would be those our storage-manager currently supports (s3, gs, azure, local, minio)
- Objects are mutable, so it's up to the user to ensure the data does not change
- We will tag dataset tasks that contain "links" instead of actual files
Makes sense? Did I forget anything?
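To make the proposal above concrete, here is a minimal sketch of the "file containing links" idea in plain Python. It is not clearml's implementation and does not call the clearml API: the function names, the manifest layout, the `"links"` tag name, and the set of URL schemes (including the assumption that MinIO is addressed via `s3://`) are all hypothetical and chosen only for illustration.

```python
import json
from urllib.parse import urlparse

# Assumed URL schemes for the backends named in the thread
# (s3, gs, azure, local, minio). MinIO is assumed to use s3://.
SUPPORTED_SCHEMES = {"s3", "gs", "azure", "file"}


def build_link_manifest(links):
    """Map each link to its storage scheme and whether it looks like a folder.

    Plain paths without a scheme are treated as local files.
    Raises ValueError if a link uses a scheme outside SUPPORTED_SCHEMES.
    """
    manifest = {}
    for link in links:
        scheme = urlparse(link).scheme or "file"
        if scheme not in SUPPORTED_SCHEMES:
            raise ValueError(f"unsupported storage scheme: {link!r}")
        manifest[link] = {"scheme": scheme, "is_folder": link.endswith("/")}
    return manifest


def dump_manifest(manifest):
    """Serialize the manifest; this small file is what would be versioned."""
    return json.dumps(manifest, indent=2, sort_keys=True)


def dataset_tags(manifest):
    """Return a (hypothetical) tag marking a dataset that holds links."""
    return ["links"] if manifest else []
```

A call such as `build_link_manifest(["s3://path_to_file.py", "s3://path_to_folder/"])` yields one entry per link. Because only the links are stored, nothing in this sketch detects changes to the remote objects, which is why the tag is useful as a warning that immutability is not guaranteed.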
Sounds good to me!
Hi @mctigger, :smiley: clearml 1.4.0 is now out, with support for links in clearml-data! Let us know if it works as expected!
Hi @mctigger, closing this. Please re-open if required.