
[Feature Request] Tag external datasets

Open mctigger opened this issue 3 years ago • 3 comments

Sometimes datasets cannot be added to ClearML because they need to stay on an external source. However, one can store a reference (e.g. a link) in a ClearML dataset and use that dataset as a proxy.

Request: Add a special tag to such datasets to show that they reference an external dataset and that immutability is not guaranteed.

Reference slack thread: https://clearml.slack.com/archives/CTK20V944/p1643204024027749

mctigger avatar Feb 18 '22 11:02 mctigger

Hey @mctigger, just want to make sure I got everything from the Slack channel and your message above:

We basically want to add functionality to clearml-data where, instead of uploading files to the fileserver, we upload a file containing links to the files you'd like clearml-data to version. On the other side, when you call get_local_copy() on a dataset, clearml-data will look at the links and download the files.

A few pointers:

  1. The interface should probably look like this: clearml-data add --link s3://path_to_file.py s3://path_to_folder/
  2. Supported storage media would be what we currently support in our storage-manager (s3, gs, azure, local, minio).
  3. Objects are mutable, so it's up to the user to ensure the data does not change.
  4. We will tag dataset tasks that contain "links" instead of actual files.
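
The four pointers above can be sketched roughly as follows. This is only an illustration of the idea, not the clearml-data implementation: `LinkedDataset`, `SUPPORTED_SCHEMES` and the `external-links` tag name are hypothetical, and the sketch assumes local paths carry no URL scheme and that MinIO is addressed through `s3://` URLs.

```python
from dataclasses import dataclass, field
from typing import List, Set
from urllib.parse import urlparse

# Hypothetical scheme list mirroring the storage media named above.
# Assumptions: an empty scheme means a local path, MinIO uses s3:// URLs.
SUPPORTED_SCHEMES = {"s3", "gs", "azure", "file", ""}

# Hypothetical tag name marking a dataset that holds links, not files.
EXTERNAL_LINKS_TAG = "external-links"


@dataclass
class LinkedDataset:
    """Toy stand-in for a dataset that versions links instead of files."""

    name: str
    links: List[str] = field(default_factory=list)
    tags: Set[str] = field(default_factory=set)

    def add_link(self, url: str) -> None:
        """Record a link and tag the dataset as referencing external data."""
        scheme = urlparse(url).scheme
        if scheme not in SUPPORTED_SCHEMES:
            raise ValueError(f"unsupported storage scheme: {scheme!r}")
        self.links.append(url)
        # Only the link is stored, so the target object can still change.
        # The tag warns consumers that immutability is not guaranteed.
        self.tags.add(EXTERNAL_LINKS_TAG)


dataset = LinkedDataset("proxy-dataset")
dataset.add_link("s3://path_to_file.py")
dataset.add_link("s3://path_to_folder/")
```

The tag is applied as a side effect of adding the first link, so a dataset that holds only uploaded files never carries it.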

Makes sense? Did I forget anything?

erezalg avatar Feb 27 '22 12:02 erezalg

Sounds good to me!

mctigger avatar Feb 28 '22 14:02 mctigger

Hi @mctigger, :smiley: clearml 1.4.0 is now out, with support for links in clearml-data! Let us know if it works as expected!

erezalg avatar May 05 '22 17:05 erezalg

Hi @mctigger, closing this. Please re-open if required.

jkhenning avatar Sep 12 '22 06:09 jkhenning