
Clearml-Datasets not referenced by ClearML experiments

Open jax79sg opened this issue 1 year ago • 4 comments

The spirit of data versioning is the ability to quickly and accurately know which version of data was used to produce which version of experiment models in ClearML. Today, this remains a very manual process: developers are expected to add the dataset as a configuration item to the experiment themselves. I think this severely hinders ClearML's position as an MLOps tool.

If ClearML datasets could be actively tracked, I think it would expand ClearML's capabilities to do the following:

  1. Using automagic to detect whether ClearML-Datasets was used to pull the data for model training experiments.
  2. Automatically determining the version of the dataset that was used for training, and making this information a fixture in all ClearML experiments.
  3. Upgrading the Datasets UI to include references to the experiments that used each dataset, fulfilling an important aspect of ML: provenance.
  4. Implicitly encouraging developers to use ClearML-Data because of the above advantages.

jax79sg avatar Aug 01 '22 08:08 jax79sg

Hi @jax79sg,

So, the good thing is that we have something that should make life easier :)

When calling Dataset.get(), if you add an "alias" parameter, the dataset will be automatically logged in the experiment's configuration section, and can even be overridden when cloning and running the experiment.
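Based on that description, a minimal sketch of how the alias flow might look (the project, dataset name, and alias string below are placeholders, not values from this thread):

```python
def get_training_data(project: str, name: str):
    # Imported inside the function so this sketch stays self-contained.
    from clearml import Dataset

    # Passing `alias` makes the resolved dataset ID appear in the
    # experiment's configuration section, so the task records which
    # dataset version it consumed.
    dataset = Dataset.get(
        dataset_project=project,        # hypothetical project name
        dataset_name=name,              # hypothetical dataset name
        alias="training_data",          # any string; shown in the UI config
    )
    return dataset.get_local_copy()
```

When the experiment is cloned, the alias entry in the configuration can be edited to point the clone at a different dataset version.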

In an upcoming version (along with updated docs) we'll promote this feature more.

As for item #3, do you mean that in the dataset UI, you'll be able to see which tasks used this dataset? This is an interesting idea! Care to elaborate how you thought of doing it? Where would the data be shown? Also, what would be the use case in which I'd go to the dataset tab, and look for where it is used?

Thanks!

erezalg avatar Aug 01 '22 10:08 erezalg

Hi @erezalg, on item #3: today you can see the evolution of a dataset via a pipeline display. If you were able to track which version of a dataset pipeline is used by which tasks (or model training experiments), you could potentially show a plus (+) sign allowing people to view all the experiments that run off this dataset. It would be even better if we could go into the granularity of train/val/test sets.

jax79sg avatar Aug 02 '22 02:08 jax79sg

Hi @jax79sg,

Thanks! That's indeed a good option. Can you elaborate on the granularity part? You mean, not only seeing which experiment uses the dataset, but also what portion of it (train/val/test)?

erezalg avatar Aug 02 '22 06:08 erezalg

> Thanks! That's indeed a good option. Can you elaborate on the granularity part? You mean, not only seeing which experiment uses the dataset, but also what portion of it (train/val/test)?

Yes, that's right. When someone is debugging their models, it's not of much use if they don't know which portion is train vs. test/val.

jax79sg avatar Aug 02 '22 09:08 jax79sg

Just wondering if someone is looking at this. Today, specifying an alias is the only way to link experiments to datasets. However, only the dataset ID is given, and the UI doesn't make it easy to find that dataset. https://clearml.slack.com/archives/CTK20V944/p1674333921543359?thread_ts=1674332650.308819&cid=CTK20V944
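As a workaround, the logged ID can at least be resolved back to a dataset programmatically; a small sketch (the ID lookup helper and its return shape are my own illustration, not part of the ClearML UI):

```python
def describe_dataset(dataset_id: str):
    # Imported inside the function so this sketch stays self-contained.
    from clearml import Dataset

    # Resolve the bare dataset ID (as logged in the experiment's
    # configuration section) back to a human-readable project/name.
    ds = Dataset.get(dataset_id=dataset_id)
    return ds.project, ds.name
```

This is a stopgap; surfacing the project/name (and a link) directly in the experiment UI would address the discoverability problem described above.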

jax79sg avatar Feb 16 '23 07:02 jax79sg