Enhancement Request: Improved ClearML Data Management for Child Datasets

Open Heegreis opened this issue 11 months ago • 1 comment

Proposal Summary

I've been using ClearML Data and encountered several issues with child datasets. Specifically:

  1. When a file is renamed or moved without its content changing, the "FILES CHANGED" log reports "Added 1" and "Removed 1". It would be more intuitive to record this as "Renamed", similar to Git's behavior. In addition, the fileserver keeps a duplicate copy of the file after the rename; this could be avoided by linking files in child datasets to the parent dataset's files via their SHA identifiers. To rename a file in a child dataset with ClearML Data, I removed the file and then added the same file under a new name.

  2. If a file is removed and the same file (same filename and path) is then added back, the "FILES CHANGED" log reports "Modified 1", even though the dataset content has not changed. The fileserver also stores the identical file (same filename, path, and content) a second time.
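
To make the requested behavior concrete, here is a minimal sketch of how a hash-based comparison between a parent and a child dataset could tell "Renamed" and unchanged files apart from real additions, removals, and modifications. This is not ClearML's implementation: `classify_changes`, `sha256_of`, and the path-to-hash dictionaries are hypothetical names used only for illustration, and SHA-256 is assumed as the content identifier.

```python
import hashlib
from typing import Dict, List


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a file's content (assumed identifier)."""
    return hashlib.sha256(data).hexdigest()


def classify_changes(parent: Dict[str, str], child: Dict[str, str]) -> Dict[str, list]:
    """Compare two {dataset-relative path: content hash} mappings.

    Hypothetical helper, not part of the ClearML SDK.
    """
    result: Dict[str, list] = {
        "unchanged": [], "modified": [], "renamed": [], "added": [], "removed": [],
    }

    # Same path in both versions: the hash decides unchanged vs. modified.
    for path in sorted(parent.keys() & child.keys()):
        key = "unchanged" if parent[path] == child[path] else "modified"
        result[key].append(path)

    # Paths that exist only in the parent, grouped by content hash.
    removed_by_hash: Dict[str, List[str]] = {}
    for path in sorted(parent.keys() - child.keys()):
        removed_by_hash.setdefault(parent[path], []).append(path)

    # Paths that exist only in the child: a matching hash among the
    # removed paths means the file was renamed or moved, not added.
    for path in sorted(child.keys() - parent.keys()):
        candidates = removed_by_hash.get(child[path])
        if candidates:
            result["renamed"].append((candidates.pop(0), path))
        else:
            result["added"].append(path)

    # Whatever was not matched to a rename is a genuine removal.
    for paths in removed_by_hash.values():
        result["removed"].extend(paths)
    result["removed"].sort()
    return result


# Case 1: a.txt is moved to renamed/a.txt with identical content.
parent = {"a.txt": sha256_of(b"alpha"), "b.txt": sha256_of(b"beta")}
child = {"renamed/a.txt": sha256_of(b"alpha"), "b.txt": sha256_of(b"beta")}
print(classify_changes(parent, child)["renamed"])

# Case 2: a.txt is removed and re-added unchanged, so it stays "unchanged".
print(classify_changes(parent, dict(parent))["modified"])
```

Under this scheme a renamed or re-added file carries a hash that already exists in the parent, so in principle the child dataset could reference the parent's stored copy instead of uploading it again.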

Motivation

By addressing these issues, I believe we can achieve better dataset state management and significantly reduce the fileserver's storage consumption.

Heegreis, Aug 12 '23 20:08

Thanks for the proposal, @Heegreis.

We'll look into how this can be addressed in future versions.

ainoam, Aug 13 '23 14:08