
Support dataset versioning

Open zhiltsov-max opened this issue 3 years ago • 1 comments

After a dataset is released, it often needs to be updated and maintained. During this process, and even while experimenting and building the first version, it is very desirable to be able to reproduce any previous state of the dataset. Some existing solutions already provide similar functionality and might be helpful for the implementation, for example DVC.

Requirements:

  • Revision tagging
  • Version navigation (undo/redo for a project)
    • Ability to reproduce (build, retrieve) any previous version of a project
  • Tracking of local changes
    • Local caching for intermediate versions
  • Version comparisons
    • Generation of patches
    • Version checking
  • Synchronization with remote storages
    • Checking for updates
    • Reading
    • Writing

zhiltsov-max, Mar 04 '21 10:03

In the first iteration (#238) the following features are added:

General:

  • Added an option to record dataset operations and navigate over them
    • A project now contains several datasets, which can be operated on independently
    • By default, dataset modifications are recorded and done in place
    • Operations can be applied to a specific dataset, or to the combined dataset (the "project")
  • Implemented a Git-like navigation over project states (revisions)
  • Added working tree and revision cache
  • Added data deduplication between revisions
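One common way to deduplicate data between revisions is a content-addressed cache, similar in spirit to what DVC does. The sketch below only illustrates that idea; it is not taken from #238. The names `file_hash` and `store`, the SHA-256 choice, and the cache layout are all assumptions made for the example.

```python
import hashlib
import shutil
from pathlib import Path


def file_hash(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def store(path: Path, cache_dir: Path) -> str:
    """Copy a file into the cache under its content hash (hypothetical layout).

    Files with identical content map to the same cache entry, so a file
    shared by several revisions is stored only once.
    """
    digest = file_hash(path)
    dst = cache_dir / digest[:2] / digest[2:]
    if not dst.exists():
        dst.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(path, dst)
    return digest
```

With such a layout, a revision only needs to record a mapping from file names to hashes. Hashing every file on import has a cost, which is the kind of overhead listed under the open problems.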

Glossary:

  • Added new versioning concepts:
    • revision - corresponds to a Git commit
    • working / head / revision tree - a project build tree and plugins at a specified revision
    • object - a revision or a dataset
  • Added new dataset path concepts:
    • source / dataset / revision / project paths - a path to a dataset in a special format
      • [project local] rev[ision] paths - a way to specify the path to a source revision in the CLI. The syntax is <revision>:<source/target name>. Any part can be omitted. A revision is a commit hash or a named reference from Git (a branch, a tag, HEAD~3, etc.).
      • full revpaths - a way to specify the path to a dataset of a source revision in a project. The syntax is one of:
        • <dataset path>:<format>, where the format is optional
        • <project path>@<revision>:<target name>, where any part can be omitted. The default project is the current project (the -p CLI argument), the default revision is the working directory of the project, and the default target is the full compiled project (targets are the source names and stages: filters, transforms, etc.).
    • I considered adding a #<subset> suffix at the end, but decided to leave it for the future.
  • Added new dataset building concepts:
    • data source - basically, a URL + a dataset format name
    • stage - a modification of a data source: a transformation, a filter, or something else
    • build tree - a directed graph (tree) with leaf nodes at data sources and a single root node called "project"
    • build target - a data source or a stage
    • pipeline - a subgraph of a build target
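To make the <project path>@<revision>:<target name> syntax concrete, here is a minimal parsing sketch. It is not the parser used in the project: `RevPath` and `parse_full_revpath` are hypothetical names, and the handling of a bare token (treated as a revision after "@", otherwise as a target name) is an assumption. A real implementation would likely resolve such ambiguity against the project contents.

```python
from typing import NamedTuple, Optional


class RevPath(NamedTuple):
    project: Optional[str]   # None means the current project
    revision: Optional[str]  # None means the working directory
    target: Optional[str]    # None means the full compiled project


def parse_full_revpath(spec: str) -> RevPath:
    """Split '<project path>@<revision>:<target name>' into its parts.

    Any part may be missing. This sketch does not handle revisions or
    names that themselves contain '@' or ':'.
    """
    project, at, rest = spec.rpartition("@")
    if not at:
        # No '@' present: the whole string is '<revision>:<target name>'
        project, rest = "", spec

    revision, colon, target = rest.partition(":")
    if not colon:
        if at:
            # Assumption: 'proj@token' names a revision
            revision, target = rest, ""
        else:
            # Assumption: a bare token names a target
            revision, target = "", rest

    return RevPath(project or None, revision or None, target or None)
```

For example, `parse_full_revpath("HEAD~3:source1")` yields no project, the revision `HEAD~3`, and the target `source1`, so the defaults described above apply to the missing project part.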

CLI:

  • Added local revpath and full revpath concepts in CLI
  • Added source add, commit, checkout, log CLI commands
  • Removed import, project merge CLI commands
  • Merged diff and ediff into a single diff command
  • diff, merge, and explain now accept source / dataset / revision / project specs

API:

  • Project is completely rewritten and has a new interface.
  • Project file layout is changed
    • Added v1->v2 migration when loading an old project. The old project is rewritten in place.

Open problems:

  • In-place saving when subsets are removed (#348)
  • Performance of hashing on source import
    • Import URL cache / a way to avoid redownloading and caching of a URL (an optimization)

zhiltsov-max, Jul 26 '21 10:07