datumaro
Support dataset versioning
After a dataset is released, it often needs to be updated and maintained. During this process, and even while experimenting and building the first version, it is highly desirable to be able to reproduce any previous state of the dataset. Some existing solutions already provide similar functionality and might help with the implementation: DVC.
Requirements:
- Revision tagging
- Version navigation (undo/redo for a project)
- Ability to reproduce (build, retrieve) any previous version of a project
- Tracking of local changes
- Local caching for intermediate versions
- Version comparisons
- Generation of patches
- Version checking
- Synchronization with remote storages
  - Checking for updates
  - Reading
  - Writing
In the first iteration (#238) the following features are added:
General:
- Added an option to record dataset operations and navigate over them
- A project now contains several datasets, which can be operated on independently
- By default, dataset modifications are recorded and applied in place
- Operations can be applied to a specific dataset, or to the combined dataset (the "project")
- Implemented a Git-like navigation over project states (revisions)
- Added working tree and revision cache
- Added data deduplication between revisions
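One way data deduplication between revisions could work is content-addressed storage: each file is stored under the hash of its contents, and a revision only records which hashes it uses. The sketch below is a minimal illustration under that assumption, not the implementation from #238; `ObjectCache` and `commit` are hypothetical names.

```python
import hashlib


class ObjectCache:
    """A toy content-addressed store: identical data is kept only once."""

    def __init__(self):
        self._objects = {}  # content hash -> bytes

    def put(self, data: bytes) -> str:
        # The key is derived from the content, so equal data maps to one entry.
        digest = hashlib.sha256(data).hexdigest()
        self._objects.setdefault(digest, data)
        return digest

    def get(self, digest: str) -> bytes:
        return self._objects[digest]

    def __len__(self):
        return len(self._objects)


def commit(cache: ObjectCache, files: dict) -> dict:
    """Store a snapshot and return a revision manifest {path: content hash}."""
    return {path: cache.put(data) for path, data in files.items()}


cache = ObjectCache()
rev1 = commit(cache, {"a.jpg": b"image-a", "b.jpg": b"image-b"})
rev2 = commit(cache, {"a.jpg": b"image-a", "b.jpg": b"image-b-edited"})
```

Here the two revisions share the unchanged `a.jpg`, so the cache holds three objects rather than four. Hashing every file on import is also where the performance cost comes from.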
Glossary:
- Added new versioning concepts:
  - revision - corresponds to a Git commit
  - working / head / revision tree - a project build tree and plugins at a specified revision
  - object - a revision or a dataset
- Added new dataset path concepts:
  - source / dataset / revision / project paths - a path to a dataset in a special format
  - [project local] rev[ision] paths - a way to specify the path to a source revision in the CLI. The syntax is `<revision>:<source/target name>`. Any part can be missing. A revision is a commit hash or a named reference from Git (branch, tag, HEAD~3 etc.).
  - full revpaths - a way to specify the path to a dataset or a source revision in a project. The syntax is:
    - `<dataset path>:<format>`, where `format` is optional
    - `<project path>@<revision>:<target name>`, where any part can be missing. The default project is the current project (`-p` CLI arg.), the default revision is the working directory of the project, and the default target is the full compiled project (targets are the source names and stages - filters, transforms etc.).
  - I was thinking of adding `#<subset>` at the end, but decided to leave it for the future.
- Added new dataset building concepts:
  - data source - basically, a URL + dataset format name
  - stage - a modification of a data source. A transformation, filter or something else.
  - build tree - a directed graph (tree) with leaf nodes at data sources and a single root node called "project"
  - build target - a data source or a stage
  - pipeline - a subgraph of a build target
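To make the building concepts concrete, here is a small sketch of a build tree and of extracting the pipeline for a target. The node names, the dict-based representation, and the `pipeline` helper are assumptions made for illustration; they are not the actual project layout or API.

```python
# Hypothetical build tree: each node lists the nodes it is built from.
build_tree = {
    "source1": [],                                 # data source (leaf)
    "source2": [],                                 # data source (leaf)
    "source1.filter": ["source1"],                 # stage applied to source1
    "source1.transform": ["source1.filter"],       # stage applied after the filter
    "project": ["source1.transform", "source2"],   # single root node
}


def pipeline(tree, target):
    """Return the nodes needed to build `target`, dependencies first."""
    order = []
    seen = set()

    def visit(node):
        if node in seen:
            return
        seen.add(node)
        for dependency in tree[node]:
            visit(dependency)
        order.append(node)

    visit(target)
    return order


stage_pipeline = pipeline(build_tree, "source1.transform")
project_pipeline = pipeline(build_tree, "project")
```

Under these assumptions, building a stage only touches its own source, while building the "project" target walks the whole tree.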
CLI:
- Added local revpath and full revpath concepts in CLI
- Added `source add`, `commit`, `checkout`, `log` CLI commands
- Removed `import`, `project merge` CLI commands
- `diff` and `ediff` are joined into a single `diff` command
- `diff`, `merge`, `explain` now accept source / dataset / revision / project specs
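The `<project path>@<revision>:<target name>` syntax from the glossary could be split into its parts roughly as shown below. This is a simplified sketch, not the parser used by the CLI: `RevPath` and `parse_full_revpath` are hypothetical names, a bare token without separators is assumed to be a revision, and paths that themselves contain `@` or `:` (for example Windows drive letters) are not handled.

```python
from typing import NamedTuple, Optional


class RevPath(NamedTuple):
    project: Optional[str]   # None -> the current project
    revision: Optional[str]  # None -> the working directory of the project
    target: Optional[str]    # None -> the full compiled project


def parse_full_revpath(spec: str) -> RevPath:
    """Split '<project path>@<revision>:<target name>'; any part may be missing."""
    project, at, rest = spec.partition("@")
    if not at:
        # No "@": there is no project part, the spec is "<revision>:<target name>".
        project, rest = "", spec
    # Assumption: without ":" the remaining token is treated as a revision.
    revision, _, target = rest.partition(":")
    return RevPath(project or None, revision or None, target or None)


full = parse_full_revpath("proj@HEAD~3:source1")
local = parse_full_revpath("HEAD~3:source1")
target_only = parse_full_revpath(":source1")
```

Missing parts come back as `None`, which the caller can then replace with the defaults described in the glossary.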
API:
- `Project` is completely rewritten and has a new interface.
- Project file layout is changed
- Added v1->v2 migration on loading of an old project. The old project will be rewritten in place.
Open problems:
- In-place saving when subsets are removed (#348)
- Performance of hashing on source import
- Import URL cache / a way to avoid redownloading and caching of a URL (an optimization)