datumaro
[WIP] Remote repositories for versioning
Summary
Related to #130, #131
Key changes:
- Added integration with Git and DVC for versioning.
- Added support for remote repository for dataset configuration.
- Added support for remote data sources for project datasets (via DVC; available source types: HTTP, S3, Git, DVC).
- A project now consists of a number of data sources - local or remote ones.
- Removed support of project's own datasets.
CLI changes:
- Modifying operations on project data (`transform`, `filter`, `export`) are now recorded. They can be reproduced later with the `build` command.
- Updated config file structure. Old projects can be read, but they will be saved with a new version.
- Updated installation: to install Datumaro with Git and DVC support, add the `[VCS]` suffix: `pip install <url>[VCS]`
- Added a number of versioning commands in CLI: `tag`, `pull`, `push`, `checkout`, `commit`
- Added `remote` CLI context to interact with data remotes of a project
- Added `repo` CLI context to interact with bound Git repositories of a project
Library changes:
- `Project` class has been significantly changed; however, most of the code should work with minimal or no changes.
- `Project`s without binding to a local disk are considered `detached`. In this mode a `Project` can only interact with locally available data (no remotes), mostly the way it worked prior to these changes. No versioning capabilities are available in this mode.
How to test
Checklist
- [x] I submit my changes into the `develop` branch
- [ ] I have added description of my changes into CHANGELOG
- [ ] I have updated the documentation accordingly
- [x] I have added tests to cover my changes
- [ ] I have linked related issues
License
- [x] I submit my code changes under the same MIT License that covers the project. Feel free to contact the maintainers if that's a concern.
- [x] I have updated the license header for each file (see an example below)
# Copyright (C) 2020 Intel Corporation
#
# SPDX-License-Identifier: MIT
@nmanovic, implemented:
datum create
# addition (url format here: https://dvc.org/doc/command-reference/import-url)
# with auto remotes:
datum add path/ -f image_dir
datum add path/to.json -f coco_instances
# with manual remotes:
datum remote add s3://net.loc -n r1
datum source add remote://r1/path/to.xml -f cvat
datum filter (not checked)/transform # copying variant
datum export
datum build
datum commit
datum source *
datum remote *
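As an illustration of the `remote://r1/path/to.xml` form used with manual remotes above, here is a rough sketch of how such a URL could be expanded against a table of named remotes. The `resolve_source_url` function and the `remotes` mapping are hypothetical and are not taken from this patch; in the patch itself, remote handling is delegated to DVC.

```python
from urllib.parse import urlsplit


def resolve_source_url(url: str, remotes: dict) -> str:
    """Expand remote://<name>/<path> using a name -> base URL mapping.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    if parts.scheme != "remote":
        # Plain paths and ordinary URLs are returned unchanged.
        return url
    name = parts.netloc
    if name not in remotes:
        raise KeyError(f"unknown remote: {name!r}")
    # Join the remote's base URL with the path part of the source URL.
    return remotes[name].rstrip("/") + parts.path


remotes = {"r1": "s3://net.loc"}
print(resolve_source_url("remote://r1/path/to.xml", remotes))
# prints: s3://net.loc/path/to.xml
```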
@zhiltsov-max, could you please resolve conflicts? Are you going to move to GitHub Actions (ask Anastasia to help)?
Where can I find installation variants in the documentation? For example, `pip install -e .[vcs]`?
TODOs:
- [ ] `datum convert` with a project source (another PR)
- [ ] Pretty, useful and reliable output of `datum status`
- [ ] Documentation, installation info
- [ ] A section about using DVC and Git directly
- [ ] A section about internal implementation and project structure
- [x] `datum model` update
- [ ] `datum (e)diff` with 2 revisions (another PR)
- [ ] `datum merge` with 2 revisions (another PR)
- [x] cli tests
Pushing to sources is not in scope of this patch.
@zhiltsov-max , should we close the PR?
It will be continued after the first one as "remote sources support".
What happened to this PR? Versioning sounds like a very good idea for dataset management
@leeyh20, it is split into 2 parts - this one with remotes and #238 with local commands.
Any update on this?
@JaviFuentes94, not yet. Currently, we have no resources for this task. Ideas and suggestions on this functionality are welcome, though. Could you describe your use cases?