Cloud versioning
Idea
Cloud storages (S3, GCS, Azure Blob) support file versioning. Why not use cloud versioning at the storage level instead of the DVC cache, while keeping the DVC cache locally?
Motivation: Because of DVC's name, many users consider DVC a data management and data versioning tool. They expect DVC to work with their existing datasets (for example, an existing directory in S3). They try DVC and stop using it because DVC requires them to convert the dataset to the DVC cache format.
Solution
DVC has to use cloud versioning where possible while keeping the local cache as is.
More details:
1. In order to use DVC, users have to give up the human-readable layout of the input dataset directory (in S3) and convert it to the DVC cache format, or duplicate the data (see 2) without data lineage (see 3).
2. Data duplication: the human-readable dataset directory (which is usually still needed) plus the DVC cache.
3. They need to support a workflow for synchronizing the directory and the DVC cache.

Human-readable cloud storage:
.
├── dir1
│   └── f2
├── dir2
│   └── f3
└── f1
f1 has 4 versions, dir1/f2 has 2 versions, and dir2/f3 has 3 versions.
3 snapshots are versioned by DVC & Git.
Note that some of the file versions are not versioned by DVC but still exist in cloud storage: version 3 of f1 and version 2 of dir2/f3.
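For illustration, those extra cloud versions can be inspected directly with cloud tooling, independently of DVC. A minimal sketch using the AWS CLI (bucket and prefix are hypothetical):

```bash
# List every stored version of f1 in the versioned bucket, including
# versions that are not referenced by any DVC/Git snapshot
# (e.g. "version 3 of f1" above).
aws s3api list-object-versions \
  --bucket mybucket \
  --prefix dataset/f1
```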
User workflow
The user workflow should stay the same:
$ dvc init
$ dvc add data.xml
$ git add data.xml.dvc .gitignore
$ dvc remote add -d storage s3://mybucket/dvcstore
$ dvc push
New change: dvc push should recognize whether the storage supports versioning (it should be enabled at the bucket level), and if versioning is enabled, DVC should copy the file and create a new version.
As a result, a user will see a file s3://mybucket/dvcstore/data.xml instead of s3://mybucket/dvcstore/f3/6e4c9d82b2fd7d8680271716d47406.
If a file is modified and pushed again, a user will see the same s3://mybucket/dvcstore/data.xml, but its version will change.
If versioning is not supported (by the bucket or by the storage type, like NFS), DVC should create a regular cache directory in the cloud.
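Since this behavior hinges on bucket-level versioning, here is a hedged sketch of how a user could check and enable it on S3 before pushing (the bucket name is hypothetical; how DVC itself detects this is an implementation detail of the proposal):

```bash
# Check whether versioning is enabled on the bucket used by the DVC remote...
aws s3api get-bucket-versioning --bucket mybucket

# ...and enable it if it is not, so that dvc push can create object versions.
aws s3api put-bucket-versioning \
  --bucket mybucket \
  --versioning-configuration Status=Enabled
```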
Internals
DVC has to maintain a mapping from filenames to cloud versions in the .dvc and dvc.lock files and perform all operations (pull/push/checkout) in the regular way.
The source of truth for versioning?
DVC versions a set of files as a single snapshot/commit, not individual files. A Git commit with its .dvc and dvc.lock files is the source of truth for versioning.

Some file modifications might happen in cloud storage. Such modifications should not break versioning until a particular version of a file is removed.
Note that the files in the cloud directory do not necessarily represent the newest version of the dataset/workspace, because someone can overwrite a file (create a new version), like creating version 5 of f1 in the diagram above without committing it.
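To make the source-of-truth point concrete: because the committed metadata pins specific cloud version IDs, a pull/checkout can always retrieve the committed version even if the object was later overwritten. A sketch with the AWS CLI (bucket, key, and version ID are placeholders):

```bash
# VERSION_ID is whatever version the committed .dvc / dvc.lock metadata recorded.
VERSION_ID="<version id from the committed metadata>"

# Retrieve that exact object version, regardless of newer, uncommitted
# versions that may exist in the bucket.
aws s3api get-object \
  --bucket mybucket \
  --key dvcstore/data.xml \
  --version-id "$VERSION_ID" \
  data.xml
```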
Out of scope
- New flows where the user modifies the S3 directory manually in the AWS console UI and propagates the change to the DVC repo (like dvc update-from-s3-head).
Technical risks
- Cloud versioning is not supported in the fsspec library that DVC uses to access storages. From @pmrowla:
Currently `fsspec` dependencies only include support for cloud versioning in S3 (not Azure and GCS).
The underlying Microsoft/Google SDKs/libraries used in the fsspec Azure and GCS filesystems do
support the cloud versioning features for each platform, but we will still have to contribute the
implementation in fsspec to expose the cloud versioning features in the respective fsspec filesystems
(and standardize them for S3/Azure/GCS)
- A .dvc file format change is needed.
  a. We might need to add cloud versions of files to the .dvc and dvc.lock files in addition to md5 checksums.
  b. Is it possible to handle directories as is? E.g. dir1/dir2/f.csv is stored to s3://mybucket/dvcstore/dir1/dir2/f.csv.
     - It looks like .dir files are needed only for the local DVC cache to reflect the cloud storage structure. So cloud directories work as is, while local/DVC-cache directories should be handled as usual in DVC.
     - A shortcut: push the .dir files to the cloud (an okay solution for this proposal that can be improved later).
     - Possible later improvement: a special mode for repositories/projects with a small number of files that saves all meta info to .dvc and dvc.lock.
  c. Backward compatibility: can we make cloud versioning optional to keep compatibility until we decide to fully move to it?
- What to do about the run-cache? An obvious solution: push it as is to the cloud when the user asks (--run-cache option); see the sketch after this list.
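A hedged sketch of the directory-handling and run-cache points above (bucket and paths are hypothetical; only --run-cache is an existing dvc push flag):

```bash
# With cloud versioning, a tracked directory could map one-to-one onto
# object keys instead of content-addressed cache paths, e.g.:
aws s3 ls --recursive s3://mybucket/dvcstore/dir1/
#   dir1/dir2/f.csv    <- same relative path as in the workspace

# The run-cache can still be pushed on demand with the existing flag:
dvc push --run-cache
```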
Related links
- #3920
- #5923
The versioned cloud cache is more like a workspace than a cache space.
Thanks to the great work (e.g. https://github.com/iterative/dvc/issues/8106, etc.) by @pmrowla, we now have a working POC in the main branch. E.g.:
#!/bin/bash
set -e
set -x
rm -rf repo
mkdir repo
CONTAINER=mytestcontainer
BLOB=subdir/foo
MYREMOTE=azure://$CONTAINER/subdir
AZURE_STORAGE_CONNECTION_STRING="MYCONSTRING"
pushd repo
git init
dvc init
echo foo > foo
dvc add foo
git add foo.dvc .gitignore
cat foo.dvc
git commit -m "add foo"
dvc remote add -d myremote $MYREMOTE
dvc remote modify myremote connection_string $AZURE_STORAGE_CONNECTION_STRING
dvc remote modify myremote version_aware true
dvc remote modify myremote worktree true
dvc push
rm -rf .dvc/cache
rm -rf foo
dvc pull
cat foo | grep foo
cat foo.dvc
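For reference, a hedged sketch of what the final cat foo.dvc might print with a version-aware remote; the field names here are illustrative, and the exact schema is discussed further below:

```bash
# Run inside the repo created by the script above.
cat foo.dvc
# Illustrative output only -- the metadata is expected to record a cloud
# version ID per file next to the usual md5/size entries, roughly like:
#
#   outs:
#   - path: foo
#     files:
#     - relpath: foo
#       md5: d3b07384d113edec49eaa6238ad5ff00
#       size: 4
#       version_id: <cloud version id>
```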
@dberenbaum Can we have some high-level examples here of how we want the common use cases to look (which commands/flags would be used, how the metadata looks, etc.)?
My understanding and doubts regarding current status/proposals:
Setup
Common for all use cases:
dvc remote add -d myremote $MYREMOTE
dvc remote modify myremote version_aware true
dvc remote modify myremote worktree true
Adding data to the project
Depends on where the existing data is located:
Local
| Current Status | #8826 |
| --- | --- |
Resulting .dvc file:
| Current Status | #8357 |
| --- | --- |
Remote
Not even sure how this works conceptually
| Current Status | #8704 |
| --- | --- |
| dvc import-url --version-aware $MYREMOTE/data | |
Resulting .dvc file:
| Current Status | #8357 |
| --- | --- |
| | Should it look the same as the Local example above? |
Track updates
My understanding is that, no matter the origin of the data, we should end up in an equivalent project state.
Depends on where the updates were done:
Local
$ dvc add data
$ dvc push
Remote
$ dvc pull ?? / dvc update ??
Good questions @daavoo.
Here are some minor clarifications:
Common for all use cases:
dvc remote add -d myremote $MYREMOTE
dvc remote modify myremote version_aware true
dvc remote modify myremote worktree true
You only need one or the other of the dvc remote modify lines now.
Also, for #8826 (or if you do import-url), this step isn't needed. If you want to use cloud versioning remotes like cache remotes, where you have one to represent your entire repo, then you need to setup the remote. However, if you only want to version an existing dataset, you should not need to configure the remote manually.
Resulting .dvc file:
This was more of a rough suggestion. It sounds like the local and remote info may be better kept separate. I'm hoping the team can discuss and offer a better schema.
Should it look the same as the Local example above?
Yes.
For a version_aware remote, it's basically the same use cases as cache remotes but more readable (for example, https://github.com/iterative/iterative.ai/issues/690).
For a worktree remote, here are some possible use cases:
Integration with non-DVC tools
Example: Iterating between a data labeling tool and DVC (for model training).
- I start with a dataset in the cloud and connect my data labeling tool to read from it.
- I use dvc add [--worktree?] ... to add it to my repo.
- Based on the model results, I modify my dataset and dvc push back to the cloud.
- I sync that data back to my labeling tool, then add some new data to my cloud storage outside of DVC and label it.
- I use dvc update ... to fetch the latest data and labels and retrain (see the sketch after this list).
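A hedged sketch of the command sequence this loop implies; flags and paths are placeholders, not finalized syntax (whether dvc add grows a --worktree flag is exactly what is being discussed here):

```bash
# Track the existing cloud dataset in the repo (exact form/flag is TBD).
dvc add s3://mybucket/labeled-data        # "[--worktree?]" per the item above

# After modifying the dataset based on model results,
# push the changes back to the cloud location the labeling tool reads from.
dvc push

# New data and labels were added in the cloud outside of DVC:
# fetch the latest versions before retraining.
dvc update labeled-data.dvc
```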
Integration with non-DVC users
Example: Making a data registry where consumers don't need to know DVC.
- Start with an existing data registry repo that has everything stored in a cache remote.
- Add and push to a worktree remote with dvc remote add ... -> dvc remote modify ... -> dvc push -r ....
- Consumers with read-only access get the latest data from cloud storage without DVC (using the aws cli, boto, etc.). They don't have to know that the data registry is backed by a Git repo or uses DVC (users who need to specify a version of the dataset can still use dvc import, but others may be fine to always use the latest version).
- To update the data, someone still goes through the normal data registry process of submitting a PR with the changes and pushing the data to the cache remote. When the PR is merged, the updated data is pushed to the worktree remote (a sketch follows this list).
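A hedged sketch of the producer/consumer split described above (remote name, bucket, and paths are hypothetical):

```bash
# Producer side: add a worktree remote to the existing registry repo and
# push a human-readable copy of the approved data for consumers to read.
dvc remote add consumers s3://my-registry-public
dvc remote modify consumers worktree true
dvc push -r consumers

# Consumer side: read the latest data with plain cloud tooling, no DVC needed.
aws s3 cp s3://my-registry-public/datasets/labels.csv .
```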
Use pipelines while keeping data in the cloud (#8411)
Example: I want to use DVC pipelines, but I don't want to move my data that is already in the cloud or refactor my code that reads/writes from/to the cloud directly (or the data is too big to keep a local copy).
- Set up one or more stages where all of the data is directly in the cloud, like dvc stage add -n prep -d s3://mybucket/raw_data -o s3://prepped_data .... TBD the exact syntax; let's discuss in #8411. Unlike current external outputs, there's no need to set up a remote cache. This should "just work" like local outputs, with at most an additional flag needed in dvc stage add (see the sketch after this list).
- Use dvc repro/exp run to run the pipeline and capture the version IDs in dvc.lock. The cloud-versioned outputs in dvc.lock should look similar to what's in the .dvc files for cloud-versioned datasets.
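A hedged sketch of such a pipeline; the exact syntax is TBD per #8411, and the bucket and script names are hypothetical:

```bash
# Define a stage whose dependency and output both live directly in the cloud;
# the flag(s) needed for cloud-versioned outputs are still being discussed.
dvc stage add -n prep \
  -d s3://mybucket/raw_data \
  -o s3://mybucket/prepped_data \
  python prep.py

# Run the pipeline and record the resulting cloud version IDs in dvc.lock.
dvc repro
```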
If I understand correctly, Integration with non-DVC users and Integration with non-DVC tools (not sure about pipelines) can be applied to regular version_aware remotes.
@dberenbaum I'm not sure if we have already a place in docs for this, but we can move your examples there.
There are some critical differences:
- What a non-DVC user sees in the current version on cloud may not match the latest versions pushed from DVC.
- There's no way to both push changes to a DVC remote and retrieve non-DVC updates from that same location (you can either push to a cloud-versioned remote or use import-url to get updates for non-DVC data).
Thoughts on the different scenarios mentioned above:
Integration with non-DVC tools
Example: Iterating between a data labeling tool and DVC (for model training).
For most use cases, it seems like the better path here is to use DVC to track the annotations and not bother tracking the raw data, which is often immutable/append-only.
I also don't see that tools like Label Studio currently support cloud version IDs.
Integration with non-DVC users
Example: Making a data registry where consumers don't need to know DVC.
With or without worktree, we probably need both: somewhere producers can push changes for review without modifying what consumers see, and another place with the latest approved versions for consumers. I think the workflow here could be accomplished by having a CI/CD job sync the latest versions to a separate consumer-facing URL when new data is merged into the main branch (or we could consider something like export-url again to do this within DVC); a rough sketch follows.
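A hedged sketch of such a CI job (bucket names and paths are placeholders):

```bash
# Runs after a data PR is merged: pull the approved data from the DVC remote
# and mirror it to the consumer-facing bucket that non-DVC users read directly.
git checkout main
dvc pull data/
aws s3 sync data/ s3://my-registry-public/data/ --delete
```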
Pipelines
This one might make sense to support in the future with version_aware (see #8411) since I don't see a great alternative.