Cloud versioning
Idea
Cloud storages (S3, GCS, Azure Blob) support file versioning. Why not use cloud versioning at the storage level instead of the DVC cache, while keeping the DVC cache locally?
Motivation: Because of DVC's name, many users consider DVC a data management and data versioning tool. They expect DVC to work with their existing datasets (for example, an existing directory in S3). They try DVC and stop using it because DVC requires them to convert the dataset to the DVC cache format.
Solution
DVC has to use cloud versioning where possible while keeping the local cache as is.
More details:
1. In order to use DVC, users have to give up the human-readable layout of the input dataset directory (in S3) and convert it to the DVC cache format, or duplicate the data (see 2) without data lineage (see 3).
2. Data duplication: the human-readable dataset directory (which is usually still needed) plus the DVC cache.
3. They need to support a workflow for synchronizing the directory and the DVC cache.

Human-readable cloud storage:
.
├── dir1
│   └── f2
├── dir2
│   └── f3
└── f1
f1 has 4 versions, dir1/f2 has 2 versions, and dir2/f3 has 3 versions.
3 snapshots are versioned by DVC & Git.
Note that some of the file versions are not versioned by DVC but still exist in cloud storage: version 3 of f1 and version 2 of dir2/f3.
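For illustration, those extra cloud versions can be inspected directly with cloud tooling, independently of DVC. A minimal sketch using the AWS CLI (bucket and prefix are hypothetical):

```bash
# List every stored version of f1 in the versioned bucket, including
# versions that are not referenced by any DVC/Git snapshot
# (e.g. "version 3 of f1" above).
aws s3api list-object-versions \
  --bucket mybucket \
  --prefix dataset/f1
```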
User workflow
The user workflow should stay the same:
$ dvc init
$ dvc add data.xml
$ git add data.xml.dvc .gitignore
$ dvc remote add -d storage s3://mybucket/dvcstore
$ dvc push
New change: dvc push should recognize whether the storage supports versioning (it should be enabled at the bucket level), and if versioning is enabled, DVC should copy the file and create a new version.
As a result, a user will see a file s3://mybucket/dvcstore/data.xml instead of s3://mybucket/dvcstore/f3/6e4c9d82b2fd7d8680271716d47406.
If a file is modified and pushed again, a user will see the same s3://mybucket/dvcstore/data.xml, but its version will change.
If versioning is not supported (by the bucket or by the storage type, like NFS), DVC should create a regular cache directory in the cloud.
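Since this behavior hinges on bucket-level versioning, here is a hedged sketch of how a user could check and enable it on S3 before pushing (the bucket name is hypothetical; how DVC itself detects this is an implementation detail of the proposal):

```bash
# Check whether versioning is enabled on the bucket used by the DVC remote...
aws s3api get-bucket-versioning --bucket mybucket

# ...and enable it if it is not, so that dvc push can create object versions.
aws s3api put-bucket-versioning \
  --bucket mybucket \
  --versioning-configuration Status=Enabled
```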
Internals
DVC has to maintain a mapping from filenames to cloud versions in the .dvc and dvc.lock files and perform all operations (pull/push/checkout) in the regular way.
The source of truth for versioning?
DVC versions a set of files as a single snapshot/commit, not individual files. A Git commit with its .dvc and dvc.lock files is the source of truth for versioning.

Some file modifications might happen in cloud storage. Such modifications should not break versioning until a particular version of a file is removed.
Note that the files in the cloud directory do not necessarily represent the newest version of the dataset/workspace, because someone can overwrite a file (create a new version), like creating version 5 of f1 in the diagram above without committing it.
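To make the source-of-truth point concrete: because the committed metadata pins specific cloud version IDs, a pull/checkout can always retrieve the committed version even if the object was later overwritten. A sketch with the AWS CLI (bucket, key, and version ID are placeholders):

```bash
# VERSION_ID is whatever version the committed .dvc / dvc.lock metadata recorded.
VERSION_ID="<version id from the committed metadata>"

# Retrieve that exact object version, regardless of newer, uncommitted
# versions that may exist in the bucket.
aws s3api get-object \
  --bucket mybucket \
  --key dvcstore/data.xml \
  --version-id "$VERSION_ID" \
  data.xml
```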
Out of scope
- New flows where the user modifies the S3 directory manually in the AWS console UI and propagates the change to the DVC repo (like dvc update-from-s3-head).
Technical risks
- Cloud versioning is not supported in the fsspec library that DVC uses to access storages. From @pmrowla:
Currently `fsspec` dependencies only include support for cloud versioning in S3 (not Azure and GCS).
The underlying Microsoft/Google SDKs/libraries used in the fsspec Azure and GCS filesystems do
support the cloud versioning features for each platform, but we will still have to contribute the
implementation in fsspec to expose the cloud versioning features in the respective fsspec filesystems
(and standardize them for S3/Azure/GCS)
- A .dvc file format change is needed.
  a. We might need to add cloud versions of files to the .dvc and dvc.lock files in addition to md5 checksums.
  b. Is it possible to handle directories as is? E.g. dir1/dir2/f.csv is stored to s3://mybucket/dvcstore/dir1/dir2/f.csv.
     - It looks like .dir files are needed only for the local DVC cache to reflect the cloud storage structure. So cloud directories work as is, while local/DVC-cache directories should be handled as usual in DVC.
     - A shortcut: push the .dir files to the cloud (an okay solution for this proposal that can be improved later).
     - Possible later improvement: a special mode for repositories/projects with a small number of files that saves all meta info to .dvc and dvc.lock.
  c. Backward compatibility: can we make cloud versioning optional to keep compatibility until we decide to fully move to it?
- What to do about the run-cache? An obvious solution: push it as is to the cloud when the user asks (--run-cache option); see the sketch after this list.
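A hedged sketch of the directory-handling and run-cache points above (bucket and paths are hypothetical; only --run-cache is an existing dvc push flag):

```bash
# With cloud versioning, a tracked directory could map one-to-one onto
# object keys instead of content-addressed cache paths, e.g.:
aws s3 ls --recursive s3://mybucket/dvcstore/dir1/
#   dir1/dir2/f.csv    <- same relative path as in the workspace

# The run-cache can still be pushed on demand with the existing flag:
dvc push --run-cache
```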
Related links
- #3920
- #5923
The versioned cloud cache is more like a workspace than a cache space.
Thanks to the great work (e.g. https://github.com/iterative/dvc/issues/8106, etc.) by @pmrowla, we now have a working POC in the main branch. E.g.:
#!/bin/bash
set -e
set -x
rm -rf repo
mkdir repo
CONTAINER=mytestcontainer
BLOB=subdir/foo
MYREMOTE=azure://$CONTAINER/subdir
AZURE_STORAGE_CONNECTION_STRING="MYCONSTRING"
pushd repo
git init
dvc init
echo foo > foo
dvc add foo
git add foo.dvc .gitignore
cat foo.dvc
git commit -m "add foo"
dvc remote add -d myremote $MYREMOTE
dvc remote modify myremote connection_string $AZURE_STORAGE_CONNECTION_STRING
dvc remote modify myremote version_aware true
dvc remote modify myremote worktree true
dvc push
rm -rf .dvc/cache
rm -rf foo
dvc pull
cat foo | grep foo
cat foo.dvc
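For reference, a hedged sketch of what the final cat foo.dvc might print with a version-aware remote; the field names here are illustrative, and the exact schema is discussed further below:

```bash
# Run inside the repo created by the script above.
cat foo.dvc
# Illustrative output only -- the metadata is expected to record a cloud
# version ID per file next to the usual md5/size entries, roughly like:
#
#   outs:
#   - path: foo
#     files:
#     - relpath: foo
#       md5: d3b07384d113edec49eaa6238ad5ff00
#       size: 4
#       version_id: <cloud version id>
```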
@dberenbaum Can we have some high-level examples here of how we want the common use cases to look (which commands/flags would be used, how the metadata looks, etc.)?
My understanding and doubts regarding current status/proposals:
Setup
Common for all use cases:
dvc remote add -d myremote $MYREMOTE
dvc remote modify myremote version_aware true
dvc remote modify myremote worktree true
Adding data to the project
Depends on where the existing data is located:
Local
| Current Status | #8826 |
| --- | --- |
Resulting .dvc file:
| Current Status | #8357 |
| --- | --- |
Remote
Not even sure how this works conceptually
| Current Status | #8704 |
| --- | --- |
| dvc import-url --version-aware $MYREMOTE/data | |
Resulting .dvc file:
| Current Status | #8357 |
| --- | --- |
| | Should it look the same as the Local example above? |
Track updates
My understanding is that, no matter the origin of the data, we should end up in an equivalent project state.
Depends on where the updates were done:
Local
$ dvc add data
$ dvc push
Remote
$ dvc pull ?? / dvc update ??
Good questions @daavoo.
Here are some minor clarifications:
Common for all use cases:
dvc remote add -d myremote $MYREMOTE
dvc remote modify myremote version_aware true
dvc remote modify myremote worktree true
You only need one or the other of the dvc remote modify lines now.
Also, for #8826 (or if you do import-url), this step isn't needed. If you want to use cloud versioning remotes like cache remotes, where you have one to represent your entire repo, then you need to setup the remote. However, if you only want to version an existing dataset, you should not need to configure the remote manually.
Resulting .dvc file:
This was more of a rough suggestion. It sounds like the local and remote info may be better kept separate. I'm hoping the team can discuss and offer a better schema.
Should it look the same as the Local example above?
Yes.
For a version_aware remote, it's basically the same use cases as cache remotes but more readable (for example, https://github.com/iterative/iterative.ai/issues/690).
For a worktree remote, here are some possible use cases:
Integration with non-DVC tools
Example: Iterating between a data labeling tool and DVC (for model training).
- I start with a dataset in the cloud and connect my data labeling tool to read from it.
- I use dvc add [--worktree?] ... to add it to my repo.
- Based on the model results, I modify my dataset and dvc push back to the cloud.
- I sync that data back to my labeling tool, then add some new data to my cloud storage outside of DVC and label it.
- I use dvc update ... to fetch the latest data and labels and retrain (see the sketch after this list).
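A hedged sketch of the command sequence this loop implies; flags and paths are placeholders, not finalized syntax (whether dvc add grows a --worktree flag is exactly what is being discussed here):

```bash
# Track the existing cloud dataset in the repo (exact form/flag is TBD).
dvc add s3://mybucket/labeled-data        # "[--worktree?]" per the item above

# After modifying the dataset based on model results,
# push the changes back to the cloud location the labeling tool reads from.
dvc push

# New data and labels were added in the cloud outside of DVC:
# fetch the latest versions before retraining.
dvc update labeled-data.dvc
```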
Integration with non-DVC users
Example: Making a data registry where consumers don't need to know DVC.
- Start with an existing data registry repo that has everything stored in a cache remote.
- Add and push to a worktree remote with dvc remote add ... -> dvc remote modify ... -> dvc push -r ....
- Consumers with read-only access get the latest data from cloud storage without DVC (using the aws cli, boto, etc.). They don't have to know that the data registry is backed by a Git repo or uses DVC (users who need to specify a version of the dataset can still use dvc import, but others may be fine to always use the latest version).
- To update the data, someone still goes through the normal data registry process of submitting a PR with the changes and pushing the data to the cache remote. When the PR is merged, the updated data is pushed to the worktree remote (a sketch follows this list).
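A hedged sketch of the producer/consumer split described above (remote name, bucket, and paths are hypothetical):

```bash
# Producer side: add a worktree remote to the existing registry repo and
# push a human-readable copy of the approved data for consumers to read.
dvc remote add consumers s3://my-registry-public
dvc remote modify consumers worktree true
dvc push -r consumers

# Consumer side: read the latest data with plain cloud tooling, no DVC needed.
aws s3 cp s3://my-registry-public/datasets/labels.csv .
```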
Use pipelines while keeping data in the cloud (#8411)
Example: I want to use DVC pipelines, but I don't want to move my data that is already in the cloud or refactor my code that reads/writes from/to the cloud directly (or the data is too big to keep a local copy).
- Set up one or more stages where all of the data is directly in the cloud, like dvc stage add -n prep -d s3://mybucket/raw_data -o s3://prepped_data .... TBD the exact syntax; let's discuss in #8411. Unlike current external outputs, there's no need to set up a remote cache. This should "just work" like local outputs, with at most an additional flag needed in dvc stage add (see the sketch after this list).
- Use dvc repro/exp run to run the pipeline and capture the version IDs in dvc.lock. The cloud-versioned outputs in dvc.lock should look similar to what's in the .dvc files for cloud-versioned datasets.
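A hedged sketch of such a pipeline; the exact syntax is TBD per #8411, and the bucket and script names are hypothetical:

```bash
# Define a stage whose dependency and output both live directly in the cloud;
# the flag(s) needed for cloud-versioned outputs are still being discussed.
dvc stage add -n prep \
  -d s3://mybucket/raw_data \
  -o s3://mybucket/prepped_data \
  python prep.py

# Run the pipeline and record the resulting cloud version IDs in dvc.lock.
dvc repro
```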
If I understand correctly, Integration with non-DVC users and Integration with non-DVC tools (not sure about pipelines) can be applied to regular version_aware remotes.
@dberenbaum I'm not sure if we have already a place in docs for this, but we can move your examples there.
There are some critical differences:
- What a non-DVC user sees in the current version on cloud may not match the latest versions pushed from DVC.
- There's no way to both push changes to a DVC remote and retrieve non-DVC updates from that same location (you can either push to a cloud-versioned remote or use import-url to get updates for non-DVC data).
Thoughts on the different scenarios mentioned above:
Integration with non-DVC tools
Example: Iterating between a data labeling tool and DVC (for model training).
For most use cases, it seems like the better path here is to use DVC to track the annotations and not bother tracking the raw data, which is often immutable/append-only.
I also don't see that tools like Label Studio currently support cloud version IDs.
Integration with non-DVC users
Example: Making a data registry where consumers don't need to know DVC.
With or without worktree, we probably need both: somewhere producers can push changes for review without modifying what consumers see, and another place with the latest approved versions for consumers. I think the workflow here could be accomplished by having a CI/CD job sync the latest versions to a separate consumer-facing URL when new data is merged into the main branch (or we could consider something like export-url again to do this within DVC); a rough sketch follows.
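A hedged sketch of such a CI job (bucket names and paths are placeholders):

```bash
# Runs after a data PR is merged: pull the approved data from the DVC remote
# and mirror it to the consumer-facing bucket that non-DVC users read directly.
git checkout main
dvc pull data/
aws s3 sync data/ s3://my-registry-public/data/ --delete
```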
Pipelines
This one might make sense to support in the future with version_aware (see #8411) since I don't see a great alternative.