
Mechanism to update a dataset w/o downloading it first

shcheklein opened this issue 3 years ago

Story

When I have a dataset with 1M images and want to update it (e.g. add one more image file), I currently have to download and checkout the previous version first. It takes a long time.

Request

Come up with a set of commands/options and a flow to do this efficiently w/o downloading data first

shcheklein avatar Oct 02 '20 01:10 shcheklein

I'm facing a similar issue and I think this can be easily fixed in the case where the entire folder is tracked by dvc.

If someone wants to upload a single image to a folder tracked by dvc:

  • calculate md5 of the upload
  • upload it to the remote
  • get the md5.dir from the directory
  • append the file md5 to the list.
  • upload the new md5.dir to the remote.

just 2 file uploads and 1 download.
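
To make these steps concrete, here is a minimal Python sketch. This is not DVC's actual implementation: the exact .dir JSON serialization DVC uses may differ, and a real remote would be accessed through its storage API rather than a local directory as done here.

# sketch.py - hypothetical illustration, not DVC's real code
import hashlib
import json
import os
import shutil

def file_md5(path, chunk_size=1024 * 1024):
    # Stream the file so huge files never need to fit in memory.
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def cache_path(root, name):
    # DVC-style content-addressed layout: <root>/<first 2 hex chars>/<rest>.
    return os.path.join(root, name[:2], name[2:])

def append_to_tracked_dir(remote_root, dir_md5, new_file, relpath):
    # 1. Hash the new file and "upload" it (here: copy into a local mirror).
    md5 = file_md5(new_file)
    dst = cache_path(remote_root, md5)
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    shutil.copy(new_file, dst)

    # 2. Download the .dir manifest: a JSON list of {"md5", "relpath"} entries.
    with open(cache_path(remote_root, dir_md5 + ".dir")) as f:
        entries = json.load(f)

    # 3. Append the new entry and re-serialize deterministically.
    entries.append({"md5": md5, "relpath": relpath})
    entries.sort(key=lambda e: e["relpath"])
    content = json.dumps(entries, sort_keys=True).encode()

    # 4. The manifest's own name is the md5 of its content plus ".dir";
    #    this new hash is what the .dvc file must point to afterwards.
    new_dir_md5 = hashlib.md5(content).hexdigest()
    dst = cache_path(remote_root, new_dir_md5 + ".dir")
    os.makedirs(os.path.dirname(dst), exist_ok=True)
    with open(dst, "wb") as f:
        f.write(content)
    return new_dir_md5

The traffic stays at exactly two uploads and one download, no matter how large the dataset is.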

Maybe the solution is to allow the user to run dvc pull --dir-only to download only md5.dir references. That's all that dvc needs to upload a new file to a folder.

MetalBlueberry avatar Oct 02 '20 06:10 MetalBlueberry

Fully agree with @MetalBlueberry: it would be nice to have an option to pull only directories from the remote (dvc pull --dir-only). Also, a high-level command for this case could simplify the user experience:

dvc add --merge-into data.dvc data/ 

which will perform the following operations:

  1. dvc pull --dir-only
  2. add new files to relevant dirs:
    i. update dir's content
    ii. remove previous md5.dir file (since the content of the file determines its name)
    iii. create new md5.dir file

As an additional feature, we could implement merging two different .dvc files for the same directory:

dvc add data/ --file data_update.dvc
dvc add --merge-into data.dvc data_update.dvc

This feature can be useful when teammates are working in different branches and have collected different data.
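
A minimal sketch of what such a merge could do with the two .dir manifests, assuming append-only semantics (the same relpath with two different hashes is a genuine conflict that would need user resolution):

def merge_trees(base, update):
    # base/update are .dir manifests: lists of {"md5": ..., "relpath": ...}.
    merged = {e["relpath"]: e["md5"] for e in base}
    for e in update:
        prev = merged.setdefault(e["relpath"], e["md5"])
        if prev != e["md5"]:
            raise ValueError("conflicting versions of " + e["relpath"])
    # Sort for a deterministic manifest, so its hash is reproducible.
    return [{"md5": h, "relpath": p} for p, h in sorted(merged.items())]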

@shcheklein @efiop, I can start the implementation of this feature, if you're okay with this view.

puhoshville avatar Oct 04 '20 13:10 puhoshville

@MetalBlueberry @puhoshville Great ideas!

My 2c:

dvc pull --dir-only

Usually we just automatically pull .dir files when we need them, without asking. E.g. https://github.com/iterative/dvc/blob/fcdb503b4eff1297e5c1c1ed18f5606fe780e481/dvc/output/base.py#L383, and we could probably do the same here, without adding a special --dir-only flag.

ii. remove previous md5.dir file (since the content of the file determines its name)

No need to remove the old .dir file. DVC doesn't ever remove cache files except during gc, as the cache might be used by multiple repo instances and you never know if that .dir file is being used by someone else. And, well, those are pretty tiny files anyway, so no need to be paranoid about cleaning them up during the merge.

@puhoshville I see that you use --merge-into data.dvc, which makes me think that you might be confusing the data with its metafile. The operation is adding something to a dir, so the command should reflect that, e.g. dvc add data --append, and the corresponding metafile could then be found automatically by dvc.

As additional feature we can implement merge of two different dvc files of same dir:

Hm, if I understand you correctly, you probably mean https://dvc.org/doc/user-guide/merge-conflicts#append-only-directories ?

efiop avatar Oct 04 '20 17:10 efiop

We could also create a separate command for it:

$ dvc append data.dvc file --path data/raw/file1
# append `file` as `data/raw/file1` into `data` directory (i.e. the `out` of `data.dvc`).

It will pull the .dir file, update it in the cache, and modify the data.dvc file.

skshetry avatar Oct 05 '20 12:10 skshetry

I like the idea of dvc pull --dir-only - it can bring in the entire dir structure, help users navigate the dir, and simplify the API as a result.

--merge-into seems like a possible solution, but it does not cover file removals. Also, it requires creating a separate dir structure while we could utilize the existing one (if I understand correctly, data does not necessarily match the actual dir name in dvc add --merge-into data.dvc data/).

Ideally, a user should not create a separate data dir and mess with dvc files (which might not exist if we decide to use run-cache for data files at some point). We should utilize the existing dir structure and regular CLI experience (cp, mv, ls) as much as we can.

It might look like:

$ dvc pull --dir-only data/
$ cp ~/Downloads/users-2020-10-05.tsv data/rawfiles/cust/
$ cp ~/Downloads/report-2020-10-01.tsv data/summaries/  # it can overwrite an existing file
$ dvc add --update data/ # or --append

# how to handle removes:
$ dvc add --remove data/users/update-2020-10.csv
$ dvc add --update data/

# apply all the changes
$ dvc push data/
$ git add data.dvc
...

Navigation

We might need separate functionality for navigating these virtual/partially-downloaded dirs - like list:

$ dvc pull --dir-only data/
$ dvc list --dir-only data/rawfiles/cust/ # without downloading files
users-2020-10-04.tsv
users-2020-10-02.tsv
users-2020-09-29.tsv
...
# Okay, now I know what to delete/replace
$ cp ~/Downloads/fixed-users.tsv data/rawfiles/cust/users-2020-10-02.tsv
$ dvc add --remove data/rawfiles/cust/users-2020-10-04.tsv
$ dvc add --update

Naming

Agree with @skshetry - a separate command might be helpful. But I'd not limit it to append. We also need remove, update, and navigate. It might look like dvc vdir add/remove/list.

PS: the navigation and file deletion are not part of this issue. But we should come up with an API that won't block us from these scenarios.

dmpetrov avatar Oct 05 '20 18:10 dmpetrov

Hey fellows, @shcheklein asked me to chime in on this ticket with my use case. So, here it is.

I have created a DVC Data Registry with the intention of updating it automatically whenever one of my edge devices decides to push data to it. It looks something like this:

(diagram: data registry use case)

What I have tried:

Initially, I thought I could just create a sessions dir in the Data Registry, and every time I pushed to it from any of the edge devices it should just work. That didn't pan out. Whenever one of the edge devices does a dvc add sessions/, it creates a sessions.dvc file, and when it's subsequently pushed to the GitHub data registry (git push) and remote storage (dvc push), only the most recent sessions.dvc file is visible, pointing to the most recently pushed data. In simple words, DVC does not merge datasets put into the same directory. One way to solve it would be for each edge device to first pull all the data, copy the local data into the pulled directory, and then finally push. One major problem that can arise is race conditions, e.g. if Machine-A and Machine-B start pushing data at the same time and Machine-A finishes first, Machine-B's push may get lost, be rejected, or be overwritten. Secondly, having every edge device pull data before pushing is really bandwidth- and storage-intensive, and just an overall naive solution to the problem at hand.

The other way I was guided on the discord channel was to add data with more granularity, instead of just adding a directory as a whole, so that each individual machine can push data to a directory independently. Something like this:

Data Registry
└── Session
    ├── Data_A_timestamp.dvc
    ├── Data_A_timestamp
    ├── ...
    ├── Data_C_timestamp.dvc
    ├── Data_C_timestamp
    └── ...

In this case, there is no monolithic sessions directory (no more sessions.dvc) tracked by DVC; instead, all the edge devices are doing something like dvc add sessions/Data_A_timestamp, which generates its respective .dvc file. This also has one problem: edge devices still need to clone the data registry from GitHub and keep it updated locally for a successful push.

I have not settled on a definitive answer to this problem. Hopefully, adding my two cents to this issue will help in finding one.

RafayAK avatar Oct 15 '20 09:10 RafayAK

Hello again!

We've developed a system to upload files to a dvc-tracked folder without actually using dvc. It basically performs the steps described in my earlier comment.

Seeing this post, it looks like more people are interested in having cron jobs upload data to dvc-tracked repositories from edge devices. Is it worth developing a small CLI app for this purpose?

  • no python/git dependencies
  • focus on optimal file upload
  • delete uploaded files so it can run on devices with small HDDs.

If I find people interested in this, I will be really happy to implement it.

MetalBlueberry avatar Nov 09 '20 11:11 MetalBlueberry

@MetalBlueberry that sounds great. I would love to have that feature in DVC or try out your system (if you've open-sourced it). If any help is required in implementing this feature, I'm completely up for it.

RafayAK avatar Nov 09 '20 11:11 RafayAK

@MetalBlueberry you got me intrigued :) Could you share some details please? Especially regarding the no-python/no-git dependency. My understanding is that at some point, to make DVC-tracked data accessible, we have to save the new checksum somewhere (usually in Git, via the .dvc file). Otherwise, even if we, let's say, push the new, modified .dir, it's not clear how we read it later: how does a downstream system "understand" that there is some change?

@RafayAK thanks for taking the time to share the use case, btw. We are working right now on a POC for this problem. I can't promise that it will be no-git/no-python, but at least an update won't require pulling the whole dataset. If multiple machines update it simultaneously, one of them will get a Git conflict and will need to run the operation again. Would that be a reasonable workflow for you?

shcheklein avatar Nov 10 '20 01:11 shcheklein

These are the steps:

  • calculate md5 of the upload
  • upload it to the remote
  • get the md5.dir from the directory
  • append the file md5 to the list
  • upload the new md5.dir to the remote

I'm planning to use the following packages:

  • "crypto/md5" for the checksum
  • "https://github.com/google/go-cloud /blob" upload/download of blob files. handles different providers like DVC
  • "https://github.com/src-d/go-git" to replace git cli
  • "https://github.com/go-ini/ini" to parse dvc configuration

So the idea is to download the git repo in memory with go-git, then parse the .dvc/config to discover the remote configuration. From there we establish the connection using go-cloud/blob. All the other steps should be straightforward.
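
For reference, the .dvc/config that needs parsing is a small INI-style file, typically along these lines (remote name and URL here are placeholders):

[core]
    remote = storage
['remote "storage"']
    url = s3://mybucket/dvcstore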

The good thing about doing this in Go is that the program can be compiled into a small binary, around 25 MB, that doesn't require any dependency. Not even Python or Git.

MetalBlueberry avatar Nov 10 '20 07:11 MetalBlueberry

@MetalBlueberry got it, it makes sense. Clearly it won't be easy to make it part of DVC, but I think the whole community can benefit and it would be cool to see some parts of DVC written in Go.

shcheklein avatar Nov 10 '20 07:11 shcheklein

@MetalBlueberry very cool way of updating the repo. I like it. One question though: when you add a new piece of data into your DVC-tracked directory, the hash of the directory should change. How will you retrieve the old hash of the directory? I suppose calculating the hash without the new additions would do the trick, but then at some point wouldn't you also have to update the hash of the tracked directory in Git/GitHub? Also, race conditions could become a problem if the md5.dir is being changed by multiple machines.

@shcheklein I would love to check out your POC whenever it's ready. As for resolving Git conflicts, I think your solution may work out. One more thing to add: I was going through the DVC docs and stumbled on "Append only directories" for DVC; do you think this is a better way than resolving merge conflicts directly?

RafayAK avatar Nov 10 '20 10:11 RafayAK

@RafayAK yes! I think it's a good solution for merging append-only dirs! It doesn't solve the problem of pulling the whole dataset, though.

shcheklein avatar Nov 11 '20 02:11 shcheklein

@RafayAK here is your answer: https://github.com/MetalBlueberry/dvc-uploader

As promised, I've implemented a simple CLI to upload data to a dvc-tracked folder. The current status is POC, but I would like to get feedback from you ASAP. Also, we can continue the discussion in that repository, because this is not the right place for a long discussion: https://github.com/MetalBlueberry/dvc-uploader/issues/1

MetalBlueberry avatar Nov 14 '20 20:11 MetalBlueberry

Hello everyone! 👋

I've created a POC of how we could potentially integrate such functionality into DVC (based on all of the discussion here so far). The flow would essentially be:

# new workflow
$ dvc vdir pull data
$ dvc vdir cp --local-data ~/my-personal-dataset/dog.1_000_001.jpg ./data/validation/

# usual workflow
$ dvc push data
$ git add data.dvc
$ git commit -m "update validation set"
$ git push

I have an initial working version in #4900 that you can start testing. You can also take a look at the PR description, which explores this new concept in much more depth.

Looking forward to your thoughts on it!

Cheers!

BurnzZ avatar Nov 16 '20 13:11 BurnzZ

Hi all, just chiming in to say that I would really love it if this was implemented!! I'm currently using DVC for just data management, and hoping to avoid the long download before adding e.g. new results files to the model results directory.

katherinelauck avatar Feb 23 '21 01:02 katherinelauck

Hello everybody! I'd really like to have this (especially the use case described in the OP). What's the status of this issue? It seems there hasn't been much progress since PR #4900. Is there any news or plans regarding this issue?

nik123 avatar May 04 '21 04:05 nik123

@nik123 Thank you for your interest! Unfortunately, there is no active progress right now. We are keeping it in mind while reworking our architecture, but it is pretty clear that this will require serious research and design before we can start implementing.

efiop avatar May 04 '21 15:05 efiop

I'd also be interested in this feature. FWIW, I was imagining something like https://github.com/iterative/dvc/issues/4657#issuecomment-702560232 but as an option to dvc add. I could also see the argument for making a new command, but as a new user, the first place I looked was dvc add.

rtimpe avatar Oct 28 '21 22:10 rtimpe

Hello, for us this would be a critical feature: when a new annotated batch of images is done, it is sent to a GitLab CI pipeline that downloads the new data, uses DVC to update the existing dataset, and pushes the new dataset.

Without this feature, our pipelines will get slower and slower and require more and more storage as the dataset grows. This is a huge irritant, since we now have to worry about data storage for a pipeline that ultimately will not store the data long term (it's just needed during the update job).

It might even prevent the pipeline from running entirely if the dataset becomes too big, as storage capacity will be reached...

In my opinion, this limitation goes against the main purpose of using DVC, since you have to carry your whole dataset around.

Any update on this?

AlexandreBrown avatar Feb 18 '22 01:02 AlexandreBrown

@AlexandreBrown Hey. Thanks for your interest! We are currently working on some pre-requisites for that feature (e.g. we need a nicer way to represent this virtual data structure so we could modify it). I expect this to be available for some early testing in the next month or so. So please stay tuned :)

efiop avatar Feb 18 '22 10:02 efiop

I too would really appreciate this feature!

Otherwise, I'm stuck between two infeasible options:

  1. Download the entire directory (which in some cases is 100s of GBs and generally not interesting to a particular user)
  2. Track each file with its own .dvc file, which I've learned the hard way is unbelievably slow.

gamis avatar Apr 06 '22 19:04 gamis

I have a dataset that I need to update on a daily basis, but the instance on which DVC runs to update it has only 100 GB of disk space and is short-lived (killed by the end of the day), while my total dataset is around 8 TB.

I cannot afford to pull 8 TB of data every day just to update 3-4 GB of this dataset. For this reason, and seeing that this is still impossible to achieve with DVC, I will have to just ditch it.

EKami avatar Apr 27 '22 15:04 EKami

There's a way upstream in dvc-data to update a .dir file without downloading anything else, but it requires you to:

  1. download the .dir file locally into the cache,
  2. handle pushing the new .dir file and the newly added files to the remote yourself, and
  3. update the hash in the .dvc file manually.
You tell dvc-data how to update the .dir file by writing a patch file in JSON format, in which you provide a list of operations to perform on the .dir file:

# patch.json
[
  {"op": "remove", "path": "test/0/00004.png"},
  {"op": "move", "path": "test/1/00003.png", "to": "test/0/00003.png"},
  {"op": "copy", "path": "test/1/00003.png", "to": "test/1/11113.png"},
  {"op": "test", "path": "test/1/00003.png"},
  {"op": "add", "path": "local/path/to/patch.json", "to": "foo"},
  {"op": "modify", "path": "local/path/to/patch.json", "to": "bar"}
]

The paths for add/modify operations are relative to the patch file.

Then, using the dvc-data CLI, you can run the following, which will generate an updated .dir file:

$ dvc-data update-tree <.dir-short-hash> <json_patch_file>

eg:

$ dvc-data update-tree f23d4 patch.json
object 30a856795d9872289fa45530f40884f9.dir
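
For step 3, the hash printed above is what you then put into the .dvc file by hand. A sketch of the resulting file (fields such as size and nfiles may also be present, depending on your DVC version, and would need updating too):

# data.dvc, hand-edited to point at the new tree
outs:
- md5: 30a856795d9872289fa45530f40884f9.dir
  path: data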

Note that this is a very recent and experimental change; it might get changed or removed without any notice, and the UX might be clunky at the moment. I am sharing it in the hope that someone might find it useful. :)

For now, this might need https://github.com/iterative/dvc-data/pull/82 to work properly.

skshetry avatar Jul 04 '22 17:07 skshetry

Thanks @skshetry! Great to know it's finally at least possible.

In the long-term (not prioritized yet, so no timeframe available), this workflow should happen automatically by being able to:

  1. Granularly add/remove/modify directory contents.
  2. Automatically pull the .dir "virtual directories."

For example:

# Create, track, and push a directory
$ mkdir data
$ touch data/foo
$ touch data/bar
$ dvc add data
Added data
$ dvc push
3 files pushed.

# Delete from workspace and local cache
$ rm -rf data/*
$ rm -rf .dvc/cache

# Add new file to a directory without pulling
$ touch data/baz
$ dvc add data/baz
Added data/baz

# Remove a file from the directory without pulling
$ dvc remove data/bar

# Modify a file in the directory by only pulling that file
$ dvc pull data/foo
$ echo foo >> data/foo
$ dvc add data/foo
Added data/foo

dberenbaum avatar Jul 05 '22 16:07 dberenbaum

  1. Automatically pull the .dir "virtual directories."

Note that the only reason dvc-data cannot handle pulling and pushing the .dir file is that it does not understand dvc's config. We can implement that in dvc itself trivially (with the patch.json-like approach I mentioned above).

Of course, if we want to go with the granular method, the solution is more complex, and the issue is much larger than just updating a virtual directory.

skshetry avatar Jul 05 '22 17:07 skshetry

Pulling/pushing the .dir files can be a good first step, like push --dir-only and fetch --dir-only. It's not enough in the long-term because the UX for how to make updates to those directories would still be complicated. Granular directory handling also makes other directory use cases simpler (users shouldn't have to know or decide whether to track as files or directories).

dberenbaum avatar Jul 05 '22 17:07 dberenbaum

In our case, we solved many of our issues (such as updating without loading the dataset entirely) by switching from DVC to Iterative's new (alpha) tool ldb, and we're really happy so far.
One might benefit from checking whether this other amazing tool from Iterative could suit their needs.

AlexandreBrown avatar Jul 05 '22 17:07 AlexandreBrown

Would love a feature like this.

lpkoh avatar Oct 30 '22 05:10 lpkoh

A proposal for this (adapted from https://www.notion.so/iterative/Auto-Manage-Directories-in-DVC-cf0b318c09384e40b4304b9434db3c5f for visibility) is to allow granular add, modify, and remove operations (to be prioritized in that order) on DVC-tracked directories.

Edit: note that this mostly summarizes the discussion above into one doc.

Granular dataset operations

DVC should automatically track nested directories internally and manage overlaps with existing paths (similar to Git).

$ mkdir data
$ dvc add data
Added data

# Add new file to directory
$ touch data/foo
$ dvc add data/foo
Added data/foo

# Modify a single file in a directory
$ echo foo >> data/foo
$ dvc add data/foo
Added data/foo

# Remove file from directory
$ dvc remove data/foo
Removed data/foo

Virtual directories

Use granular dataset operations to work with virtual directories even if the directory's contents aren't available in the workspace. Assume you start with a data.dvc file, but the data dir is empty because you haven't pulled yet. DVC should enable you to:

# Add new file to empty tracked directory
$ cp newfile data/newfile
$ dvc add data/newfile
Added data/newfile

# Checkout/pull a file and modify it
$ dvc pull data/file1
$ echo newdata >> data/file1
$ dvc add data/file1
Added data/file1

# Stop tracking a file from an empty directory
$ dvc remove data/file2
Removed data/file2

Implementation notes

For virtual directory operations (where the full directory contents don’t exist in the local workspace), it will be necessary to have the contents of the .dir JSON files to know the expected contents of the directory and build the modified .dir file.
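
For reference, a .dir file is a JSON list mapping each file's hash to its path inside the directory, along these lines (illustrative hashes; the exact fields can vary between DVC versions):

[
  {"md5": "d3b07384d113edec49eaa6238ad5ff00", "relpath": "file1"},
  {"md5": "c157a79031e1c40f85931829bc5fc552", "relpath": "file2"}
]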

These could be downloaded from the remote as needed automatically by DVC, or users might be required (or have the option) to download these .dir files with a new command like dvc fetch --dir-only.

dberenbaum avatar Jan 05 '23 16:01 dberenbaum