How to update the data in a data registry within another project
I went through the doc for data registry at https://dvc.org/doc/use-cases/data-registry but I am still not clear about how to update the registry.

If I understand correctly, for a project that imports data from a data registry (which is a Git repo): if I change the data in that project and run `dvc add changed_data`, what I change is the data for my project, not the repository of the data registry. And when I run `git commit`, what I commit to is my project's Git repo, not the registry's Git repo.

How could I push the new data change back to the original data registry?
I got an answer on Discord saying:

> Unfortunately, there is not currently a very seamless way to push back the changes. You would need to copy the data back to the data registry repo and push it from there.
So I think this could be a feature request then...
Maybe a workaround is to use the data registry in a project in a way that is similar to a Git submodule, and use something like `git sparse-checkout` to check out only the data for that specific project from the data registry as a submodule?
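For what it's worth, a sparse checkout of a registry repo could look roughly like this (the repo URL and dataset path are made up; note that .dvc files are tiny, so this mainly matters if the registry also keeps large files in Git):

```shell
# Clone Git metadata only, without materializing every dataset directory
git clone --filter=blob:none --sparse https://github.com/example/data-registry
cd data-registry

# Materialize just the one dataset directory this project uses
git sparse-checkout set datasets/images
```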
@shelper
The data registry is a separate project from your training repo, and the training repo should be treated as a "consumer" of the data rather than something that updates the registry repo. So the changes should be applied to the data registry repo, and then, using `dvc update`, the data should be updated in your training repo. As I understand it, you would like to update your data registry from your training repo?
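A minimal sketch of that recommended flow (repo names and paths are hypothetical):

```shell
# In a clone of the data registry: update the dataset and share it
cd data-registry
dvc add datasets/images                  # re-track the changed data
git commit datasets/images.dvc -m "Update images dataset"
dvc push                                 # upload the new data to remote storage
git push

# In the training repo: bring the imported dataset up to date
cd ../training-project
dvc update images.dvc
```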
@pared that's right, I wonder if it is feasible (and reasonable) to update the data registry from the training repo.
I guess the right question to ask here is why not update the data registry? Is there a use case where one cannot update the data registry, and should be able to update it from another repo?
> I guess the right question to ask here is why not update the data registry? Is there a use case where one cannot update the data registry, and should be able to update it from another repo?
I agree with you, but just a thought here: if the registry gets bigger, someone who only uses one dataset from the registry will still have to clone the whole thing locally and update it through GitHub.
Just in case the person accidentally contaminates other datasets in the registry, is there a way to isolate the datasets in the registry from each other?
Just a few points to clarify:
> I agree with you, but just a thought here: if the registry gets bigger, someone who only uses one dataset from the registry will still have to clone the whole thing locally and update it through GitHub.
A person can download only a single dataset to update it; there is no need to check out / pull all the datasets in the registry to update a single one.
> Just in case the person accidentally contaminates other datasets in the registry, is there a way to isolate the datasets in the registry from each other?
Yes, you can always update only a single dataset and do `git commit dataset1.dvc` + `git push`. It doesn't matter whether other datasets are updated or not, or whether you did `dvc push` for other datasets. This way nothing is contaminated. You are sharing exactly one new version of a specific dataset. Everything else stays safe.
Btw, in your specific case, where do you store your data - cloud, NAS, SSH, something else?
Hmm... please see my comments below, based on my understanding.
> A person can download only a single dataset to update it; there is no need to check out / pull all the datasets in the registry to update a single one.
This step above is done with `dvc import registry-url dataset-folder`, run in the project folder.
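For example (the registry URL and dataset path are hypothetical):

```shell
# In the project folder: import a single dataset from the registry.
# This downloads the data and writes images.dvc, which records both
# the registry repo and the dataset version it came from.
dvc import https://github.com/example/data-registry datasets/images
```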
> Yes, you can always update only a single dataset and do `git commit dataset1.dvc` + `git push`. It doesn't matter whether other datasets are updated or not, or whether you did `dvc push` for other datasets. This way nothing is contaminated. You are sharing exactly one new version of a specific dataset. Everything else stays safe.
Here above, the user needs to run the commands in the local registry repo, which means he/she has to clone the whole registry repo locally.
> Btw, in your specific case, where do you store your data - cloud, NAS, SSH, something else?
I believe it does not matter, but say it is in the cloud, for example. If what I understand is correct, as commented above, then here is what happens:
- Someone works in the project folder: `dvc import` from the remote registry repo, and run `dvc pull` to get the data from the cloud.
- He/she changes the data in the project.
- If he/she now wants to push the data change to the registry, he/she first runs `dvc push` to copy any new data to the cloud storage.
- He/she then clones the registry repo locally, copies the updated .dvc file from the project folder to the local registry repo folder, and runs `git commit` and `git push` (sketched below).
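Spelled out as commands, the last two steps might look something like this (all names are hypothetical, and this assumes the project and the registry push to the same DVC remote):

```shell
# In the project folder: upload the changed data to the shared remote
dvc push

# In a fresh clone of the registry: carry the updated .dvc file over
git clone https://github.com/example/data-registry
cp images.dvc data-registry/datasets/images.dvc
cd data-registry
git commit datasets/images.dvc -m "Update images dataset"
git push
```

One caveat: a .dvc file produced by `dvc import` also records the registry repo as a dependency, so copying it verbatim may not be quite right; the essential part to carry over is the new checksum of the data.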
If that is correct, my only concern is that it is a little tedious, because the user needs to switch between the project and the registry folders. I wonder if there could be a safer way to update the registry without leaving the project folder.

If that is the way it should work, then never mind.
> Here above, the user needs to run the commands in the local registry repo, which means he/she has to clone the whole registry repo locally.
That was my point. They need to do `git clone data-registry` (that should be fine, right?), but there is no need to do a full `dvc pull`. A specific dataset can be pulled (`dvc pull dataset1`) and then committed (`dvc commit dataset1`). It can be granular.
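In commands, roughly (the dataset name is hypothetical):

```shell
git clone https://github.com/example/data-registry
cd data-registry
dvc pull datasets/images.dvc      # fetch only this one dataset
# ... modify files under datasets/images ...
dvc commit datasets/images.dvc    # record the new checksums in the .dvc file
git commit datasets/images.dvc -m "Update images dataset"
dvc push datasets/images.dvc      # upload only this dataset's new data
git push
```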
There is also a way to avoid even a second copy of the data on the machine where the dataset is being updated. The data-registry repo can be set up, via `dvc cache dir`, to point to the cache of the project repo where `dvc import` was used. It's a bit more involved, but this way there will be no additional downloads, data copies, etc.
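A sketch of that setup, assuming the registry clone sits next to the project repo on the same machine (paths are hypothetical):

```shell
# In the registry clone: reuse the project's DVC cache instead of keeping
# a second copy of the data on this machine
cd data-registry
dvc cache dir ../training-project/.dvc/cache
```

Since the cache is content-addressed, anything the project has already pulled or added is deduplicated rather than copied again.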
> That was my point. They need to do `git clone data-registry` (that should be fine, right?)
That is fine in most cases, but it concerns me a little when some inexperienced GitHub user clones the registry repo and makes some changes to it unexpectedly, especially if he/she changes datasets other than the one he/she is using.
So I just wonder if there is a way to restrict the user to making changes only to the dataset he/she imports from the data registry (and by "make changes" I mean commit changes to the original data registry). That is why I thought it would be useful if we could directly commit the data changes (the .dvc file) to the data registry from within the project folder when the dataset is imported from the data registry.
I may be asking for something without having a strong reason for it, so I will close this issue for now.

Thanks for answering my concerns though :)
Hi @shelper! Please don't take the discussion above as a dismissal of your points. It's just an attempt to understand and gather info.
I think it makes sense that you want to update the data registry from the consumer repo, and I have heard of others asking for something similar. As you said, it forces inexperienced users to understand a lot about how to make changes in Git. It also feels broken to me that you have to manually copy either the data or at least the .dvc file/checksum from the consumer repo to the data registry repo.
I don't have a good solution for you now, but let's keep this open for discussion!
There's also a related proposal in #8066 for how to upload data back to a repo.
I second the wish for a more seamless way to upload the data. Why it's better than just copying, adding and pushing:
- less work (no need to clone the data registry repo or copy files). We have many dev servers where people run things, so they'd need to clone the registry on every server they work on.
- easier to do programmatically (especially if you have a pipeline that generates multiple files that need to be saved to the registry on every run, creating a new commit)
There is now support for a cross-repo registry of artifacts in DVC Studio that simplifies this workflow. Although it's branded as a Model Registry, it works with non-model artifacts as well, and we have discussed branding it more generically to clarify that it can work for non-models.
Of course, you may prefer an open-source solution, but Studio has several advantages over the current data registry approach that would be hard to replicate in DVC alone:
- Cross-repo: Studio provides a place to combine artifacts from all your projects, so you don't have to maintain a separate data registry repo. You can directly push updates to the consumer repos and use Studio as the centralized registry. You can also programmatically update the version numbers and/or stages of artifacts on each run.
- Metadata: DVC doesn't have a way to use structured metadata fields like those supported by the registry, or to search across repos, and it would be hard to do a good job with this in a CLI tool.
- Access: Studio can connect to your cloud storage and provide consumers with a temporary URL without giving them access to the entire cloud storage or requiring them to install or configure anything.