`dvc push` doesn't update cloud info with cloud versioned remotes
Context: https://github.com/iterative/dvc/discussions/9907#discussioncomment-6990738
Let's say we have two remotes:
['remote "dev"']
url = s3://yury-cloud-versioning-test/test-dev
version_aware=true
['remote "prod"']
url = s3://yury-cloud-versioning-test/test-prod
version_aware=true
Let's say we need to migrate data from one to another.
I would expect commands like this:
dvc pull -r dev
dvc push -r prod
to work and update the .dvc / dvc.lock files with the appropriate info. In reality I'm getting:
(.venv) √ Projects/test-cloud-versioned % git diff
diff --git a/test.txt.dvc b/test.txt.dvc
index d5fbaa9..03f455c 100644
--- a/test.txt.dvc
+++ b/test.txt.dvc
@@ -4,6 +4,6 @@ outs:
   hash: md5
   path: test.txt
   cloud:
-    dev:
+    prod:
       etag: d8e8fca2dc0f896fd7cb4cb0031ba249
       version_id: UK3s0VcueuAIttMw7FROG8pRospYWNQI
(Only the remote name is updated; the etag and version_id stay the same, which is wrong for that remote.)
In the original issue, even the object is not pushed to the new remote.
Also, with cloud versioning I don't think the names prod / dev make much sense in .dvc / dvc.lock. A version_id is unique (I assume) and can't repeat in a different location. I guess we need to use some hash, or the location itself, in this case. How do we use these names at all? Do we expect that specific remote name to exist in a config?
Related: https://github.com/iterative/dvc/issues/8356 https://github.com/iterative/dvc/pull/8862
@skshetry Can you remember what the expected behavior is here? Should we be overwriting the remote info? Or disallowing this operation?
> How do we use these names at all? Do we expect that specific remote name to exist in a config?
Yes, they are tied to the remote name defined in that git commit's .dvc/config.
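To illustrate what "tied to the remote name" implies, here is a minimal sketch that checks whether the remote names recorded under `cloud:` are defined in a config like the one above. It uses Python's `configparser` as a stand-in and is not how DVC itself reads its config; `configured_remotes` and `unknown_remotes` are hypothetical helper names.

```python
import configparser
import re

# Config text copied from the issue description.
CONFIG_TEXT = """\
['remote "dev"']
url = s3://yury-cloud-versioning-test/test-dev
version_aware=true
['remote "prod"']
url = s3://yury-cloud-versioning-test/test-prod
version_aware=true
"""


def configured_remotes(config_text):
    """Map remote name -> url for every ['remote "<name>"'] section."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    remotes = {}
    for section in parser.sections():
        # Section names arrive with their surrounding single quotes.
        match = re.fullmatch(r'remote "(.+)"', section.strip("'"))
        if match:
            remotes[match.group(1)] = parser[section]["url"]
    return remotes


def unknown_remotes(cloud_info, config_text):
    """Return remote names used in `cloud:` that the config does not define."""
    known = configured_remotes(config_text)
    return sorted(name for name in cloud_info if name not in known)
```

For example, `unknown_remotes({"prod": {}, "staging": {}}, CONFIG_TEXT)` flags only the `staging` entry, since `prod` is defined in the config.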
> Can you remember what the expected behavior is here?
We should be pushing the file to the prod remote and updating the version ID and etag here.
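A rough sketch of that expected outcome, in plain Python rather than DVC's internals: the `prod` entry should carry whatever etag and version ID the push to `prod` actually produced, not a copy of the `dev` values. The `record_push` helper, the shape of `push_result`, and the placeholder version ID are all assumptions made for illustration. Keeping the existing `dev` entry alongside `prod` is also an assumption.

```python
from copy import deepcopy


def record_push(out_entry, remote_name, push_result):
    """Return a copy of an `outs` entry whose cloud info for `remote_name`
    reflects the result of the push (hypothetical helper, not DVC API)."""
    updated = deepcopy(out_entry)
    cloud = updated.setdefault("cloud", {})
    cloud[remote_name] = {
        "etag": push_result["etag"],
        "version_id": push_result["version_id"],
    }
    return updated


# Entry as it looks after `dvc pull -r dev` (values from the issue).
entry = {
    "hash": "md5",
    "path": "test.txt",
    "cloud": {
        "dev": {
            "etag": "d8e8fca2dc0f896fd7cb4cb0031ba249",
            "version_id": "UK3s0VcueuAIttMw7FROG8pRospYWNQI",
        }
    },
}

# Placeholder result: a real push to prod would return its own version ID.
result = {
    "etag": "d8e8fca2dc0f896fd7cb4cb0031ba249",
    "version_id": "PROD-VERSION-ID-PLACEHOLDER",
}
after = record_push(entry, "prod", result)
```

The point is only that the version ID under `prod` differs from the one under `dev`, because each versioned location assigns its own.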
I'm guessing this is an index bug caused by both remotes using the same S3 bucket. It looks like the user from the original discussion/context is using dvc==3.0.0, so this may just be a duplicate of https://github.com/iterative/dvc/issues/9904 (which is fixed in the latest release).
@pmrowla I think I was able to reproduce it on the recent DVC version.
I can also reproduce this with a single remote by changing the remote path. DVC will update the remote name but won't notice that it needs to push again.
It seems like DVC isn't checking whether the version IDs actually exist. I remember discussing this in https://github.com/iterative/dvc/pull/8766, but I think @pmrowla rightly saw it as dangerous and we continued to check which versions were available.
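The missing check could look roughly like the sketch below. This is not DVC's code: `needs_push` is a hypothetical helper, and `version_exists` is an injected callable that, for S3, might wrap a HEAD request carrying the version ID. Here it is simulated with in-memory sets so the example runs without network access.

```python
def needs_push(cloud_info, remote_name, version_exists):
    """Decide whether an output still has to be pushed to `remote_name`.

    `version_exists` is a callable taking a version ID and returning True
    if that version is present at the remote's current location.
    """
    info = cloud_info.get(remote_name)
    if not info or "version_id" not in info:
        return True  # nothing recorded for this remote yet
    # Recorded metadata is only trusted if the remote confirms the version.
    return not version_exists(info["version_id"])


# Simulated remotes: the version from the issue exists under dev only.
dev_versions = {"UK3s0VcueuAIttMw7FROG8pRospYWNQI"}
prod_versions = set()

# State from the issue: the name was switched to prod, the info was not.
cloud_info = {
    "prod": {
        "etag": "d8e8fca2dc0f896fd7cb4cb0031ba249",
        "version_id": "UK3s0VcueuAIttMw7FROG8pRospYWNQI",
    }
}

stale = needs_push(cloud_info, "prod", prod_versions.__contains__)
```

With this check, the renamed entry is treated as stale (`stale` is `True`) because `prod` has no such version, which also covers the single-remote case where only the remote path changes.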