import-url/update: add --no-download flag
Add a --no-download flag to dvc import-url/dvc update to only create/update .dvc files without downloading the associated data.
Closes #7918
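A minimal sketch of the intended usage (the URL and file names below are placeholders, not taken from this PR):
$ dvc import-url --no-download https://example.com/data.csv
# writes data.csv.dvc with the dependency info (etag/size) recorded,
# but does not download data.csv or compute its md5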
@jorgeorpinel I would appreciate your feedback on the flag naming here whenever you can 🙏
Waiting for https://github.com/iterative/dvc-objects/pull/103 before merging due to a pydrive issue causing CI to time out.
Nice @dtrifiro! I think the expected usage here is still a little unclear, so I'll try to clarify (cc @dmpetrov):
- dvc import-url --no-download foo should generate everything in the corresponding foo.dvc so that if I later run dvc pull, I can get foo without changing foo.dvc. Why this PR doesn't fully address it: a) it generates the deps info but not the outs, so actually downloading foo still changes foo.dvc; and b) there's no way to pull foo without having it saved to the DVC cache/remote (instead, it should act like dvc import where it can be pulled directly from source).
- There should be some way to check if the source has changed without touching foo.dvc. This could be in dvc update with a flag like --dry-run/--no-exec (see dvc import[-url] --no-exec) or in status with some flag (should dvc status --with-deps do this already? cc @skshetry). Why this PR doesn't fully address it: it still updates the deps in foo.dvc.
@dberenbaum
- dvc import-url --no-download foo should generate everything in the corresponding foo.dvc so that if I later run dvc pull, I can get foo without changing foo.dvc. Why this PR doesn't fully address it: a) it generates the deps info but not the outs, so actually downloading foo still changes foo.dvc; and b) there's no way to pull foo without having it saved to the DVC cache/remote (instead, it should act like dvc import where it can be pulled directly from source).
Not sure I agree about generating outputs here. In order to actually write the out's hash (md5), we'd have to download (stream) the data, defeating (imo) the purpose of having a --no-download flag.
It is a bit strange to have this "intermediate" .dvc file containing deps info but not outs info, and this shows in status: data is shown as missing in the cache (which I'd expect), but the checksum is also shown as changed (which feels odd at first, but makes sense, since the data is missing). Maybe this can be dealt with by adding a warning while waiting to implement a proper solution in data status? (cc @skshetry @efiop)
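For illustration, such an intermediate ("partial") .dvc file might look like the following; the etag, size, and paths here are invented and only meant to show that deps carry the source metadata while outs only record the path:
$ cat data.dvc
frozen: true
deps:
- etag: '0x8ABCDEF012345678'
  size: 1024
  path: remote://myremote/dir/data
outs:
- path: data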
- There should be some way to check if the source has changed without touching foo.dvc. [...]
I could add a --dry-run (or --check-only?) flag in update, although I'm wondering whether it would be more appropriate to have this in data status as well?
Not sure I agree about generating outputs here. In order to actually write the out's hash (md5), we'd have to download (stream) the data, defeating (imo) the purpose of having a --no-download flag.
How are we getting the checksum for the deps section without streaming it?
I see your point, but avoiding streaming the file is not a hard requirement here (that's what --no-exec is for). The only hard requirements are to capture the checksums for versioning, not save to disk, and be able to pull later (maybe --no-local is a better name than --no-download). Open to more discussion on this, but wanted to make clear the requirements.
Some other thoughts:
- It's like import-url --to-remote, except we shouldn't need to push to the remote if we have a static dataset (or in the future, if we record the cloud version ID).
- Does it make sense to think about import in addition to import-url?
I could add a --dry-run (or --check-only?) flag in update, although I'm wondering whether it would be more appropriate to have this in data status as well?
Good point, data status seems like an appropriate place for this. @skshetry What do you think (it shouldn't be part of the initial PR)?
How are we getting the checksum for the deps section without streaming it?
We're not: depending on the remote, we use various types of metadata (md5, etag, checksum, ...) to check whether the remote file has changed and needs to be updated.
Running dvc update after the initial import-url --no-download actually downloads the file, computes the md5 and updates the .dvc file with the computed md5.
Some clouds do offer the possibility of getting the md5 checksum (directly or through the etag), but this is not always available/consistent (e.g. on S3 it depends on whether the bucket is encrypted and whether the file was uploaded with multipart upload).
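For example, on S3 the metadata in question can be inspected directly; a hedged sketch using the AWS CLI (bucket and key are placeholders):
$ aws s3api head-object --bucket my-bucket --key path/to/file
# the response includes the object's ETag and ContentLength; whether the ETag is a
# usable md5 depends on encryption and on whether the object was uploaded with
# multipart upload, as noted above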
The only hard requirements are to capture the checksums for versioning, not save to disk, and be able to pull later.
This works already, although one has to use dvc update as dvc pull will not currently work since the md5 hash is not saved in the outs. Would this work?
When running dvc status after dvc import-url --no-download ..., we get something like this:
WARNING: stage: 'data.dvc' is frozen. Its dependencies are not going to be shown in the status output.
testfiles3.dvc:
changed outs:
deleted: data
changed checksum
Streaming the file to compute the md5 would get rid of changed checksum and make it possible to use dvc pull.
Thanks @dtrifiro!
The only hard requirements are to capture the checksums for versioning, not save to disk, and be able to pull later.
This works already, although one has to use dvc update as dvc pull will not currently work since the md5 hash is not saved in the outs. Would this work?
It seems like update isn't right here since I don't want to get the latest version but the frozen version I imported.
If we start saving the cloud version ID, a typical workflow would be:
1. Use Studio to add a model via import-url --no-download.
2. Overwrite that model with a second version on the cloud.
3. Get a copy of the first version that was added via import-url --no-download.
Is there a way to accomplish 3? Or, since we don't have cloud version ID support yet, is there a way to try to get the file and fail if it doesn't match the checksum?
When running dvc status after dvc import-url --no-download ..., we get something like this:
WARNING: stage: 'data.dvc' is frozen. Its dependencies are not going to be shown in the status output.
testfiles3.dvc:
changed outs:
deleted: data
changed checksum
Streaming the file to compute the md5 would get rid of changed checksum and make it possible to use dvc pull.
We at least need a way to show when the source contents change, which is what's missing now.
Discussed with @dtrifiro that there are several options.
1. Use dvc repro and fail if etags don't match. However, if a user does dvc pull, they will get an error that the file can't be found, which could be unexpected.
2. Use dvc pull to do the same but without relying on md5 lookup.
3. Use a new flag to import-url like --download/--finish to complete the import, but with a completely different syntax.
We can show how 1 works to confirm the workflow makes sense, and then prototype 2 since it seems users will expect this to work in dvc pull.
This touches on aspects of the cloud versioning discussion:
- Needing cloud version IDs to authoritatively track the source data.
- Needing to pull files without a cache entry.
cc @efiop @dmpetrov
I'm more inclined to get 1 and 2 (repro/pull) working. repro is easy, although I'm hoping to find a non-hacky way to also get it to work with pull.
As @dberenbaum mentioned, getting pull to work is closely related to the cloud versioning scenario: in both cases we'd need some other logic to retrieve files when we don't have them in an ObjectDB cache.
we'd have to download (stream) the data, defeating (imo) the purpose of having a --no-download flag.
Yes, not spending time on the download is the core requirement here. I see 3 options:
1. Avoid using md5s for such imports. Fully rely on etags.
2. dvc pull modifies the .dvc file.
3. dvc pull does not modify .dvc but temporarily saves it somewhere.
(1) is preferred. (2) & (3) are ok options.
Running dvc update after the initial import-url --no-download actually downloads the file, computes the md5 and updates the .dvc file with the computed md5.
Yes dvc import-url = dvc import-url --no-download & dvc update
Also, we need to be careful with dvc status and dvc exp run as Dave mentioned.
Yes dvc import-url = dvc import-url --no-download & dvc update
@dmpetrov I think we want dvc import-url = dvc import-url --no-download & dvc pull? In other words, we don't want to update the etag if it has changed. Instead, we want to fail in that case, right?
With the latest push, we have dvc pull support for what I call "partial imports" (imports which are missing outs checksums).
Running dvc pull downloads the partially-imported data (raising an error if the etag has changed) and updates the .dvc file with outs and stage metadata.
This could be changed so that the user has to explicitly dvc commit: modifying .dvc files on pull might be odd, although it also would be good to have pull "complete" the partial import with the user only having to git commit the .dvc file, without having to add another flag such as import-url --finish. I guess this really depends on the workflow though.
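A hedged sketch of the flow described here (URL and file names are placeholders):
$ dvc import-url --no-download https://example.com/data.csv
$ dvc pull
# downloads data.csv, erroring out if the source etag changed,
# and fills in the outs info in data.csv.dvc
$ git add data.csv.dvc
$ git commit -m "complete partial import"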
Yes dvc import-url = dvc import-url --no-download & dvc update
@dmpetrov
Currently, dvc import-url = dvc import-url --no-download && dvc pull. Running update, as @dberenbaum mentioned, would download the latest version of the file at the given URL and update the .dvc file with the most recent etag/checksums, whereas pull currently fails if the remote file has changed since the first import.
@dtrifiro Just tested this and it looks good! I put some comments above about naming.
The only other issue is that update --no-download currently leaves the outs section intact (and the workspace) so that you can end up with v1 in outs and v2 in deps. Later, if you do dvc pull, you will get v1. I think we should delete any outs info other than the path when doing update --no-download.
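A hedged sketch of that scenario (the URL is a placeholder; v1/v2 are two versions of the remote file):
$ dvc import-url --no-download s3://bucket/model.pkl
$ dvc pull                                 # materializes v1; outs now describe v1
# ...the file at s3://bucket/model.pkl is overwritten with v2...
$ dvc update --no-download model.pkl.dvc   # deps now point at v2, but outs still describe v1
$ dvc pull                                 # gets v1 again, because the stale outs were kept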
@pmrowla Could you please take a look as well? Since you are working on #8164 which is related.
What is the expected behavior of dvc commit when a repo has partial imports? Currently it will fail because there is no output file in the workspace
$ cat file.dvc
md5: 750c9e509a218cbbcde864ce15515c0d
frozen: true
deps:
- etag: '0x8DA79D118EC7839'
  size: 2
  path: remote://azure/dir/file
outs:
- path: file
$ dvc commit
outputs ['file'] of stage: 'file.dvc' changed. Are you sure you want to commit it? [y/n] y
ERROR: failed to commit - output 'file' does not exist
It seems to me that committing a partial import should just be a no-op and silently pass?
Also, this is semi-related to #8164, but IMO using dvc pull here to download and complete/un-partial the imports gets confusing, because subsequent dvc pulls do not download the import from its original source location. So dvc pull only works as a substitute for dvc update-with-etag-verification that one time. And for all subsequent downloads the user has to use dvc update anyways.
DVC will try to pull it from a regular remote (and dvc pull will outright fail if your repo does not have a remote configured).
Basically, given this scenario, I don't think it's obvious to the user why the 2nd and 3rd dvc pull's fail, but the final dvc update succeeds:
$ dvc import-url "https://raw.githubusercontent.com/iterative/dvc/main/README.rst" --no-download
Importing 'https://raw.githubusercontent.com/iterative/dvc/main/README.rst' -> 'README.rst'
...
# pull to complete the import (succeeds, downloads from original https://... source URL)
$ dvc pull
Importing 'https://raw.githubusercontent.com/iterative/dvc/main/README.rst' -> 'README.rst'
1 file fetched
# remove the output and cache
$ rm -rf README.rst .dvc/cache
# pull with no remote configured (fails due to no remote configured)
$ dvc pull
ERROR: failed to pull data from the cloud - config file error: no remote specified. Create a default remote with
dvc remote add -d <remote name> <remote url>
# add a dummy remote
$ dvc remote add -d empty-remote ../empty
Setting 'empty-remote' as a default remote.
# pull with remote configured (fails because it tries to download cached object from dummy remote and not original import URL)
$ dvc pull
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
name: README.rst, md5: ff649e3dae3038e81076e6e37dc7f57f
1 file failed
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
/Users/pmrowla/git/scratch/tmp/README.rst
Is your cache up to date?
<https://error.dvc.org/missing-files>
# use update (succeeds, downloads latest version from original https://... source URL)
$ dvc update README.rst.dvc
Importing 'https://raw.githubusercontent.com/iterative/dvc/main/README.rst' -> 'README.rst'
What is the expected behavior of dvc commit when a repo has partial imports? Currently it will fail because there is no output file in the workspace
$ cat file.dvc
md5: 750c9e509a218cbbcde864ce15515c0d
frozen: true
deps:
- etag: '0x8DA79D118EC7839'
  size: 2
  path: remote://azure/dir/file
outs:
- path: file
$ dvc commit
outputs ['file'] of stage: 'file.dvc' changed. Are you sure you want to commit it? [y/n] y
ERROR: failed to commit - output 'file' does not exist
It seems to me that committing a partial import should just be a no-op and silently pass?
What if there was something in the workspace?
What if there was something in the workspace?
It will give you the "output has changed are you sure you want to commit" prompt, and if you commit it we overwrite the outs section of the import with whatever is in the workspace. This is the same behavior you get if you try to commit existing imports though (whether or not you are using --no-download/partial imports)
For the record: discussed with @dberenbaum and @pmrowla that the commit part should be fixed in this PR, but the pull-related logic is more about import-url in general and we can get back to it in a followup.
@efiop Is commit logic even needed for this PR? Since commit fails now, it shouldn't introduce any breaking changes if we later start making commit a noop.
@dberenbaum Good point, for some reason I thought this wasn't broken before. Yeah, let's merge and move on then.
Thanks for a great discussion folks, the initial implementation seemed deceptively simple, but you did a great job here figuring it out properly. 🙏