import-url/update: add --no-download flag
Add a --no-download flag to dvc import-url/dvc update to only create/update .dvc files without downloading the associated data.
Closes #7918
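A minimal sketch of the intended usage (the URL and file names below are placeholders, not taken from this PR):
$ dvc import-url --no-download https://example.com/data.csv
# writes data.csv.dvc with the dependency info (etag/size) recorded,
# but does not download data.csv or compute its md5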
@jorgeorpinel I would appreciate your feedback on the flag naming here whenever you can 🙏
Waiting for https://github.com/iterative/dvc-objects/pull/103 before merging due to a pydrive issue causing CI to time out.
Nice @dtrifiro! I think the expected usage here is still a little unclear, so I'll try to clarify (cc @dmpetrov):
- dvc import-url --no-download foo should generate everything in the corresponding foo.dvc so that if I later run dvc pull, I can get foo without changing foo.dvc. Why this PR doesn't fully address it: a) it generates the deps info but not the outs, so actually downloading foo still changes foo.dvc; and b) there's no way to pull foo without having it saved to the DVC cache/remote (instead, it should act like dvc import where it can be pulled directly from source).
- There should be some way to check if the source has changed without touching foo.dvc. This could be in dvc update with a flag like --dry-run/--no-exec (see dvc import[-url] --no-exec) or in status with some flag (should dvc status --with-deps do this already? cc @skshetry). Why this PR doesn't fully address it: it still updates the deps in foo.dvc.
@dberenbaum
- dvc import-url --no-download foo should generate everything in the corresponding foo.dvc so that if I later run dvc pull, I can get foo without changing foo.dvc. Why this PR doesn't fully address it: a) it generates the deps info but not the outs, so actually downloading foo still changes foo.dvc; and b) there's no way to pull foo without having it saved to the DVC cache/remote (instead, it should act like dvc import where it can be pulled directly from source).
Not sure I agree about generating outputs here. In order to actually write the out's hash (md5), we'd have to download (stream) the data, defeating (imo) the purpose of having a --no-download flag.
It is a bit strange to have this "intermediate" .dvc file containing deps info but not outs info, and this shows in status: data is shown as missing in the cache (which I'd expect), but the checksum is also shown as changed (which feels odd at first, but makes sense, since the data is missing). Maybe this can be dealt with by adding a warning while waiting to implement a proper solution in data status? (cc @skshetry @efiop)
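For illustration, such an intermediate ("partial") .dvc file might look like the following; the etag, size, and paths here are invented and only meant to show that deps carry the source metadata while outs only record the path:
$ cat data.dvc
frozen: true
deps:
- etag: '0x8ABCDEF012345678'
  size: 1024
  path: remote://myremote/dir/data
outs:
- path: data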
- There should be some way to check if the source has changed without touching foo.dvc. [...]
I could add a --dry-run (or --check-only?) flag in update, although I'm wondering whether it would be more appropriate to have this in data status as well?
Not sure I agree about generating outputs here. In order to actually write the out's hash (md5), we'd have to download (stream) the data, defeating (imo) the purpose of having a --no-download flag.
How are we getting the checksum for the deps section without streaming it?
I see your point, but avoiding streaming the file is not a hard requirement here (that's what --no-exec is for). The only hard requirements are to capture the checksums for versioning, not save to disk, and be able to pull later (maybe --no-local is a better name than --no-download). Open to more discussion on this, but wanted to make clear the requirements.
Some other thoughts:
- It's like import-url --to-remote, except we shouldn't need to push to the remote if we have a static dataset (or in the future, if we record the cloud version ID).
- Does it make sense to think about import in addition to import-url?
I could add a --dry-run (or --check-only?) flag in update, although I'm wondering whether it would be more appropriate to have this in data status as well?
Good point, data status seems like an appropriate place for this. @skshetry What do you think (it shouldn't be part of the initial PR)?
How are we getting the checksum for the deps section without streaming it?
We're not: depending on the remote, we use various types of metadata (md5, etag, checksum, ...) to check whether the remote file has changed and needs to be updated.
Running dvc update after the initial import-url --no-download actually downloads the file, computes the md5 and updates the .dvc file with the computed md5.
Some clouds do offer the possibility of getting the md5 checksum (directly or through the etag), but this is not always available/consistent (e.g. on S3 it depends on whether the bucket is encrypted and whether the file was uploaded with multipart upload).
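For example, on S3 the metadata in question can be inspected directly; a hedged sketch using the AWS CLI (bucket and key are placeholders):
$ aws s3api head-object --bucket my-bucket --key path/to/file
# the response includes the object's ETag and ContentLength; whether the ETag is a
# usable md5 depends on encryption and on whether the object was uploaded with
# multipart upload, as noted above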
The only hard requirements are to capture the checksums for versioning, not save to disk, and be able to pull later.
This works already, although one has to use dvc update as dvc pull will not currently work since the md5 hash is not saved in the outs. Would this work?
When running dvc status after dvc import-url --no-download ..., we get something like this:
WARNING: stage: 'data.dvc' is frozen. Its dependencies are not going to be shown in the status output.
testfiles3.dvc:
changed outs:
deleted: data
changed checksum
Streaming the file to compute the md5 would get rid of changed checksum and make it possible to use dvc pull.
Thanks @dtrifiro!
The only hard requirements are to capture the checksums for versioning, not save to disk, and be able to pull later.
This works already, although one has to use dvc update as dvc pull will not currently work since the md5 hash is not saved in the outs. Would this work?
It seems like update isn't right here since I don't want to get the latest version but the frozen version I imported.
If we start saving the cloud version ID, a typical workflow would be:
1. Use Studio to add a model via import-url --no-download.
2. Overwrite that model with a second version on the cloud.
3. Get a copy of the first version that was added via import-url --no-download.
Is there a way to accomplish 3? Or, since we don't have cloud version ID support yet, is there a way to try to get the file and fail if it doesn't match the checksum?
When running dvc status after dvc import-url --no-download ..., we get something like this:
WARNING: stage: 'data.dvc' is frozen. Its dependencies are not going to be shown in the status output.
testfiles3.dvc:
changed outs:
deleted: data
changed checksum
Streaming the file to compute the md5 would get rid of changed checksum and make it possible to use dvc pull.
We at least need a way to show when the source contents change, which is what's missing now.
Discussed with @dtrifiro that there are several options.
1. Use dvc repro and fail if etags don't match. However, if a user does dvc pull, they will get an error that the file can't be found, which could be unexpected.
2. Use dvc pull to do the same but without relying on md5 lookup.
3. Use a new flag to import-url like --download/--finish to complete the import, but with a completely different syntax.
We can show how 1 works to confirm the workflow makes sense, and then prototype 2 since it seems users will expect this to work in dvc pull.
This touches on aspects of the cloud versioning discussion:
- Needing cloud version IDs to authoritatively track the source data.
- Needing to pull files without a cache entry.
cc @efiop @dmpetrov
I'm more inclined to get 1 and 2 (repro/pull) working. repro is easy, although I'm hoping to find a non-hacky way to also get it to work with pull.
As @dberenbaum mentioned, getting pull to work is closely related to the cloud versioning scenario: in both cases we'd need some other logic to retrieve files when we don't have them in an ObjectDB cache.
we'd have to download (stream) the data, defeating (imo) the purpose of having a --no-download flag.
Yes, not spending time on the download is the core requirement here. I see 3 options:
1. Avoid using md5s for such imports. Fully rely on etags.
2. dvc pull modifies the .dvc file.
3. dvc pull does not modify .dvc but temporarily saves it somewhere.
(1) is preferred. (2) & (3) are ok options.
Running dvc update after the initial import-url --no-download actually downloads the file, computes the md5 and updates the .dvc file with the computed md5.
Yes dvc import-url = dvc import-url --no-download & dvc update
Also, we need to be careful with dvc status and dvc exp run as Dave mentioned.
Yes dvc import-url = dvc import-url --no-download & dvc update
@dmpetrov I think we want dvc import-url = dvc import-url --no-download & dvc pull? In other words, we don't want to update the etag if it has changed. Instead, we want to fail in that case, right?
With the latest push, we have dvc pull support for what I call "partial imports" (imports which are missing outs checksums).
Running dvc pull downloads the partially-imported data (raising an error if the etag has changed) and updates the .dvc file with outs and stage metadata.
This could be changed so that the user has to explicitly dvc commit: modifying .dvc files on pull might be odd, although it also would be good to have pull "complete" the partial import with the user only having to git commit the .dvc file, without having to add another flag such as import-url --finish. I guess this really depends on the workflow though.
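A hedged sketch of the flow described here (URL and file names are placeholders):
$ dvc import-url --no-download https://example.com/data.csv
$ dvc pull
# downloads data.csv, erroring out if the source etag changed,
# and fills in the outs info in data.csv.dvc
$ git add data.csv.dvc
$ git commit -m "complete partial import"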
Yes dvc import-url = dvc import-url --no-download & dvc update
@dmpetrov
Currently, dvc import-url = dvc import-url --no-download && dvc pull. Running update, as @dberenbaum mentioned, would download the latest version of the file at the given URL and update the .dvc file with the most recent etag/checksums, whereas pull currently fails if the remote file has changed since the first import.
@dtrifiro Just tested this and it looks good! I put some comments above about naming.
The only other issue is that update --no-download currently leaves the outs section intact (and the workspace) so that you can end up with v1 in outs and v2 in deps. Later, if you do dvc pull, you will get v1. I think we should delete any outs info other than the path when doing update --no-download.
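A hedged sketch of that scenario (the URL is a placeholder; v1/v2 are two versions of the remote file):
$ dvc import-url --no-download s3://bucket/model.pkl
$ dvc pull                                 # materializes v1; outs now describe v1
# ...the file at s3://bucket/model.pkl is overwritten with v2...
$ dvc update --no-download model.pkl.dvc   # deps now point at v2, but outs still describe v1
$ dvc pull                                 # gets v1 again, because the stale outs were kept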
@pmrowla Could you please take a look as well? Since you are working on #8164 which is related.
What is the expected behavior of dvc commit when a repo has partial imports? Currently it will fail because there is no output file in the workspace
$ cat file.dvc
md5: 750c9e509a218cbbcde864ce15515c0d
frozen: true
deps:
- etag: '0x8DA79D118EC7839'
  size: 2
  path: remote://azure/dir/file
outs:
- path: file
$ dvc commit
outputs ['file'] of stage: 'file.dvc' changed. Are you sure you want to commit it? [y/n] y
ERROR: failed to commit - output 'file' does not exist
It seems to me that committing a partial import should just be a no-op and silently pass?
Also, this is semi-related to #8164, but IMO using dvc pull here to download and complete/un-partial the imports gets confusing, because subsequent dvc pulls do not download the import from its original source location. So dvc pull only works as a substitute for dvc update-with-etag-verification that one time. And for all subsequent downloads the user has to use dvc update anyways.
DVC will try to pull it from a regular remote (and dvc pull will outright fail if your repo does not have a remote configured).
Basically, given this scenario, I don't think it's obvious to the user why the 2nd and 3rd dvc pull's fail, but the final dvc update succeeds:
$ dvc import-url "https://raw.githubusercontent.com/iterative/dvc/main/README.rst" --no-download
Importing 'https://raw.githubusercontent.com/iterative/dvc/main/README.rst' -> 'README.rst'
...
# pull to complete the import (succeeds, downloads from original https://... source URL)
$ dvc pull
Importing 'https://raw.githubusercontent.com/iterative/dvc/main/README.rst' -> 'README.rst'
1 file fetched
# remove the output and cache
$ rm -rf README.rst .dvc/cache
# pull with no remote configured (fails due to no remote configured)
$ dvc pull
ERROR: failed to pull data from the cloud - config file error: no remote specified. Create a default remote with
dvc remote add -d <remote name> <remote url>
# add a dummy remote
$ dvc remote add -d empty-remote ../empty
Setting 'empty-remote' as a default remote.
# pull with remote configured (fails because it tries to download cached object from dummy remote and not original import URL)
$ dvc pull
WARNING: Some of the cache files do not exist neither locally nor on remote. Missing cache files:
name: README.rst, md5: ff649e3dae3038e81076e6e37dc7f57f
1 file failed
ERROR: failed to pull data from the cloud - Checkout failed for following targets:
/Users/pmrowla/git/scratch/tmp/README.rst
Is your cache up to date?
<https://error.dvc.org/missing-files>
# use update (succeeds, downloads latest version from original https://... source URL)
$ dvc update README.rst.dvc
Importing 'https://raw.githubusercontent.com/iterative/dvc/main/README.rst' -> 'README.rst'
What is the expected behavior of dvc commit when a repo has partial imports? Currently it will fail because there is no output file in the workspace
$ cat file.dvc
md5: 750c9e509a218cbbcde864ce15515c0d
frozen: true
deps:
- etag: '0x8DA79D118EC7839'
  size: 2
  path: remote://azure/dir/file
outs:
- path: file
$ dvc commit
outputs ['file'] of stage: 'file.dvc' changed. Are you sure you want to commit it? [y/n] y
ERROR: failed to commit - output 'file' does not exist
It seems to me that committing a partial import should just be a no-op and silently pass?
What if there was something in the workspace?
What if there was something in the workspace?
It will give you the "output has changed are you sure you want to commit" prompt, and if you commit it we overwrite the outs section of the import with whatever is in the workspace. This is the same behavior you get if you try to commit existing imports though (whether or not you are using --no-download/partial imports)
For the record: discussed with @dberenbaum and @pmrowla that the commit part should be fixed in this PR, but the pull-related logic is more about import-url in general and we can get back to it in a followup.
@efiop Is commit logic even needed for this PR? Since commit fails now, it shouldn't introduce any breaking changes if we later start making commit a noop.
@dberenbaum Good point, for some reason I thought this wasn't broken before. Yeah, let's merge and move on then.
Thanks for a great discussion folks, the initial implementation seemed deceptively simple, but you did a great job here figuring it out properly. 🙏