dvc icon indicating copy to clipboard operation
dvc copied to clipboard

add support for tracking remote dataset

Open skshetry opened this issue 1 year ago • 2 comments

Lot of things are still missing:

  1. Pipelines don't yet work.
  2. No support for dvc stage add -d dvcx:// yet.
  3. Can only update dvc.yaml and dvc.lock in current working directory.
  4. No API to get datasets from the repository, which will be necessary for Studio support.
  5. CLI is rudimentary.

asciicast

skshetry avatar Feb 07 '24 13:02 skshetry

UX looks good!

There have been discussions before about consolidating the import syntax (cc @efiop) and wondering if we should take the opportunity to do it here. For example:

$ dvc ds add dvcx://dogs@v3
$ dvc ds add s3://cloud-versioning-demo
$ dvc ds add dvc://[email protected]:iterative/example-get-started.git@main#path=data/data.xml

dberenbaum avatar Feb 07 '24 22:02 dberenbaum

Notes from discussion with @skshetry:

  • Require a name for now
  • Treat datasets as frozen -- only possible to update using dvc ds update
  • Include version info only in dvc.lock, not dvc.yaml
  • Revisit https://github.com/iterative/studio/issues/8117

dberenbaum avatar Feb 08 '24 14:02 dberenbaum

Codecov Report

Attention: Patch coverage is 38.98734% with 241 lines in your changes are missing coverage. Please review.

Project coverage is 90.07%. Comparing base (89537a7) to head (f2db666).

Files Patch % Lines
dvc/repo/datasets.py 44.39% 133 Missing and 1 partial :warning:
dvc/commands/dataset.py 28.57% 75 Missing :warning:
dvc/dvcfile.py 24.00% 19 Missing :warning:
dvc/dependency/dataset.py 7.14% 13 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #10287      +/-   ##
==========================================
- Coverage   90.59%   90.07%   -0.52%     
==========================================
  Files         493      495       +2     
  Lines       37774    38161     +387     
  Branches     5460     5532      +72     
==========================================
+ Hits        34221    34375     +154     
- Misses       2919     3153     +234     
+ Partials      634      633       -1     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Feb 20 '24 13:02 codecov[bot]

@skshetry What's the status of this?

dberenbaum avatar Feb 21 '24 14:02 dberenbaum

This should be ready for review. The above asciinema is stale (although the only change is removal of --type with dvcx://, dvc:// and dvc+https:// schema). And support for pipelines with ds:// scheme.

eg:

dvc stage add -ntest -d ds://dataset_name "command"

To import data from dvc/git repositories, you'll need dvc:// schema to the repo url, e.g:

dvc://[email protected]/iterative/example-get-started.git
dvc+https://github.com/iterative/example-get-started.git

I'll try to update the asciinema recording tomorrow, and add more descriptions in the PR.

skshetry avatar Feb 21 '24 14:02 skshetry

Updated the asciinema recording above.

skshetry avatar Feb 21 '24 14:02 skshetry

The datasets section implemented in this PR is as a list in dvc.yaml. metrics/params/plots are also a list. But the artifacts section is a dictionary, which has a similar schema/format as datasets with type/path etc.

I find lists simpler and easier to use, with fewer indentations. But I am still undecided. So, I'll keep an option open for changing the datasets schema until we document it.

skshetry avatar Feb 22 '24 17:02 skshetry

Nice @skshetry! It works well for me in doing QA over various different types. I mostly have UI questions.

$ dvc ds add dvc://[email protected]:iterative/example-get-started.git@main#path=data/data.xml

Do you have thoughts on trying to incorporate the path and rev into the url somehow like this? It feels like we are now stuck needing both complex CLI options (--rev and --path) and a complex URL format (dvc+https://github.com...). I think we might be better either going all the way in on the URL format or else reverting back to --type (but then we still need --rev/--path only for dvc datasets).

Other questions:

  • Should --url be a positional arg since it's required?
  • For https and hdfs, how do we version them and what's the intended use? I tried dvc ds add --url https://dvc.org/ --name dvc-website and got ERROR: unexpected error - 'HTTPSFileSystem' object has no attribute 'coalesce_version'
  • remote://remote_name/path/to/file/or/dir (see dvc remote) - is there an example of how to use this?

dberenbaum avatar Feb 23 '24 15:02 dberenbaum

$ dvc ds add dvc://[email protected]:iterative/example-get-started.git@main#path=data/data.xml

I am not against supporting this. But I find this url to be very complex, and I doubt if I'll remember it when I need it.

Even if we do support both, I'd prefer keeping --rev/--path. I don't think it is complex or complicated to support.

Also, parsing these scp-style path is difficult IIUC as git@ is optional (cc @iterative/dvc, maybe you have some ideas). One idea that I have is to convert these scp-paths into ssh-paths and then parse, but I am not sure how robust this is, i.e

- [email protected]:iterative/example-get-started.git
+ ssh://[email protected]/iterative/example-get-started.git
  • Should --url be a positional arg since it's required?

yeah, both name and url should be a positional argument.

  • For https and hdfs, how do we version them and what's the intended use? I tried dvc ds add --url https://dvc.org/ --name dvc-website and got ERROR: unexpected error - 'HTTPSFileSystem' object has no attribute 'coalesce_version'

I am not sure about the intended use case. It might be useful in Studio to show if the source changed between commits from the metadata, but I am not sure how useful that information is. We could disable it for now.

  • remote://remote_name/path/to/file/or/dir (see dvc remote) - is there an example of how to use this?

This should just work without doing anything.

dvc remote add myremote s3://bucketname/prefix
dvc ds add --url remote://myremote/path/to/file/or/dir --name name

skshetry avatar Feb 23 '24 16:02 skshetry

I am not against supporting this. But I find this url to be very complex, and I doubt if I'll remember it when I need it.

Yup, it might be worse, especially if you think we should keep --path and --rev. Do you think we should revert to using --type then?

  • For https and hdfs, how do we version them and what's the intended use? I tried dvc ds add --url https://dvc.org/ --name dvc-website and got ERROR: unexpected error - 'HTTPSFileSystem' object has no attribute 'coalesce_version'

I am not sure about the intended use case. It might be useful in Studio to show if the source changed between commits from the metadata, but I am not sure how useful that information is. We could disable it for now.

Does it include a hash in dvc.lock? If the dataset changes, what happens if I do dvc repro with the old dataset?

dberenbaum avatar Feb 23 '24 17:02 dberenbaum

Yup, it might be worse, especially if you think we should keep --path and --rev. Do you think we should revert to using --type then?

Is --url dvc://[email protected]:iterative/example-get-started.git worse than --url [email protected]:iterative/example-get-started.git --type dvc?

Does it include a hash in dvc.lock?

It has an etag, but we don't have a hash of the content (eg: md5). To hash the content, you'd have to stream them from the remote.

If the dataset changes, what happens if I do dvc repro with the old dataset?

Since the import is frozen on dvc ds add or dvc ds update, we don't run any kind of checks during dvc repro. So it does not work with non-cloud-versioned remotes at all.

And non-cloud versioned remotes can not be frozen at all, so this will require a different solution, in which we check the sources during dvc repro, and update dvc.lock accordingly. This does not fit how the "dataset" is implemented, which requires to be frozen.

skshetry avatar Feb 23 '24 17:02 skshetry

I don't see a good reason to support non-versioned types like https, hdfs, and maybe even remote (I guess it depends if it's version_aware). It seems like external dependencies already support this scenario in a simpler way unless I'm missing something.

dberenbaum avatar Feb 23 '24 17:02 dberenbaum

I'll disable non-cloud-versioned remotes/filesystems in the successive PRs.

skshetry avatar Feb 23 '24 17:02 skshetry

Can you drop them from the parser help text in this PR? I was focused on that more than actually disabling them.

Edit: Or is there some case where https or hdfs is a valid type?

dberenbaum avatar Feb 23 '24 17:02 dberenbaum

Can you drop them from the parser help text in this PR? I was focused on that more than actually disabling them.

I have dropped them from the help text.

skshetry avatar Feb 24 '24 07:02 skshetry