dvc
dvc copied to clipboard
add support for tracking remote dataset
Lot of things are still missing:
- Pipelines don't yet work.
- No support for
dvc stage add -d dvcx://yet. - Can only update
dvc.yamlanddvc.lockin current working directory. - No API to get datasets from the repository, which will be necessary for Studio support.
- CLI is rudimentary.
UX looks good!
There have been discussions before about consolidating the import syntax (cc @efiop) and wondering if we should take the opportunity to do it here. For example:
$ dvc ds add dvcx://dogs@v3
$ dvc ds add s3://cloud-versioning-demo
$ dvc ds add dvc://[email protected]:iterative/example-get-started.git@main#path=data/data.xml
Notes from discussion with @skshetry:
- Require a name for now
- Treat datasets as frozen -- only possible to update using
dvc ds update - Include version info only in
dvc.lock, notdvc.yaml - Revisit https://github.com/iterative/studio/issues/8117
Codecov Report
Attention: Patch coverage is 38.98734% with 241 lines in your changes are missing coverage. Please review.
Project coverage is 90.07%. Comparing base (
89537a7) to head (f2db666).
| Files | Patch % | Lines |
|---|---|---|
| dvc/repo/datasets.py | 44.39% | 133 Missing and 1 partial :warning: |
| dvc/commands/dataset.py | 28.57% | 75 Missing :warning: |
| dvc/dvcfile.py | 24.00% | 19 Missing :warning: |
| dvc/dependency/dataset.py | 7.14% | 13 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## main #10287 +/- ##
==========================================
- Coverage 90.59% 90.07% -0.52%
==========================================
Files 493 495 +2
Lines 37774 38161 +387
Branches 5460 5532 +72
==========================================
+ Hits 34221 34375 +154
- Misses 2919 3153 +234
+ Partials 634 633 -1
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
@skshetry What's the status of this?
This should be ready for review. The above asciinema is stale (although the only change is removal of --type with dvcx://, dvc:// and dvc+https:// schema). And support for pipelines with ds:// scheme.
eg:
dvc stage add -ntest -d ds://dataset_name "command"
To import data from dvc/git repositories, you'll need dvc:// schema to the repo url, e.g:
dvc://[email protected]/iterative/example-get-started.git
dvc+https://github.com/iterative/example-get-started.git
I'll try to update the asciinema recording tomorrow, and add more descriptions in the PR.
Updated the asciinema recording above.
The datasets section implemented in this PR is as a list in dvc.yaml. metrics/params/plots are also a list.
But the artifacts section is a dictionary, which has a similar schema/format as datasets with type/path etc.
I find lists simpler and easier to use, with fewer indentations. But I am still undecided.
So, I'll keep an option open for changing the datasets schema until we document it.
Nice @skshetry! It works well for me in doing QA over various different types. I mostly have UI questions.
$ dvc ds add dvc://[email protected]:iterative/example-get-started.git@main#path=data/data.xml
Do you have thoughts on trying to incorporate the path and rev into the url somehow like this? It feels like we are now stuck needing both complex CLI options (--rev and --path) and a complex URL format (dvc+https://github.com...). I think we might be better either going all the way in on the URL format or else reverting back to --type (but then we still need --rev/--path only for dvc datasets).
Other questions:
- Should
--urlbe a positional arg since it's required? - For https and hdfs, how do we version them and what's the intended use? I tried
dvc ds add --url https://dvc.org/ --name dvc-websiteand gotERROR: unexpected error - 'HTTPSFileSystem' object has no attribute 'coalesce_version' remote://remote_name/path/to/file/or/dir (seedvc remote)- is there an example of how to use this?
$ dvc ds add dvc://[email protected]:iterative/example-get-started.git@main#path=data/data.xml
I am not against supporting this. But I find this url to be very complex, and I doubt if I'll remember it when I need it.
Even if we do support both, I'd prefer keeping --rev/--path. I don't think it is complex or complicated to support.
Also, parsing these scp-style path is difficult IIUC as git@ is optional (cc @iterative/dvc, maybe you have some ideas). One idea that I have is to convert these scp-paths into ssh-paths and then parse, but I am not sure how robust this is, i.e
- [email protected]:iterative/example-get-started.git
+ ssh://[email protected]/iterative/example-get-started.git
- Should
--urlbe a positional arg since it's required?
yeah, both name and url should be a positional argument.
- For https and hdfs, how do we version them and what's the intended use? I tried
dvc ds add --url https://dvc.org/ --name dvc-websiteand gotERROR: unexpected error - 'HTTPSFileSystem' object has no attribute 'coalesce_version'
I am not sure about the intended use case. It might be useful in Studio to show if the source changed between commits from the metadata, but I am not sure how useful that information is. We could disable it for now.
remote://remote_name/path/to/file/or/dir (seedvc remote)- is there an example of how to use this?
This should just work without doing anything.
dvc remote add myremote s3://bucketname/prefix
dvc ds add --url remote://myremote/path/to/file/or/dir --name name
I am not against supporting this. But I find this url to be very complex, and I doubt if I'll remember it when I need it.
Yup, it might be worse, especially if you think we should keep --path and --rev. Do you think we should revert to using --type then?
- For https and hdfs, how do we version them and what's the intended use? I tried
dvc ds add --url https://dvc.org/ --name dvc-websiteand gotERROR: unexpected error - 'HTTPSFileSystem' object has no attribute 'coalesce_version'I am not sure about the intended use case. It might be useful in Studio to show if the source changed between commits from the metadata, but I am not sure how useful that information is. We could disable it for now.
Does it include a hash in dvc.lock? If the dataset changes, what happens if I do dvc repro with the old dataset?
Yup, it might be worse, especially if you think we should keep --path and --rev. Do you think we should revert to using --type then?
Is --url dvc://[email protected]:iterative/example-get-started.git worse than --url [email protected]:iterative/example-get-started.git --type dvc?
Does it include a hash in
dvc.lock?
It has an etag, but we don't have a hash of the content (eg: md5). To hash the content, you'd have to stream them from the remote.
If the dataset changes, what happens if I do
dvc reprowith the old dataset?
Since the import is frozen on dvc ds add or dvc ds update, we don't run any kind of checks during dvc repro. So it does not work with non-cloud-versioned remotes at all.
And non-cloud versioned remotes can not be frozen at all, so this will require a different solution, in which we check the sources during dvc repro, and update dvc.lock accordingly. This does not fit how the "dataset" is implemented, which requires to be frozen.
I don't see a good reason to support non-versioned types like https, hdfs, and maybe even remote (I guess it depends if it's version_aware). It seems like external dependencies already support this scenario in a simpler way unless I'm missing something.
I'll disable non-cloud-versioned remotes/filesystems in the successive PRs.
Can you drop them from the parser help text in this PR? I was focused on that more than actually disabling them.
Edit: Or is there some case where https or hdfs is a valid type?
Can you drop them from the parser help text in this PR? I was focused on that more than actually disabling them.
I have dropped them from the help text.