Streaming DVC imports
https://github.com/iterative/dvc/pull/10164 will introduce datasets as a new type of dependency that isn't based on the local filesystem. The same mechanism can be used to stream data from other DVC repos. Unlike dvc import, no local copy of the data is needed. Users can specify a revision, freeze it, make it a stage dependency, and stream it into their code using the DVC API.
@dberenbaum
and stream it into their code using the DVC API
could you clarify this please? what DVC API is going to be used?
That's still being worked out in https://github.com/iterative/dvc/pull/10164, but you can see examples there of how a dvc.api.dataset may work, for DVCX at least. For streaming DVC imports, it could either return info like repo url, revision hash, etc. to pass to another API like DVCFileSystem, or it could be a wrapper around DVCFileSystem.
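To make the two options concrete, here is a minimal sketch, not the design from #10164. `ResolvedDataset`, `DatasetFS`, and the `fs_factory` parameter are hypothetical names used only for illustration; the only real API assumed is `dvc.api.DVCFileSystem(url=..., rev=...)` and its fsspec-style `open()`.

```python
import posixpath
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ResolvedDataset:
    """Option 1 (hypothetical): plain info the user passes to any API."""

    url: str   # URL of the source DVC repo
    rev: str   # pinned (frozen) revision
    path: str  # path of the data inside that repo


def _default_fs(url: str, rev: str) -> Any:
    # Imported lazily so the sketch can be read without DVC installed.
    from dvc.api import DVCFileSystem

    return DVCFileSystem(url=url, rev=rev)


class DatasetFS:
    """Option 2 (hypothetical): a thin wrapper around DVCFileSystem."""

    def __init__(
        self,
        resolved: ResolvedDataset,
        fs_factory: Callable[[str, str], Any] = _default_fs,
    ) -> None:
        self.resolved = resolved
        self._fs = fs_factory(resolved.url, resolved.rev)

    def open(self, relpath: str = "", mode: str = "rb"):
        # Paths are interpreted relative to the dataset's path in the repo.
        base = self.resolved.path
        full = posixpath.join(base, relpath) if relpath else base
        return self._fs.open(full, mode)
```

With option 1 the user keeps full control over how the data is read; with option 2 the repo URL and revision are hidden behind the wrapper and only dataset-relative paths are exposed.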
Got it. I wonder if this is needed (vs. people using their own tools to access data the way they want). Should we just have a way to pass some information about the dependency to the user code? (This was asked in some other contexts too, I think.)
should we just have a way to pass some information about the dependency to the user code?
Sorry, I don't think I follow how that differs from the example in #10164 where dvc.api.dataset returns the dataset name and version?
Actually, I thought there was only a DVCX example in #10164, but there's also one for streaming DVC imports that looks like this:
    from dvc.api.dataset import DVCDataset, get
    from dvc.fs.dvc import DVCFileSystem

    resolved = get(DVCDataset, "stackoverflow")
    fs = DVCFileSystem(url=resolved.url, rev=resolved.rev)
    with fs.open(resolved.path) as f:
        process_posts(f.readlines())
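For comparison, streaming a file from another DVC repo at a pinned revision can already be done with the existing `dvc.api.open(path, repo=..., rev=...)`, as long as the caller supplies the repo URL and revision by hand. The sketch below assumes that; `stream_lines` and its `opener` parameter are hypothetical helpers, not part of DVC.

```python
from typing import Callable, Iterator, Optional


def stream_lines(
    path: str,
    repo: str,
    rev: str,
    opener: Optional[Callable] = None,
) -> Iterator[str]:
    """Yield lines of `path` from `repo` at `rev` without a local copy."""
    if opener is None:
        # Imported lazily so the sketch can be read without DVC installed.
        import dvc.api

        opener = dvc.api.open
    # dvc.api.open returns a context manager yielding a file-like object.
    with opener(path, repo=repo, rev=rev) as f:
        for line in f:
            yield line.rstrip("\n")
```

A dataset dependency would remove the need to hard-code `repo` and `rev` in user code, since both would come from the frozen dependency.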
okay, I see. I got confused by "For streaming DVC imports, it could either return info like repo url, revision hash, etc. to pass to another API like DVCFileSystem, or it could be a wrapper around DVCFileSystem", but I see that this is just an example for the DVC-specific deps.
Good then, that makes sense. The only potential thing to look into is whether it can be generalized with an API that provides info about the pipeline / deps in general.
The only potential thing to look into is whether it can be generalized with an API that provides info about the pipeline / deps in general.
Good point. Related to https://github.com/iterative/dvc/issues/10179. Maybe we can combine these APIs. cc @skshetry
@skshetry Forgot that we already have this issue and #10231. Added both to the project board. Would be great to also get your thoughts on the API and whether it makes sense to combine with #10179.