
Streaming DVC imports

Open dberenbaum opened this issue 2 years ago • 8 comments

https://github.com/iterative/dvc/pull/10164 will introduce datasets as a new type of dependency that aren't based on the local filesystem. This same mechanism can be used to stream data from other DVC repos. Unlike dvc import, no local copy of the data is needed. Users can specify a revision, freeze it, make it a stage dependency, and stream it into their code using the DVC API.

dberenbaum avatar Jan 11 '24 18:01 dberenbaum

@dberenbaum

and stream it into their code using the DVC API

could you clarify this please? what DVC API is going to be used?

shcheklein avatar Jan 11 '24 19:01 shcheklein

That's still being worked out in https://github.com/iterative/dvc/pull/10164, but you can see examples there of how dvc.api.dataset may work, for DVCX at least. For streaming DVC imports, it could either return info like the repo url, revision hash, etc. to pass to another API like DVCFileSystem, or it could be a wrapper around DVCFileSystem.
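To make the two options concrete, here is a rough sketch. It is not the API from the PR: ResolvedImport, resolve_import, open_import, and the _FROZEN table are hypothetical names used only for illustration, and the repo url, revision, and path are made up. The filesystem class is passed in as a parameter, on the assumption that it accepts url= and rev= and exposes an fsspec-style open(); in practice that would presumably be DVCFileSystem.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class ResolvedImport:
    """Hypothetical record describing a frozen DVC import."""
    url: str   # repo URL
    rev: str   # revision hash the dependency is frozen at
    path: str  # path of the data inside that repo


# Hypothetical stand-in for whatever DVC would actually record for the stage.
_FROZEN: Dict[str, ResolvedImport] = {
    "stackoverflow": ResolvedImport(
        url="https://example.com/org/repo.git",
        rev="0123abc",
        path="data/posts.txt",
    ),
}


def resolve_import(name: str) -> ResolvedImport:
    """Option 1: only return the info; the caller decides how to read the data."""
    return _FROZEN[name]


def open_import(name: str, fs_factory: Callable[..., Any], mode: str = "r"):
    """Option 2: thin wrapper that builds the filesystem and opens the path.

    fs_factory is assumed to accept url= and rev= keyword arguments and to
    return an object with an fsspec-style open(path, mode) method.
    """
    info = resolve_import(name)
    fs = fs_factory(url=info.url, rev=info.rev)
    return fs.open(info.path, mode)
```

With option 1 the user keeps full control over which tool reads the data; option 2 trades that flexibility for a one-line call such as open_import("stackoverflow", DVCFileSystem).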

dberenbaum avatar Jan 11 '24 19:01 dberenbaum

Got it. I wonder if this is needed (vs people using their own tools to access data the way they want) - should we just have a way to pass some information about the dependency to the user code? (it was asked btw in some other contexts I think).

shcheklein avatar Jan 11 '24 19:01 shcheklein

should we just have a way to pass some information about the dependency to the user code?

Sorry, I don't think I follow how that differs from the example in #10164 where dvc.api.dataset returns the dataset name and version?

dberenbaum avatar Jan 11 '24 19:01 dberenbaum

Actually, I thought there was only a DVCX example in #10164, but there's also one for streaming DVC imports that looks like this:

from dvc.api.dataset import DVCDataset, get
from dvc.fs.dvc import DVCFileSystem

resolved = get(DVCDataset, "stackoverflow")
fs = DVCFileSystem(url=resolved.url, rev=resolved.rev)
with fs.open(resolved.path) as f:
    process_posts(f.readlines())
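
One note on the snippet above: readlines() loads the whole file into memory, which somewhat defeats the point of streaming for large datasets. A minimal sketch of a lazy variant follows. iter_posts is a hypothetical helper, and it assumes the filesystem object has an fsspec-style open(path, mode) that returns a text file object when given mode "r".

```python
from typing import Any, Iterator


def iter_posts(fs: Any, path: str) -> Iterator[str]:
    """Yield lines one at a time instead of reading the whole file.

    Hypothetical helper: `fs` only needs an open(path, mode) method that
    returns a text file object usable as a context manager.
    """
    with fs.open(path, "r") as f:
        for line in f:
            # Strip only the trailing newline so the content is preserved.
            yield line.rstrip("\n")
```

The caller could then loop with `for post in iter_posts(fs, resolved.path)` and process each line as it arrives.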

dberenbaum avatar Jan 11 '24 19:01 dberenbaum

okay, I see. I got confused by "For streaming DVC imports, it could either return info like repo url, revision hash, etc. to pass to another API like DVCFileSystem, or it could be a wrapper around DVCFileSystem." - but I see that this is just an example for the DVC-specific deps.

Good then, and it makes sense. The only potential thing to look into is whether it can be generalized with an API that provides info about the pipeline / deps in general.

shcheklein avatar Jan 11 '24 19:01 shcheklein

The only potential thing to look into is whether it can be generalized with an API that provides info about the pipeline / deps in general.

Good point. Related to https://github.com/iterative/dvc/issues/10179. Maybe we can combine these APIs. cc @skshetry

dberenbaum avatar Jan 11 '24 20:01 dberenbaum

@skshetry Forgot that we already have this issue and #10231. Added both to the project board. Would be great to also get your thoughts on the API and whether it makes sense to combine it with #10179.

dberenbaum avatar Jan 23 '24 13:01 dberenbaum