DVCX datasets
Do streaming datasets affect us in any way? I assume that'll be in the user's code rather than dvc (and cached by dvcx). Is that just dependency support without materializing?
Okay, looks like for streaming datasets, we will need some APIs to redirect a tracked dataset in dvc to the appropriate dataset in dql.
Discussed with @dmpetrov about this. He does not want dvc to compute md5 hashes for pipeline checksums, and wants that to be handled by dql itself, as in ask dql for workspace status.
His point was that there are: the database, support for returning different formats of data like json pairs/coco, and the actual files in "dvcx datasets" - all of which only dql has good knowledge about.
This would require us to decouple checksums that we use for pipelines and md5 hash that we use for data management.
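To make the decoupling concrete, here is a minimal sketch of what "ask dql for workspace status" could mean. Everything here is hypothetical - `DatasetFingerprint` and `workspace_status` are not existing dql APIs, just an illustration of dvc comparing an opaque fingerprint instead of computing md5 itself:

```python
# Hypothetical sketch: dvc asks the dataset tool (dql) for an opaque
# fingerprint instead of computing md5 hashes for pipeline checksums.
from dataclasses import dataclass


@dataclass
class DatasetFingerprint:
    name: str
    version: str
    fingerprint: str  # opaque value computed by dql, not an md5 of files


def workspace_status(locked: DatasetFingerprint, current: DatasetFingerprint) -> str:
    """Compare the fingerprint recorded in dvc.lock with dql's current one."""
    return "unchanged" if locked.fingerprint == current.fingerprint else "modified"


locked = DatasetFingerprint("dogs", "v1", "abc123")
current = DatasetFingerprint("dogs", "v1", "abc123")
print(workspace_status(locked, current))  # unchanged
```

The point is that dvc never interprets the fingerprint; it only compares what dql returns.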
There is also an open question of how dvc pull/dvc checkout, etc would work. And we also need to support non-instantiated/non-materialized datasets as checkout/pull is not always possible and training might happen on streaming datasets. So we also want to have a support for a dependency in your stage for pipelines.
This would be a matter of figuring out how the API will work, and how the translation works.
```shell
dvc stage add -d ds://dogs@v1 ...
```
(For this, we'll need to figure out how the hypothetical get_dataset() api in dvc will work and how we'll pass the data inside).
I think we should do a single integration (or start with just one). Given the challenges with the former, maybe it's better to start with the latter.
Thinking about another alternative, maybe we should create a top-level datasets in dvc.yaml where users register their versions/datasets, and then they can reference them in dvc stage add?
E.g.:

```shell
$ dvc dataset add ds://dogs@v2
$ cat dvc.yaml
datasets:
- ds://dogs@v2
$ dvc stage add -d ds://dogs ...
```

```python
from dvc.api import dataset

dataset.get("ds://dogs")  # gets redirected to "ds://dogs@v2" from `datasets` in dvc.yaml
```
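The redirection itself could be a simple lookup against the registered list. A sketch, with the parsing and API shape entirely assumed:

```python
# Sketch of how dataset.get() could resolve a bare ds:// name to the
# pinned version registered under `datasets:` in dvc.yaml (hypothetical).
def resolve(name: str, registered: list) -> str:
    """Return the pinned dataset URL for a bare ds:// name."""
    for entry in registered:
        base, _, _version = entry.partition("@")
        if base == name:
            return entry
    raise KeyError(f"{name} is not registered in dvc.yaml")


datasets = ["ds://dogs@v2"]  # as parsed from the `datasets:` list in dvc.yaml
print(resolve("ds://dogs", datasets))  # ds://dogs@v2
```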
Passing data through envvars won't work if you want to let users do python script.py outside of dvc repro/dvc exp run/dvclive.
~~Also, I am not quite convinced we need a stage-level dependency here, since there are not going to be multiple versions of the same dataset, and it's already registered globally.~~
(We'll need to know if it changed or not to rerun stages).
Just sent an invite for the 3 of us to discuss next week.
> There is also an open question of how dvc pull/dvc checkout, etc would work. And we also need to support non-instantiated/non-materialized datasets as checkout/pull is not always possible and training might happen on streaming datasets. So we also want to have support for a dependency in your stage for pipelines.
So these are the possible approaches I see for accessing DVCX datasets. Does it look correct to you two?
This corresponds to phase 1 and phase 2 in notion.
@skshetry Are you saying we should focus on approach 2?
In the second diagram, dvc will have to act as an intermediary.
The dataset info needs to be tracked in dvc's metadata, not in the source code, so that when users update the metadata (via Studio or through CLI), dvc will retrigger stages that changed.
At the same time, the dataset that is being referenced in source code should change to the newer dataset, and should be used for training.
Users could update in both places - in source code and in metadata - but making this as one-click as possible (in Studio and in CLI), and being able to retrigger your pipelines and run your changes, seems to be a feature that we want here.
> Are you saying we should focus on approach 2?
I am saying that we should start with one integration first.
Regarding the first approach, there is a strict requirement not to touch the dataset, and there are also technical challenges. Given this, I am not sure if dvc even needs to be involved?
Can users use dql to materialize it? Can it be just a dependency or an external dependency from dvc's point of view, and make dql the command to manipulate/checkout/pull datasets?
Thanks @skshetry. I agree that starting with approach 2 makes sense, just wanted to clarify that's what you meant. I also agree DVC needs awareness of the dataset, but I'm not convinced we need top-level datasets.
Within approach 2, I see two different scenarios:
a. Fixed versions -- I want to specify the exact version every time
b. Latest version -- I always want to use the latest but need to snapshot it for reproducibility
2a looks simple to me. Use the DVCX API directly in your code and parametrize it with DVC:
```yaml
# params.yaml
dataset:
  name: dogs
  version: 2
```

```yaml
# dvc.yaml
stages:
  train:
    cmd: python train.py ${dataset}
```

```python
ds = DatasetQuery(name=name, version=version)
```
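On the script side, 2a could look like the sketch below. It assumes DVC's dict interpolation expands `${dataset}` into `--name dogs --version 2` style flags (worth double-checking against DVC's parametrization docs), and the `DatasetQuery` call is left commented out:

```python
# Sketch of the train.py side of 2a (the ${dataset} -> --name/--version
# flag expansion is an assumption about DVC's dict interpolation).
import argparse


def parse_dataset_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--name", required=True)
    parser.add_argument("--version", type=int, required=True)
    return parser.parse_args(argv)


args = parse_dataset_args(["--name", "dogs", "--version", "2"])
# ds = DatasetQuery(name=args.name, version=args.version)  # the dql call
print(args.name, args.version)  # dogs 2
```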
2b is where we need DVC to snapshot the version, but only because the user doesn't want to specify it in code, so it can look like this:
```yaml
# dvc.yaml
stages:
  train:
    cmd: python train.py
    deps:
    - ds://dogs
```

```yaml
# dvc.lock
train:
  cmd: python train.py
  deps:
  - path: ds://dogs
    version: v2
```

```python
ds = DatasetQuery(name="dogs")
```
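What snapshotting means for dvc in 2b could be sketched like this (helper names are hypothetical): at repro time dvc asks DVCX for the latest version of `dogs` and compares it with what dvc.lock recorded.

```python
# Sketch: 2b dependency status check (hypothetical helper, assumed flow).
def dep_changed(locked_version, latest_version):
    """A ds:// dep is changed if it was never locked or a newer version exists."""
    return locked_version is None or locked_version != latest_version


# First run: nothing in dvc.lock yet, so the stage runs and v2 gets locked.
print(dep_changed(None, "v2"))   # True
# Later run: still v2 upstream, so the stage is up to date.
print(dep_changed("v2", "v2"))   # False
```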
> update the metadata (via Studio or through CLI)
In 2a, it can be done by updating the parameters.
In 2b, the update will happen automatically.
> Thinking about another alternative, maybe we should create a top-level `datasets` in dvc.yaml where users register their versions/datasets, and then they can reference them in `dvc stage add`?
I would prefer not to because:
- It's a more complex UX -- need another abstraction to add the dataset to DVCX first, then need to reference that with special syntax in dependencies and in code, and can't use code outside dvc
- It's less consistent with how we treat other datasets -- other dependencies don't require any dvc-specific syntax in code and don't need to be defined outside the pipeline first
- It's more work to implement
Maybe I am over-optimizing here and this is unnecessary, but dependencies can point to multiple dataset versions. That's what I was trying to avoid with this registration.
```yaml
train:
  cmd: python train.py
  deps:
  - path: ds://dogs
    version: v2
  - path: ds://dogs
    version: v1
```
Secondly, they are stage-local. You can have a stage with a dataset of a version and a separate stage with a different version.
We could easily forbid this though (at least in that DatasetQuery/get_dataset API).
Thirdly, you'll have to update in multiple places rather than just one. But that's a minor issue.
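The "forbid multiple versions" check mentioned above could be as simple as this sketch (hypothetical helper; assumes deps are exposed as path/version pairs):

```python
# Sketch: validate that each dataset name resolves to a single version
# across all stage deps - the conflict that registration would prevent.
def check_single_version(deps):
    seen = {}
    for dep in deps:
        path, version = dep["path"], dep["version"]
        if seen.setdefault(path, version) != version:
            raise ValueError(f"{path} pinned to both {seen[path]} and {version}")


check_single_version([{"path": "ds://dogs", "version": "v2"}])  # fine
try:
    check_single_version(
        [{"path": "ds://dogs", "version": "v2"}, {"path": "ds://dogs", "version": "v1"}]
    )
except ValueError as exc:
    print(exc)
```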
@dberenbaum, how would dvc and Studio understand which dataset is being used in 2a?
Also, do you think 2b) is important? Why don't we push for pinning the dataset for stronger reproducibility?
User can update dvc.yaml when they want to update (and could be one-click from Studio if needed).
Okay, I think you've convinced me on having top-level datasets as long as we are all aligned that the focus is on approach 2. I have some thoughts on minor modifications, and I do think 2b is important as it's probably the default workflow for a lot of people, but let's take the weekend to think about it and we can follow up on details next week.
Adding some additional thoughts. Maybe you thought of some of these but didn't detail them yet.
dvc.yaml
> ```shell
> $ dvc dataset add ds://dogs@v2
> $ cat dvc.yaml
> datasets:
> - ds://dogs@v2
> $ dvc stage add -d ds://dogs ...
> ```
Maybe we can generalize this more? We might want to reuse datasets in the future for other data sources (databases, cloud-versioned data, data from other repos). It seems like this is going in the direction of https://github.com/iterative/dvc/issues/2378, where datasets act like callback dependencies that don't rely on file hashing. WDYT about something like this (or we could keep the ds:// syntax but include dvcx somewhere)?
```yaml
# dvc.yaml
datasets:
  dogs:
    type: dvcx
    name: dogs
    version: v2
stages:
  train:
    deps:
    - dataset: dogs
```
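The generalization via `type` could amount to a small dispatch, roughly like this (a sketch; the handler shape and the idea that each type resolves to some dependency spec are assumptions):

```python
# Sketch: dispatching top-level `datasets` entries by `type`, so other
# sources (dbt, cloud-versioned data, other repos) could plug in later.
def resolve_dataset(entry):
    """Turn a dvc.yaml dataset entry into a dependency spec (hypothetical)."""
    if entry["type"] == "dvcx":
        return f"ds://{entry['name']}@{entry['version']}"
    raise ValueError(f"unknown dataset type: {entry['type']}")


print(resolve_dataset({"type": "dvcx", "name": "dogs", "version": "v2"}))  # ds://dogs@v2
```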
Also wondering how it should look in dvc.lock?
python api
> ```python
> from dvc.api import dataset
> dataset.get("ds://dogs")  # gets redirected to "ds://dogs@v2" from `datasets` in dvc.yaml
> ```
Can dataset.get() (or some method) return info about the dataset instead of actually reading the dataset? For example:
```python
from dql.query import DatasetQuery
from dvc.api import dataset

ds = DatasetQuery(**dataset.get("dogs"))
```
As DVCX adds more methods and ways to stream data like pytorch data loaders, etc., we don't want to have to add more DVC API methods to support every DVCX way to load data. Users can keep their native DVCX code and only use DVC to provide the dataset info.
@dberenbaum looks great!
> the direction of #2378, where `datasets` act like callback dependencies that don't rely on file hashing
Yes. It's related - you ask for the update status using an external tool (DVCX, DBT or custom code). IDK how much generalization is needed at this point.
> Also wondering how it should look in dvc.lock?
I guess, UUID or timestamp of datasets. This is needed to not mix up dogs@v2 with dogs@v2 that was created and deleted last week. Later DVCX will introduce a proper fingerprint but this is a whole separate story.
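Concretely, the lock entry could look something like this (a sketch; the `uuid` field name and placement are guesses, and the value below is made up):

```yaml
# dvc.lock (sketch; field names are guesses)
train:
  cmd: python train.py
  deps:
  - path: ds://dogs
    version: v2
    uuid: 019f3c2e-8d3a-4b8e-9f1a-0c2d4e6f8a1b
```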
> I guess, UUID or timestamp of datasets. This is needed to not mix up dogs@v2 with dogs@v2 that was created and deleted last week. Later DVCX will introduce a proper fingerprint but this is a whole separate story.
Related to #982. Is it important when creating and deleting a dataset that you can reuse the same version numbers? If we have a UUID or timestamp, can DVC still get the deleted dataset to reproduce (assuming the raw storage hasn't changed)? What is the point of version numbers if we also plan to have UUID/timestamp?
Name and version are human-readable.
It would be the same if dvc imports a file from a git repository. File names and tags are nice, but it's better to use a sha for a proper snapshot.