
DVCX datasets

Open skshetry opened this issue 2 years ago • 16 comments

skshetry avatar Nov 28 '23 03:11 skshetry

Do streaming datasets affect us in any way? I assume that will live in the user's code rather than in dvc (and be cached by dvcx). Is that just dependency support without materializing?

skshetry avatar Dec 01 '23 16:12 skshetry

Okay, it looks like for streaming datasets we will need some APIs to redirect a tracked dataset in dvc to the appropriate dataset in dql.

skshetry avatar Dec 02 '23 04:12 skshetry

Discussed this with @dmpetrov. He does not want dvc to compute md5 hashes for pipeline checksums; he wants that to be handled by dql itself, i.e. dvc would ask dql for the workspace status. His point was that a "dvcx dataset" involves a database, support for returning data in different formats (e.g. json pairs/coco), and the actual files, and only dql has good knowledge of all of that.

This would require us to decouple the checksums we use for pipelines from the md5 hashes we use for data management.

There is also an open question of how dvc pull/dvc checkout, etc would work. And we also need to support non-instantiated/non-materialized datasets as checkout/pull is not always possible and training might happen on streaming datasets. So we also want to have a support for a dependency in your stage for pipelines.

This would be a matter of figuring out how the API will work, and how the translation works.

dvc stage add -d ds://dogs@v1 ...

(For this, we'll need to figure out how the hypothetical get_dataset() api in dvc will work and how we'll pass the data inside).

I think we should do a single integration (or start with just one). Given the challenges with the former, maybe it's better to start with the latter.

skshetry avatar Dec 08 '23 06:12 skshetry

Thinking about another alternative: maybe we should create a top-level `datasets` section in dvc.yaml where users register their datasets/versions, and then reference them in dvc stage add?

Eg:

$ dvc dataset add ds://dogs@v2
$ cat dvc.yaml
datasets:
- ds://dogs@v2
$ dvc stage add -d ds://dogs ...
# script.py
from dvc.api import dataset

dataset.get("ds://dogs") # gets redirected to "ds://dogs@v2" from `datasets` in dvc.yaml

Passing data through envvars won't work if you want to let users do python script.py outside of dvc repro/dvc exp run/dvclive.

~~Also, I am not quite convinced we need a stage-level dependency here, since there are not going to be multiple versions of the same dataset, and it's already registered globally.~~

(We'll need to know if it changed or not to rerun stages).
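Roughly, the redirection could work like this. A sketch only: the `resolve` function and the flat list of registered entries are my assumptions, not a real dvc API.

```python
# Hypothetical sketch of how `dataset.get("ds://dogs")` could be
# redirected to the pinned "ds://dogs@v2" entry registered under the
# top-level `datasets` in dvc.yaml. Nothing here is the real dvc API.
def resolve(url, registered):
    name = url.split("@", 1)[0]  # "ds://dogs@v1" -> "ds://dogs"
    for entry in registered:
        if entry.split("@", 1)[0] == name:
            return entry  # return the pinned entry
    raise KeyError(f"{url} is not registered in dvc.yaml")

print(resolve("ds://dogs", ["ds://dogs@v2"]))  # ds://dogs@v2
```

This also gives us a single place to detect whether the registered version changed, which is what a stage rerun would key off.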

skshetry avatar Dec 08 '23 07:12 skshetry

Just sent an invite for the 3 of us to discuss next week.

There is also an open question of how dvc pull/dvc checkout, etc would work. And we also need to support non-instantiated/non-materialized datasets as checkout/pull is not always possible and training might happen on streaming datasets. So we also want to have a support for a dependency in your stage for pipelines.

So these are the possible approaches I see for accessing DVCX datasets. Does it look correct to you two?

(diagram attachment: DVC DVCX - Frame 2(1))

This corresponds to phase 1 and phase 2 in notion.

@skshetry Are you saying we should focus on approach 2?

dberenbaum avatar Dec 08 '23 12:12 dberenbaum

In the second diagram, dvc will have to act as an intermediary.

The dataset info needs to be tracked in dvc's metadata, not in the source code, so that when users update the metadata (via Studio or through CLI), dvc will retrigger stages that changed.

At the same time, the dataset that is being referenced in source code should change to the newer dataset, and should be used for training.

Users could update the dataset in both places, in source code and in metadata, but the feature we want here is making that update as close to one-click as possible, in Studio and in the CLI, so that pipelines can be retriggered and changes rerun.

skshetry avatar Dec 08 '23 13:12 skshetry

Are you saying we should focus on approach 2?

I am saying that we should start with one integration first.

Regarding the first approach, there is a strict requirement not to touch the dataset, and there are technical challenges as well. Given this, I am not sure dvc even needs to be involved.

Can users use dql to materialize it? Can it be just a dependency or an external dependency from dvc's point of view, and make dql the command to manipulate/checkout/pull datasets?

skshetry avatar Dec 08 '23 13:12 skshetry

Thanks @skshetry. I agree that starting with approach 2 makes sense, just wanted to clarify that's what you meant. I also agree DVC needs awareness of the dataset, but I'm not convinced we need top-level datasets.

Within approach 2, I see two different scenarios:

a. Fixed versions -- I want to specify the exact version every time
b. Latest version -- I always want to use the latest but need to snapshot it for reproducibility

2a looks simple to me. Use the DVCX API directly in your code and parametrize it with DVC:

# params.yaml
dataset:
  name: dogs
  version: 2
# dvc.yaml
stages:
  train:
    cmd: python train.py ${dataset}
ds = DatasetQuery(name=name, version=version)
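If I understand DVC's dict interpolation correctly, `${dataset}` in `cmd` expands to `--name dogs --version 2`, so `train.py` only needs to parse those flags. A sketch (the DatasetQuery call is commented out since it needs dql):

```python
# train.py (sketch) -- receives the params.yaml `dataset` dict, which
# DVC's dict interpolation expands to CLI flags: --name dogs --version 2
import argparse

def parse_cli(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--name", required=True)
    parser.add_argument("--version", type=int, required=True)
    return parser.parse_args(argv)

# simulate what `dvc repro` would pass on the command line
args = parse_cli(["--name", "dogs", "--version", "2"])
# ds = DatasetQuery(name=args.name, version=args.version)
print(args.name, args.version)  # dogs 2
```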

2b is where we need DVC to snapshot the version, but only because the user doesn't want to specify it in code, so it can look like this:

# dvc.yaml
stages:
  train:
    cmd: python train.py
    deps:
      - ds://dogs
# dvc.lock
train:
  cmd: python train.py
  deps:
    - path: ds://dogs
      version: v2
# train.py
ds = DatasetQuery(name="dogs")

update the metadata (via Studio or through CLI)

In 2a, it can be done by updating the parameters.

In 2b, the update will happen automatically.

Thinking about another alternative, maybe we should create a top-level datasets in dvc.yaml where users register their versions/datasets, and then they can reference them in dvc stage add?

I would prefer not to because:

  • It's a more complex UX -- need another abstraction to add the dataset to DVCX first, then need to reference that with special syntax in dependencies and in code, and can't use code outside dvc
  • It's less consistent with how we treat other datasets -- other dependencies don't require any dvc-specific syntax in code and don't need to be defined outside the pipeline first
  • It's more work to implement

dberenbaum avatar Dec 08 '23 13:12 dberenbaum

Maybe I am over-optimizing here and this is unnecessary, but dependencies can point to multiple dataset versions. That's what I was trying to avoid with this registration.

train:
  cmd: python train.py
  deps:
    - path: ds://dogs
      version: v2
    - path: ds://dogs
      version: v1

Secondly, they are stage-local: you can have a stage with one version of a dataset and a separate stage with a different version.

We could easily forbid this though (at least in that DatasetQuery/get_dataset API).

Thirdly, you'll have to update in multiple places rather than just one. But that's a minor issue.

skshetry avatar Dec 08 '23 14:12 skshetry

@dberenbaum, how would dvc and Studio understand which dataset is being used in 2a?

skshetry avatar Dec 08 '23 14:12 skshetry

Also, do you think 2b) is important? Why don't we push for pinning the dataset for stronger reproducibility?

User can update dvc.yaml when they want to update (and could be one-click from Studio if needed).

skshetry avatar Dec 08 '23 15:12 skshetry

Okay, I think you've convinced me on having top-level datasets as long as we are all aligned that the focus is on approach 2. I have some thoughts on minor modifications, and I do think 2b is important as it's probably the default workflow for a lot of people, but let's take the weekend to think about it and we can follow up on details next week.

dberenbaum avatar Dec 08 '23 18:12 dberenbaum

Adding some additional thoughts. Maybe you thought of some of these but haven't detailed them yet.

dvc.yaml

$ dvc dataset add ds://dogs@v2
$ cat dvc.yaml
datasets:
- ds://dogs@v2
$ dvc stage add -d ds://dogs ...

Maybe we can generalize this more? We might want to reuse datasets in the future for other data sources (databases, cloud-versioned data, data from other repos). It seems like this is going in the direction of https://github.com/iterative/dvc/issues/2378, where datasets act like callback dependencies that don't rely on file hashing. WDYT about something like this (or we could keep the ds:// syntax but include dvcx somewhere)?

# dvc.yaml
datasets:
  dogs:
    type: dvcx
    name: dogs
    version: v2
stages:
  train:
    deps:
      - dataset: dogs

Also wondering: how should it look in dvc.lock?

python api

from dvc.api import dataset

dataset.get("ds://dogs") # gets redirected to "ds://dogs@v2" from `datasets` in dvc.yaml

Can dataset.get() (or some method) return info about the dataset instead of actually reading the dataset? For example:

from dql.query import DatasetQuery
from dvc.api import dataset

ds = DatasetQuery(**dataset.get("dogs"))

As DVCX adds more methods and ways to stream data like pytorch data loaders, etc., we don't want to have to add more DVC API methods to support every DVCX way to load data. Users can keep their native DVCX code and only use DVC to provide the dataset info.
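For instance, `dataset.get()` could just return a kwargs dict. A sketch only: the `REGISTRY` dict stands in for the top-level `datasets` in dvc.yaml, and every name here is hypothetical.

```python
# Sketch: dataset.get() returns dataset *info* (name/version) rather
# than data, so callers can hand it to any native DVCX API they like.
REGISTRY = {"dogs": {"type": "dvcx", "name": "dogs", "version": "v2"}}

def get(alias):
    info = REGISTRY[alias]
    return {"name": info["name"], "version": info["version"]}

print(get("dogs"))  # {'name': 'dogs', 'version': 'v2'}
# ds = DatasetQuery(**get("dogs"))  # native DVCX code stays unchanged
```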

dberenbaum avatar Dec 11 '23 20:12 dberenbaum

@dberenbaum looks great!

the direction of #2378, where datasets act like callback dependencies that don't rely on file hashing

Yes, it's related - you ask for the update status using an external tool (DVCX, DBT, or custom code). IDK how much generalization is needed at this point.

Also wondering how should it look in dvc.lock?

I guess a UUID or timestamp of the dataset. This is needed so as not to mix up dogs@v2 with a dogs@v2 that was created and deleted last week. Later DVCX will introduce a proper fingerprint, but that's a whole separate story.
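So dvc.lock might look something like this (field names are guesses on my part, and the uuid is a placeholder):

```yaml
# dvc.lock (hypothetical)
train:
  cmd: python train.py
  deps:
    - path: ds://dogs
      version: v2           # human-readable label
      uuid: <dataset-uuid>  # snapshot identity; survives delete/recreate
```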

dmpetrov avatar Dec 11 '23 20:12 dmpetrov

I guess a UUID or timestamp of the dataset. This is needed so as not to mix up dogs@v2 with a dogs@v2 that was created and deleted last week. Later DVCX will introduce a proper fingerprint, but that's a whole separate story.

Related to #982. Is it important to be able to reuse the same version numbers after creating and deleting a dataset? If we have a UUID or timestamp, can DVC still get the deleted dataset to reproduce (assuming the raw storage hasn't changed)? What is the point of version numbers if we also plan to have a UUID/timestamp?

dberenbaum avatar Dec 12 '23 16:12 dberenbaum

Name and version are human-readable.

It would be the same as when dvc imports a file from a git repository: the file name and tags are nice, but it's better to use a sha for a proper snapshot.

dmpetrov avatar Dec 12 '23 17:12 dmpetrov