Document usage of Kedro + DVC

Open astrojuanlu opened this issue 1 year ago • 17 comments

Description

It would be nice if we had a page on our docs that described how DVC and Kedro can be used together.

Context

Kedro users have been asking for DVC for some time. For example:

  • https://github.com/kedro-org/kedro/discussions/837
  • https://github.com/FactFiber/kedro-dvc
  • To some extent, https://github.com/kedro-org/kedro/issues/221 (our oldest open issue at the moment)
  • https://www.linen.dev/s/kedro/t/11183370/does-kedro-work-with-dvc-or-any-data-version-control-solutio#3cbc0e93-e00d-4d3a-ac2d-06fc6b88914f
  • https://www.linen.dev/s/kedro/t/10118858/we-are-discussing-in-the-team-kedro-integration-and-several-#815f46fc-0160-400b-8f74-86abd8dd2285
  • https://www.linen.dev/s/kedro/t/12350811/hello-i-am-using-kedro-with-dvc-for-data-version-control-the#2149558d-d5a9-4327-8fdc-4b44210f8cb8

The scope of this issue is to document how such a thing can be done, but there's a chance that the best way of documenting it is to contribute a plugin.

astrojuanlu avatar Jun 15 '23 11:06 astrojuanlu

I asked ChatGPT; what's the quality like for this response? Just wondering if we can use it to start off a page...

Kedro and DVC are both tools that can be used for data science and machine learning projects. While they serve different purposes, they can be used together to enhance the reproducibility and version control aspects of your projects. Here's how you can integrate Kedro and DVC:

Initialize a Kedro project: Start by creating a new Kedro project using the kedro new command. This will set up the basic project structure and configuration files.

Set up DVC: Install DVC and initialize it within your Kedro project directory using the dvc init command. This will create the necessary DVC files and directories.

Define your Kedro pipelines: Use Kedro to define your data processing pipelines. This involves creating modular pipeline nodes that represent individual steps in your data science workflow.

Version your data with DVC: Use DVC to version your data files and track their changes over time. By using the dvc add command, you can tell DVC to track specific data files in your Kedro project. This ensures that your data remains under version control and can be easily reproduced.

Integrate DVC with Kedro pipelines: Modify your Kedro pipeline definitions to include DVC commands. For example, you can use the dvc run command within a Kedro pipeline node to run a data processing step and have DVC track the inputs and outputs of that step.

Use DVC for data versioning and sharing: DVC provides features for managing large datasets and collaborating with others. You can use DVC's remote storage options (such as AWS S3 or Google Cloud Storage) to store your data and share it with your team or collaborators.

By combining Kedro and DVC, you can leverage Kedro's pipeline management capabilities with DVC's data versioning and reproducibility features. This allows you to track and reproduce your data science experiments effectively, ensuring the integrity and consistency of your results.
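
The steps above can be sketched as a short command plan. This is a minimal sketch under stated assumptions, not a tested integration: the data path, pipeline name, and remote URL are hypothetical placeholders, and it assumes an existing Kedro project inside a Git repository. It also departs from the generated text in one place: instead of calling DVC from inside a Kedro node, it registers a DVC stage whose command is `kedro run`, and it uses `dvc stage add` because, as far as I know, recent DVC releases no longer ship `dvc run`.

```python
import shlex
import subprocess

# All names below are hypothetical placeholders, not taken from a real project.
RAW_DATA = "data/01_raw/companies.csv"
PIPELINE = "data_processing"
REMOTE_URL = "s3://example-bucket/dvc-store"


def build_commands(raw_data=RAW_DATA, pipeline=PIPELINE, remote_url=REMOTE_URL):
    """Return the CLI steps as argument lists, in the order they would run."""
    return [
        # Set up DVC inside an existing Kedro project (assumes a Git repository).
        ["dvc", "init"],
        # Track a raw data file with DVC.
        ["dvc", "add", raw_data],
        # Register a DVC stage whose command is a Kedro pipeline run.
        ["dvc", "stage", "add", "-n", pipeline, "-d", raw_data,
         f"kedro run --pipeline {pipeline}"],
        # Configure a default remote and upload the tracked data.
        ["dvc", "remote", "add", "-d", "storage", remote_url],
        ["dvc", "push"],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Calling `run_all(build_commands())` only prints the plan; passing `dry_run=False` would execute it, which requires both `kedro` and `dvc` to be installed. Whether this plays well with Kedro's own dataset versioning is an open question.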

stichbury avatar Jun 16 '23 09:06 stichbury

@stichbury brilliant idea!

noklam avatar Jun 16 '23 09:06 noklam

Sure, please assign me, I want to contribute and learn on the go

JaynouOliver avatar Oct 14 '23 14:10 JaynouOliver

Hi @JaynouOliver, go ahead! No need to assign the issue, start working on a new documentation page and open a pull request when it's ready for a first review.

astrojuanlu avatar Oct 14 '23 15:10 astrojuanlu

Sure!

JaynouOliver avatar Oct 15 '23 04:10 JaynouOliver

Interesting perspective from a DVC user: https://fosstodon.org/@blakeNaccarato/111256190959866234

I appreciate the separation of concerns that working with DVC facilitates. Stages as shell commands make non-Python stages trivial. It's good for general processing outside research pipelines too, e.g. document processing.

Stage caching is enabled by hash comparison of deps/outs on disk and avoids costly recompute.

But this design forces disk access between each stage and lots of intermediate files. An abstraction enabling all-in-memory stages could help at the expense of caching.
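
To make the caching idea in the quote concrete, here is a minimal sketch of skipping a stage when the content hashes of its dependencies and output are unchanged. It is an illustration of the general technique, not DVC's actual implementation; `run_stage`, the lock-file layout, and the toy `uppercase` stage are all hypothetical. Note that every stage reads and writes files, which is the disk-access cost the quote mentions.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_hash(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_stage(name, func, deps, out, lock_file):
    """Run func(deps, out) unless the recorded hashes of deps and out still match.

    Returns True if the stage ran, False if it was skipped.
    """
    lock = json.loads(lock_file.read_text()) if lock_file.exists() else {}
    current = {str(p): file_hash(p) for p in deps}
    recorded = lock.get(name)
    if (
        recorded is not None
        and out.exists()
        and recorded["deps"] == current
        and recorded["out"] == file_hash(out)
    ):
        return False  # nothing changed on disk, so skip the recompute
    func(deps, out)
    lock[name] = {"deps": current, "out": file_hash(out)}
    lock_file.write_text(json.dumps(lock, indent=2))
    return True


def uppercase(deps, out):
    """Toy stage: concatenate the inputs in upper case."""
    out.write_text("".join(p.read_text().upper() for p in deps))


def demo():
    """Run the toy stage three times and report which runs executed."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        raw, clean, lock = root / "raw.txt", root / "clean.txt", root / "stages.lock"
        raw.write_text("hello")
        first = run_stage("clean", uppercase, [raw], clean, lock)
        second = run_stage("clean", uppercase, [raw], clean, lock)
        raw.write_text("hello again")  # changing a dependency invalidates the cache
        third = run_stage("clean", uppercase, [raw], clean, lock)
        return first, second, third, clean.read_text()
```

In `demo()`, the second call is skipped because nothing changed, and the third runs again because the input file was modified.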

astrojuanlu avatar Oct 18 '23 20:10 astrojuanlu

Today @datajoely mentioned this in our Slack. I didn't realize that our dataset versioning sort of overlaps: https://linen-slack.kedro.org/t/16014653/hello-very-much-new-to-the-ml-world-i-m-trying-to-setup-a-fr#e111a9d2-188c-4cb3-8a64-37f938ad21ff

DVC and Kedro don’t gel super nicely together. It can be done, but our support for native DataSet versioning and Delta (Spark and non-Spark) also works in this space.

astrojuanlu avatar Oct 26 '23 10:10 astrojuanlu

Hi @JaynouOliver -- how are you? Today is the last day of October so please do slip any PRs into our queue if you have them for Hacktoberfest.

stichbury avatar Oct 31 '23 10:10 stichbury

Hi. I was not doing it for Hacktoberfest. Mind if I submit it by tomorrow?

JaynouOliver avatar Oct 31 '23 10:10 JaynouOliver

Then that's grand, yes please, that would work for us. Thank you.

stichbury avatar Oct 31 '23 10:10 stichbury

For the record, yesterday two users asked me how to combine Kedro and DVC.

astrojuanlu avatar Jan 25 '24 09:01 astrojuanlu

For the record, yesterday two users asked me how to combine Kedro and DVC.

Did you tell them? Did you write it down? If not, is the above generated content any use? Shall we publish?

I have many questions.

stichbury avatar Jan 25 '24 09:01 stichbury

It was an in-person chat after my talk. I told them to try https://github.com/FactFiber/kedro-dvc/ but also warned them that Kedro versioning is not easily configurable, so it might be hard (see https://github.com/kedro-org/kedro/issues/2355). I think this has to be an engineering spike before a documentation issue.

astrojuanlu avatar Jan 25 '24 10:01 astrojuanlu

Perfect, thanks for the background and also for the change in the ticket, makes sense to me.

stichbury avatar Jan 25 '24 10:01 stichbury

We're looking at this in the context of broader versioning and dataset research. If you have thoughts on this please comment on #3997.

merelcht avatar Jul 15 '24 13:07 merelcht