guide: ML Pipelines (1): Defining Pipelines & Stages

Open iesahin opened this issue 2 years ago • 13 comments

Related to #2883

  • [x] Index page: General explanation about ML Pipelines
  • [x] Defining Pipelines page. Covers:
    • Stages
    • Dependency graph (DAG)
    • dvc.yaml intro
    • Commands (cmd) details
    • Dependencies (overview)
  • [x] Remove explanations on these topics from other pages (mainly run, stage add, and repro refs)
  • [x] Update many links accordingly
  • [x] Add links from appropriate basic concept tooltips
  • [ ] Update Glossary accordingly?
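
For readers new to the topics in the checklist above, here is a minimal Python sketch of how stages and their dependency graph (DAG) relate. It is an illustration only, not DVC's implementation: the stage names, file names, and the helpers `build_dag` and `execution_order` are hypothetical, and the dictionary merely mirrors the `stages` / `cmd` / `deps` / `outs` layout of a `dvc.yaml` file. It assumes Python 3.9+ for `graphlib`.

```python
from graphlib import TopologicalSorter

# Hypothetical stages, shaped like the `stages:` mapping of a dvc.yaml file.
STAGES = {
    "prepare": {
        "cmd": "python prepare.py",
        "deps": ["prepare.py", "data/raw.csv"],
        "outs": ["data/clean.csv"],
    },
    "train": {
        "cmd": "python train.py",
        "deps": ["train.py", "data/clean.csv"],
        "outs": ["model.pkl"],
    },
    "evaluate": {
        "cmd": "python evaluate.py",
        "deps": ["evaluate.py", "model.pkl", "data/clean.csv"],
        "outs": ["metrics.json"],
    },
}


def build_dag(stages):
    """Map each stage to the set of upstream stages whose outputs it consumes."""
    # Index every output path by the stage that produces it.
    producer = {
        out: name
        for name, stage in stages.items()
        for out in stage.get("outs", [])
    }
    # A stage depends on another stage when one of its deps is that stage's output.
    return {
        name: {producer[dep] for dep in stage.get("deps", []) if dep in producer}
        for name, stage in stages.items()
    }


def execution_order(stages):
    """Return stage names with every upstream stage before its dependents."""
    # TopologicalSorter raises graphlib.CycleError if the graph is not acyclic.
    return list(TopologicalSorter(build_dag(stages)).static_order())


if __name__ == "__main__":
    for name in execution_order(STAGES):
        print(name, "->", STAGES[name]["cmd"])
```

The point of the sketch is that the graph is never written down explicitly: it falls out of matching one stage's outputs against another stage's dependencies.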

In review app: https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/machine-learning-pipelines

iesahin avatar Apr 05 '22 15:04 iesahin

Hi! Are we ready for reviews here? Still a draft

jorgeorpinel avatar May 09 '22 20:05 jorgeorpinel

Gatsby Cloud Build Report

dvc.org

:tada: Your build was successful! See the Deploy preview here.

Build Details

View the build logs here.

:clock1: Build time: 1m

Performance

Lighthouse report

Metric Score
Performance :large_orange_diamond: 61
Accessibility :green_heart: 98
Best Practices :large_orange_diamond: 83
SEO :green_heart: 93

:link: View full report

gatsby-cloud[bot] avatar May 09 '22 20:05 gatsby-cloud[bot]

@jorgeorpinel do we plan to take it over or should we close this?

shcheklein avatar Jul 12 '22 00:07 shcheklein

Taking this over. Let me see if we can get a decent guide quickly based on the existing content here (iterate once)... ⏳

jorgeorpinel avatar Jul 13 '22 17:07 jorgeorpinel

Looks like this is a larger effort than I hoped. For now I separated the existing content and focused on the Pipeline Definition guide -- now we can remove some detail from https://dvc.org/doc/user-guide/project-structure/pipelines-files (needed for other work such as https://github.com/iterative/dvc.org/pull/3691#pullrequestreview-1031804580).

PTAL if you can @shcheklein

jorgeorpinel avatar Jul 14 '22 02:07 jorgeorpinel

Link Check Report

  • content/docs/command-reference/dag.md

    • FAIL: /doc/user-guide/data-pipelines/defining-pipelines#directed-acyclic-graph-dag = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines#directed-acyclic-graph-dag (404)
  • content/docs/command-reference/exp/init.md

    • PASS: /doc/command-reference/config#exp = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/command-reference/config#exp (200)
    • PASS: /doc/user-guide/project-structure/dvcyaml-files#stage-commands = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/project-structure/dvcyaml-files#stage-commands (200)
    • FAIL: /doc/user-guide/data-pipelines/defining-pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines (404)
  • content/docs/command-reference/move.md

    • FAIL: /doc/user-guide/data-pipelines/defining-pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines (404)
  • content/docs/command-reference/repro.md

    • FAIL: /doc/user-guide/data-pipelines/defining-pipelines#avoiding-unexpected-behavior = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines#avoiding-unexpected-behavior (404)
    • FAIL: /doc/user-guide/data-pipelines/defining-pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines (404)
  • content/docs/command-reference/run.md

    • PASS: /doc/command-reference/repro = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/command-reference/repro (200)
    • FAIL: /doc/user-guide/data-pipelines/defining-pipelines#avoiding-unexpected-behavior = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines#avoiding-unexpected-behavior (404)
    • PASS: /doc/user-guide/project-structure/dvcyaml-files#stage-commands = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/project-structure/dvcyaml-files#stage-commands (200)
    • FAIL: /doc/user-guide/data-pipelines/defining-pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines (404)
    • PASS: /doc/user-guide/pipelines/defining-pipelines#parameter-dependencies = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/pipelines/defining-pipelines#parameter-dependencies (200)
  • content/docs/command-reference/stage/add.md

    • PASS: /doc/command-reference/repro = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/command-reference/repro (200)
    • FAIL: /doc/user-guide/data-pipelines/defining-pipelines#pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines#pipelines (404)
    • PASS: /doc/user-guide/project-structure/dvcyaml-files#stage-commands = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/project-structure/dvcyaml-files#stage-commands (200)
    • FAIL: /doc/user-guide/data-pipelines/defining-pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines (404)
    • PASS: /doc/user-guide/pipelines/defining-pipelines#parameter-dependencies = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/pipelines/defining-pipelines#parameter-dependencies (200)
  • content/docs/command-reference/stage/index.md

    • FAIL: /doc/user-guide/data-pipelines/defining-pipelines#stages = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines#stages (404)
  • content/docs/start/data-management/metrics-parameters-plots.md

    • PASS: /doc/start/data-pipelines#dependency-graphs-dag = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/start/data-pipelines#dependency-graphs-dag (200)
  • content/docs/start/data-management/pipelines.md

    • FAIL: /doc/user-guide/data-pipelines/defining-pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines (404)
  • content/docs/user-guide/basic-concepts/pipeline.md

    • PASS: /doc/user-guide/project-structure/dvcyaml-files = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/project-structure/dvcyaml-files (200)
  • content/docs/user-guide/experiment-management/running-experiments.md

    • FAIL: /doc/user-guide/data-pipelines/defining-pipelines#directed-acyclic-graph = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines#directed-acyclic-graph (404)
  • content/docs/user-guide/pipelines/defining-pipelines.md

    • PASS: https://www.gnu.org/software/software.html (200)
    • PASS: /doc/user-guide/external-dependencies = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/external-dependencies (200)
    • PASS: /doc/use-cases/versioning-data-and-model-files = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/use-cases/versioning-data-and-model-files (200)
    • PASS: /doc/start/data-management = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/start/data-management (200)
    • PASS: /doc/command-reference/metrics = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/command-reference/metrics (200)
    • PASS: /doc/user-guide/visualizing-plots = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/visualizing-plots (200)
    • PASS: https://en.wikipedia.org/wiki/Directed_acyclic_graph (200)
    • PASS: /doc/user-guide/project-structure/dvcyaml-files#stage-entries = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/project-structure/dvcyaml-files#stage-entries (200)
    • PASS: /doc/user-guide/project-structure/dvcyaml-files#stage-commands = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/project-structure/dvcyaml-files#stage-commands (200)
    • PASS: /doc/user-guide/project-structure/pipelines-files#templating = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/project-structure/pipelines-files#templating (200)
    • PASS: /doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory (200)
    • PASS: /doc/user-guide/project-structure/dvcyaml-files#parameters-files = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/project-structure/dvcyaml-files#parameters-files (200)
    • FAIL: https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/ (403)
    • PASS: /doc/user-guide/project-structure/dvcyaml-files#parameters = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/project-structure/dvcyaml-files#parameters (200)
  • content/docs/user-guide/pipelines/index.md

    • PASS: /doc/start/data-management/pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/start/data-management/pipelines (200)
    • FAIL: /doc/user-guide/data-pipelines/defining-pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines (404)
  • content/docs/user-guide/project-structure/dvcyaml-files.md

    • PASS: https://www.gnu.org/software/software.html (200)
  • content/docs/user-guide/project-structure/internal-files.md

    • FAIL: /doc/user-guide/data-pipelines/defining-pipelines#avoiding-unexpected-behavior = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines#avoiding-unexpected-behavior (404)
  • content/docs/user-guide/related-technologies.md

    • FAIL: /doc/user-guide/data-pipelines/defining-pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines (404)
    • FAIL: /doc/user-guide/data-pipelines/defining-pipelines#directed-acyclic-graph-dag = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines#directed-acyclic-graph-dag (404)
  • content/docs/user-guide/what-is-dvc.md

    • FAIL: /doc/user-guide/data-pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines (404)

18/43 links failed.
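
A report like the one above can be summarized with a short script. The following is a rough sketch, assuming each result line has the form `PASS|FAIL: <target> [= <resolved URL>] (<status>)` as seen in this report; the function name `summarize` and the regular expression are illustrative and are not part of the link checker itself.

```python
import re
from collections import Counter

# Matches e.g. "FAIL: /doc/x#y = https://host/doc/x#y (404)"
# and "PASS: https://www.gnu.org/software/software.html (200)".
RESULT = re.compile(r"(PASS|FAIL): (\S+)(?: = \S+)? \((\d{3})\)")


def summarize(report_lines):
    """Return (failed count, total count, failures grouped by target page)."""
    results = [m.groups() for line in report_lines if (m := RESULT.search(line))]
    failed = [target for status, target, _code in results if status == "FAIL"]
    # Drop the #anchor so failures on the same page are counted together.
    by_page = Counter(target.split("#")[0] for target in failed)
    return len(failed), len(results), by_page


if __name__ == "__main__":
    import sys

    n_failed, n_total, by_page = summarize(sys.stdin)
    print(f"{n_failed}/{n_total} links failed.")
    for page, count in by_page.most_common():
        print(f"{count:3d}  {page}")
```

Grouping by page makes it easier to see when many failures share a single target path.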

github-actions[bot] avatar Jul 20 '22 01:07 github-actions[bot]

I only worked on the index and Defining Pipelines pages so far. I'll remove this page and focus on Reproducing Pipelines next.

Actually, just Defining Pipelines has been more than enough for this PR given all the related satellite work (removing info from other docs, updating links, etc.). Please see the updated OP for details @shcheklein

jorgeorpinel avatar Jul 20 '22 06:07 jorgeorpinel

Lots more updates here. While they mainly simplify the docs (reduce duplication, clarify things), I realize the PR is growing a lot. I still want to refactor things, such as absorbing the Dependencies and outputs section of the stage add (and run) ref, but I guess ~~all~~ most cmd ref changes could be split into (yet) another PR to make this one reviewable ⏳

jorgeorpinel avatar Jul 21 '22 07:07 jorgeorpinel

This looks like a great addition to the user guide to me! A much more "holistic" look at pipelines than anything in the guide today. Thanks for taking this on @jorgeorpinel!

dberenbaum avatar Aug 03 '22 19:08 dberenbaum

@iterative/websites this review app deployed OK but is crashing. I can't tell why from the logs. I restarted the workers and it didn't help, and Idk how to destroy/recreate the whole env. Probably no big deal (I can sanity check locally) but you may want to take a look.

jorgeorpinel avatar Aug 04 '22 05:08 jorgeorpinel

@iterative/websites this review app deployed OK but is crashing.

It's because of a Twitter post. The user changed its visibility (so it throws an unauthorized error). I am trying to figure out the proper way to handle it instead of just removing it.

yathomasi avatar Aug 04 '22 10:08 yathomasi

This has happened before. I'm pretty sure what we did then was ask @jendefig to replace the tweet with another one. It could be better to just add the content in a blockquote if we remember it.

rogermparent avatar Aug 04 '22 14:08 rogermparent

@jorgeorpinel please redeploy it (rebase?)

shcheklein avatar Aug 06 '22 03:08 shcheklein

It overlaps with the dvc.yaml project structure now

Yeah. It's a generalized problem we have now. I'm working on a new master plan for DVC docs based on the proposed framework to minimize this ⏳ although some overlap is expected.

What is the purpose of this doc?

  1. It should be a mid-level doc that explains what pipelines are and how to use the feature (so far defining them).
  2. It's also where we can explain working concepts like codification, DAG, etc.
  3. Follows the general plan in https://github.com/iterative/dvc.org/issues/2883.

TBH the "explanation" type doc is still the least concrete in the framework. User Guides can have different levels and purposes depending on the feature; not as consistent as command references or get started pages. Rel. https://github.com/iterative/dvc.org/issues/144#issuecomment-1206648437

jorgeorpinel avatar Aug 23 '22 21:08 jorgeorpinel

Okay, @jorgeorpinel let's proceed with this and see how it evolves. We should be very careful though so that we don't make it just a collection of command reference or lang formal definitions.

shcheklein avatar Aug 23 '22 22:08 shcheklein

I worked on the Parameters info. as well now. I'm trying to keep each part of the doc purposeful on that topic:

  • [x] The new guide has a Parameter dependencies section to explain params from the PoV of pipelining.
  • [x] For the formalities (spec) there's Param deps in the dvc.yaml guide/ref.
  • [x] A parameters file reference may be missing (part of project structure). For now there's just a note about the supported formats in a few places. UPDATE: Made a small section about this for now.
  • [x] I simplified the dvc params ref but didn't remove it completely (unlike what we did for dvc plots recently): good to have a quick usage ref, esp. since params can relate to pipelines OR experiments, so no one guide can serve as a single source of truth.
  • [x] Note that significant (overlapping) info on params also exists in the Experiments guide -- should probably revisit that.

It's still not a perfect separation but I think it's better and helps us get closer to a standard framework... But I also realized we need some sort of conceptual map of the product features to come up with a good docs org. ⏳

p.s. Maybe this part should be a separate PR though... Lmk

jorgeorpinel avatar Aug 24 '22 09:08 jorgeorpinel

I haven't done a full review of the latest iteration but took a look where mentioned. I'm happy to let you two merge when ready, but let me know if you need a full review.

dberenbaum avatar Aug 30 '22 18:08 dberenbaum

happy to let you two merge when ready

I think it's mergeable @dberenbaum but needs an approval.

jorgeorpinel avatar Sep 02 '22 20:09 jorgeorpinel

@jorgeorpinel it looks better, but I still don't like it tbh.

I think the whole pipelines and defining pipelines section should be focused on the first section of the page (where we describe the process). I feel that formally describing the different types of outs, deps, and stages again doesn't make sense here (not least because it overlaps with the formal definition).

We should probably talk more about dvc exp init here (since it helps to bootstrap the dvc.yaml after all)?

We should provide some examples - actual pipeline files? Mention VS Code as an editor that supports schema definition, etc.

Include things like Jupyter notebooks - how to make a pipeline out of one ... etc.

wdyt @dberenbaum @jorgeorpinel ?

shcheklein avatar Sep 06 '22 18:09 shcheklein

Your proposal makes sense to me @shcheklein . I think we can update #2883 based on that, merge #3899 and this, and follow up on that as well as remaining topics for this guide (Reproduction, Operationalizing, Experimentation -- I have drafts of all these docs).

We should probably talk more about dvc exp init here?

That one I imagined should be a separate page (Experimenting with/ Experimental Pipelines), but we could def. mention exp init here and link to that new page when we get there.

jorgeorpinel avatar Sep 07 '22 04:09 jorgeorpinel

Agree with @jorgeorpinel that the proposals from @shcheklein make sense but we can include them in future PRs. As long as this PR improves upon the current docs and there's nothing wrong/blocking in it, can we merge?

dvc exp init and Jupyter notebook migration are still not well defined and might make this an endless PR IMHO. We are also having product discussions related to those ideas, so let me think about how we can consolidate the product and docs discussions here 🤔 🙏 .

dberenbaum avatar Sep 07 '22 18:09 dberenbaum