dvc.org
guide: ML Pipelines (1): Defining Pipelines & Stages
Related to #2883
- [x] Index page: General explanation about ML Pipelines
- [x] Defining Pipelines page. Covers:
  - Stages
  - Dependency graph (DAG)
  - `dvc.yaml` intro
  - Commands (`cmd`) details
  - Dependencies (overview)
- [x] Remove explanations on these topics from other pages (mainly `run`, `stage add`, and `repro` refs)
- [x] Update many links accordingly
- [x] Add links from appropriate basic concept tooltips
- [ ] Update Glossary accordingly?
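For readers of this thread who want the stage/DAG idea in concrete form, here is a minimal Python sketch. It is only an illustration of the concept the guide covers, not DVC's implementation: the stage names, file paths, and the helpers `build_dag` and `execution_order` are all hypothetical. A stage depends on another stage when one of its dependencies is that stage's output.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stages: each lists the files it reads (deps) and writes (outs).
stages = {
    "prepare": {"deps": ["data/raw.csv"], "outs": ["data/clean.csv"]},
    "train": {"deps": ["data/clean.csv", "train.py"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["model.pkl", "data/clean.csv"], "outs": ["metrics.json"]},
}


def build_dag(stages):
    """Map each stage to the set of stages that produce its dependencies."""
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    return {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }


def execution_order(stages):
    """Return stage names so that every stage comes after its upstream stages."""
    return list(TopologicalSorter(build_dag(stages)).static_order())


order = execution_order(stages)
print(order)
```

If two stages depended on each other's outputs, `static_order()` would raise `graphlib.CycleError`, which is the "acyclic" part of the DAG.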
In review app: https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/machine-learning-pipelines
Hi! Are we ready for reviews here? Still a draft
Gatsby Cloud Build Report
dvc.org
:tada: Your build was successful! See the Deploy preview here.
Build Details
:clock1: Build time: 1m
Performance
Lighthouse report
| Metric | Score |
|---|---|
| Performance | :large_orange_diamond: 61 |
| Accessibility | :green_heart: 98 |
| Best Practices | :large_orange_diamond: 83 |
| SEO | :green_heart: 93 |
:link: View full report
@jorgeorpinel do we plan to take it over or should we close this?
Taking this over. Let me see if we can get a decent guide quickly based on the existing content here (iterate once)... ⏳
Looks like this is a larger effort than I hoped. For now I separated the existing content and focused on the Pipeline Definition guide -- now we can remove some detail from https://dvc.org/doc/user-guide/project-structure/pipelines-files (needed for other work such as https://github.com/iterative/dvc.org/pull/3691#pullrequestreview-1031804580).
PTAL if you can @shcheklein
Link Check Report
content/docs/command-reference/dag.md
- FAIL: /doc/user-guide/data-pipelines/defining-pipelines#directed-acyclic-graph-dag = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines#directed-acyclic-graph-dag (404)
content/docs/command-reference/exp/init.md
- PASS: /doc/command-reference/config#exp = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/command-reference/config#exp (200)
- PASS: /doc/user-guide/project-structure/dvcyaml-files#stage-commands = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/project-structure/dvcyaml-files#stage-commands (200)
- FAIL: /doc/user-guide/data-pipelines/defining-pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines (404)
content/docs/command-reference/move.md
- FAIL: /doc/user-guide/data-pipelines/defining-pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines (404)
content/docs/command-reference/repro.md
- FAIL: /doc/user-guide/data-pipelines/defining-pipelines#avoiding-unexpected-behavior = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines#avoiding-unexpected-behavior (404)
- FAIL: /doc/user-guide/data-pipelines/defining-pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines (404)
content/docs/command-reference/run.md
- PASS: /doc/command-reference/repro = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/command-reference/repro (200)
- FAIL: /doc/user-guide/data-pipelines/defining-pipelines#avoiding-unexpected-behavior = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines#avoiding-unexpected-behavior (404)
- PASS: /doc/user-guide/project-structure/dvcyaml-files#stage-commands = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/project-structure/dvcyaml-files#stage-commands (200)
- FAIL: /doc/user-guide/data-pipelines/defining-pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines (404)
- PASS: /doc/user-guide/pipelines/defining-pipelines#parameter-dependencies = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/pipelines/defining-pipelines#parameter-dependencies (200)
content/docs/command-reference/stage/add.md
- PASS: /doc/command-reference/repro = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/command-reference/repro (200)
- FAIL: /doc/user-guide/data-pipelines/defining-pipelines#pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines#pipelines (404)
- PASS: /doc/user-guide/project-structure/dvcyaml-files#stage-commands = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/project-structure/dvcyaml-files#stage-commands (200)
- FAIL: /doc/user-guide/data-pipelines/defining-pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines (404)
- PASS: /doc/user-guide/pipelines/defining-pipelines#parameter-dependencies = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/pipelines/defining-pipelines#parameter-dependencies (200)
content/docs/command-reference/stage/index.md
- FAIL: /doc/user-guide/data-pipelines/defining-pipelines#stages = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines#stages (404)
content/docs/start/data-management/metrics-parameters-plots.md
- PASS: /doc/start/data-pipelines#dependency-graphs-dag = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/start/data-pipelines#dependency-graphs-dag (200)
content/docs/start/data-management/pipelines.md
- FAIL: /doc/user-guide/data-pipelines/defining-pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines (404)
content/docs/user-guide/basic-concepts/pipeline.md
- PASS: /doc/user-guide/project-structure/dvcyaml-files = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/project-structure/dvcyaml-files (200)
content/docs/user-guide/experiment-management/running-experiments.md
- FAIL: /doc/user-guide/data-pipelines/defining-pipelines#directed-acyclic-graph = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines#directed-acyclic-graph (404)
content/docs/user-guide/pipelines/defining-pipelines.md
- PASS: https://www.gnu.org/software/software.html (200)
- PASS: /doc/user-guide/external-dependencies = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/external-dependencies (200)
- PASS: /doc/use-cases/versioning-data-and-model-files = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/use-cases/versioning-data-and-model-files (200)
- PASS: /doc/start/data-management = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/start/data-management (200)
- PASS: /doc/command-reference/metrics = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/command-reference/metrics (200)
- PASS: /doc/user-guide/visualizing-plots = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/visualizing-plots (200)
- PASS: https://en.wikipedia.org/wiki/Directed_acyclic_graph (200)
- PASS: /doc/user-guide/project-structure/dvcyaml-files#stage-entries = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/project-structure/dvcyaml-files#stage-entries (200)
- PASS: /doc/user-guide/project-structure/dvcyaml-files#stage-commands = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/project-structure/dvcyaml-files#stage-commands (200)
- PASS: /doc/user-guide/project-structure/pipelines-files#templating = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/project-structure/pipelines-files#templating (200)
- PASS: /doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory (200)
- PASS: /doc/user-guide/project-structure/dvcyaml-files#parameters-files = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/project-structure/dvcyaml-files#parameters-files (200)
- FAIL: https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/ (403)
- PASS: /doc/user-guide/project-structure/dvcyaml-files#parameters = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/project-structure/dvcyaml-files#parameters (200)
content/docs/user-guide/pipelines/index.md
- PASS: /doc/start/data-management/pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/start/data-management/pipelines (200)
- FAIL: /doc/user-guide/data-pipelines/defining-pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines (404)
content/docs/user-guide/project-structure/dvcyaml-files.md
- PASS: https://www.gnu.org/software/software.html (200)
content/docs/user-guide/project-structure/internal-files.md
- FAIL: /doc/user-guide/data-pipelines/defining-pipelines#avoiding-unexpected-behavior = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines#avoiding-unexpected-behavior (404)
content/docs/user-guide/related-technologies.md
- FAIL: /doc/user-guide/data-pipelines/defining-pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines (404)
- FAIL: /doc/user-guide/data-pipelines/defining-pipelines#directed-acyclic-graph-dag = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines#directed-acyclic-graph-dag (404)
content/docs/user-guide/what-is-dvc.md
- FAIL: /doc/user-guide/data-pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines (404)
18/43 links failed.
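In the report above, 17 of the 18 failures point under `/doc/user-guide/data-pipelines`, while the links to the new guide that pass use `/doc/user-guide/pipelines`. A small Python sketch like the one below can tally a report in this format and group the failures by leading path. The `summarize` helper and its regular expression are assumptions based only on the line format shown here, not part of the link checker itself; the sample lines are copied from the report.

```python
import re
from collections import Counter

# Matches result lines such as "- FAIL: /doc/... = https://... (404)".
LINE = re.compile(r"^- (PASS|FAIL): (\S+)")


def summarize(report_lines):
    """Count PASS/FAIL entries and group failed links by their leading path."""
    status = Counter()
    failed_prefixes = Counter()
    for line in report_lines:
        match = LINE.match(line.strip())
        if not match:
            continue  # file names and other non-result lines
        result, link = match.groups()
        status[result] += 1
        if result == "FAIL":
            path = link.split("#", 1)[0]  # drop any #fragment
            if path.startswith("/"):
                # Keep the first three path segments, e.g. /doc/user-guide/x
                path = "/".join(path.split("/")[:4])
            failed_prefixes[path] += 1
    return status, failed_prefixes


sample = [
    "content/docs/command-reference/stage/index.md",
    "- FAIL: /doc/user-guide/data-pipelines/defining-pipelines#stages = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines/defining-pipelines#stages (404)",
    "- PASS: /doc/command-reference/repro = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/command-reference/repro (200)",
    "- FAIL: /doc/user-guide/data-pipelines = https://dvc-org-iesahin-ug-pipe-qykfpz.herokuapp.com/doc/user-guide/data-pipelines (404)",
    "- FAIL: https://machinelearningmastery.com/difference-between-a-batch-and-an-epoch/ (403)",
]

status, failed_prefixes = summarize(sample)
print(status)
print(failed_prefixes.most_common())
```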
I only worked on the index and Defining Pipelines pages so far. I'll remove this page and focus on Reproducing Pipelines next.
Actually, just Defining Pipelines has been more than enough for this PR given all the related satellite work (removing info from other docs, updating links, etc.). Please see the updated OP for details @shcheklein
Lots more updates here. While they mainly simplify the docs (reduce duplication, clarify things), I realize the PR is growing a lot. I still want to refactor things such as absorbing the Dependencies and outputs section of the `stage add` (and `run`) ref., but I guess ~~all~~ most cmd ref changes could be split into (yet) another PR to make this one reviewable ⏳
This looks like a great addition to the user guide to me! A much more "holistic" look at pipelines than anything in the guide today. Thanks for taking this on @jorgeorpinel!
@iterative/websites this review app deployed OK but is crashing. I can't tell why from the logs. I restarted the workers and it didn't help, and Idk how to destroy/recreate the whole env. Probably no big deal (I can sanity check locally) but you may want to take a look.
> @iterative/websites this review app deployed OK but is crashing.
It's because of a Twitter post. The user changed its visibility (so it throws an unauthorized error). I am trying to figure out the proper way to handle it instead of just removing it.
This has happened before. I'm pretty sure what we did then was ask @jendefig to replace the tweet with another one. It could be better to just add the content in a blockquote if we remember it.
@jorgeorpinel please redeploy it (rebase?)
It overlaps with the dvc.yaml project structure now
Yeah. It's a generalized problem we have now. I'm working on a new master plan for DVC docs based on the proposed framework to minimize this ⏳ although some overlap is expected.
What is the purpose of this doc?
- It should be a mid-level doc that explains what pipelines are and how to use the feature (so far defining them).
- It's also where we can explain working concepts like codification, DAG, etc.
- Follows the general plan in https://github.com/iterative/dvc.org/issues/2883.
TBH the "explanation" type doc is still the least concrete in the framework. User Guides can have different levels and purposes depending on the feature; not as consistent as command references or get started pages. Rel. https://github.com/iterative/dvc.org/issues/144#issuecomment-1206648437
Okay, @jorgeorpinel let's proceed with this and see how it evolves. We should be very careful though so that we don't make it just a collection of command reference or lang formal definitions.
I worked on the Parameters info. as well now. I'm trying to keep each part of the doc purposeful on that topic:
- [x] The new guide has a Parameter dependencies section to explain params from the PoV of pipelining.
- [x] For the formalities (spec) there's Param deps in the dvc.yaml guide/ref.
- [x] A parameters file reference may be missing (part of project structure). For now there's just a note about the supported formats in a few places. UPDATE: Made a small section about this for now.
- [x] I simplified the `dvc params` ref but didn't remove it completely (unlike what we did for `dvc plots` recently): good to have a quick usage ref, esp since params can relate to pipelines OR experiments, so no one guide can serve as a single source of truth.
- [x] Note that significant (overlapping) info on params also exists in the Experiments guide -- should probably revisit that.
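To make "params from the PoV of pipelining" concrete: a stage that depends on specific parameter values only needs to be rerun when those values change, not when anything else in the parameters file does. The Python sketch below illustrates that idea only. `get_param`, `changed_params`, and the sample values are hypothetical and are not taken from DVC's code.

```python
def get_param(params, dotted_key):
    """Look up a nested value such as 'train.epochs' in a params dict."""
    value = params
    for part in dotted_key.split("."):
        value = value[part]
    return value


def changed_params(old, new, tracked):
    """Return the tracked keys whose values differ between two snapshots."""
    return [key for key in tracked if get_param(old, key) != get_param(new, key)]


# Hypothetical contents of a parameters file before and after an edit.
old = {"train": {"epochs": 10, "lr": 0.01}, "seed": 42}
new = {"train": {"epochs": 20, "lr": 0.01}, "seed": 7}

# A stage tracking only the train.* values ignores the change to "seed".
print(changed_params(old, new, ["train.epochs", "train.lr"]))
```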
It's still not a perfect separation but I think it's better and helps us get closer to a standard framework... But I also realized we need some sort of conceptual map of the product features to come up with a good docs org. ⏳
p.s. Maybe this part should be a separate PR though... Lmk
I haven't done a full review of the latest iteration but took a look where mentioned. I'm happy to let you two merge when ready, but let me know if you need a full review.
> happy to let you two merge when ready
I think it's mergeable @dberenbaum but needs an approval.
@jorgeorpinel it looks better, I still don't like it tbh.
I think the whole pipelines and defining pipelines section should be focused on the first section of the page (where we describe the process). I feel that describing again formally different types of outs, deps, stage doesn't make sense here (at least because it overlaps with a formal definition).
We should probably talk more about `dvc exp init` here (since it helps to bootstrap the `dvc.yaml` after all)?
We should provide some examples: actual pipeline files? Mention VS Code as an editor that supports the schema definition, etc.
Include things like Jupyter notebooks: how to make a pipeline out of one, etc.
wdyt @dberenbaum @jorgeorpinel ?
Your proposal makes sense to me @shcheklein . I think we can update #2883 based on that, merge #3899 and this, and follow up on that as well as remaining topics for this guide (Reproduction, Operationalizing, Experimentation -- I have drafts of all these docs).
> We should probably talk more about `dvc exp init` here?
That one I imagined should be a separate page (Experimenting with/ Experimental Pipelines), but we could def. mention `exp init` here and link to that new page when we get there.
Agree with @jorgeorpinel that the proposals from @shcheklein make sense but we can include them in future PRs. As long as this PR improves upon the current docs and there's nothing wrong/blocking in it, can we merge?
`dvc exp init` and Jupyter notebook migration are still not well defined and might make this an endless PR IMHO. We are also having product discussions related to those ideas, so let me think about how we can consolidate the product and docs discussions here 🤔 🙏 .