dvc.org
guide: add "Best Practices"
~~UPDATE: Possibly as a How To guide (see #899)~~
Looks like we need a special section describing how to organize your projects:
- [ ] how to use DVC with DB (see https://github.com/iterative/dvc.org/issues/594)
- ~~our default `Dvcfile` trick~~
- [x] manually editing dvc.yaml + `dvc commit` or `dvc repro` (see also https://github.com/iterative/dvc.org/issues/230#issuecomment-511769103): it's safe to edit DVC files, no need to touch or update `md5`, DVC will take care of it. UPDATE: See #2578
- ~~specify meaningful stage names with `-f`~~
- [ ] creating a pipeline in a 'debug' directory and then ~~moving it to different data sets~~
- ~~creating a pipeline in a 'debug' directory and then modifying respective DVC files to~~ set different data sets as an input
- [x] add "use meta to preserve your content" - #306
- [ ] never store user credentials in the DVC project config
- [x] one vs many dvc.yaml files (from https://github.com/iterative/dvc.org/issues/2170#issuecomment-776141460)
See also the latest relevant https://github.com/iterative/dvc.org/issues/72#issuecomment-682868683 and below.
Also worth mentioning our default Dvcfile trick.
it's safe to edit dvc files, no need to touch or update md5, dvc will take care of it
Specify meaningful stage names
Also, we are not sure about branches anymore.
Never store user credentials in the DVC project config.
@shcheklein we should keep branches - it is a good practice. However, we should mention that for some cases like hyperparameters tuning branches are not very relevant.
@dmpetrov agreed, you are right. I probably just wanted to highlight that we should not be pushing branches as the single best option in all cases - there are tags, directories, maybe even mention other tools for now?
@shcheklein also we need to implement experiment dir\output feature for hyperparameters tuning use case (Stefan's use case).
add "use meta to preserve your content" - https://github.com/iterative/dvc.org/issues/306
Hi @shcheklein. I would like to work on this issue.
@Soumya0803 sure! feel free to write a document for this. Please join our chat dvc.org/chat, we have separate #dev-docs channel if you have any questions.
Is not "Best Practices" the same as "Use Cases"? Maybe we should rename "Use Cases" ==> "Best Practices"
@dashohoxha no, it's not the same. "Best Practices" are relatively small tricks and pieces of advice you should use to be efficient with DVC. They are usually general and do not depend on your specific use case.
Should this be merged with #230 and featured in the #899 epic? We're trying to avoid so many sections now.
Also, the Questions part of What is DVC? (currently in https://dvc.org/doc/user-guide/what-is-dvc/collaboration-issues#questions) probably overlaps with this.
@jorgeorpinel Yeah, that indeed seems suitable.
@jorgeorpinel what would it look like? Like a subsection in How To?
Just a single document under How To.
I updated the description of this issue and in fact I think #230 is already included here, in the "manually editing dvc.yaml + dvc commit or dvc repro" checkbox.
UPDATES:
> Just a single document under How To.
We are currently following this approach in #1705 but I'm not sure it will stick. Maybe Best Practices should be in the form of Explanation (a regular user guide, or directly under Home, even) and not as a How-to (problem-solution format). We'll see...
- ~~And another best practice to write about is on tracking/versioning compressed archives, composite binaries, even video perhaps (see this support case)~~ - overlaps with #682 though.
Another one (or anti-practice):
- [ ] Avoid dynamic names (and other non-deterministic behavior — mentioned in dvc run ref). See this support case for context.
@efiop do you think how to: add a page for Managing Experiments #816 would be better as a best practice too? Instead of a how-to as it's requested now. Thanks
It's definitely not `How to`. It is of the same level as `Managing data`, etc. To my mind, a section like Managing Experiments should be within Get started, Use Cases, and User Guide at the top level.
More:
- [ ] how to work with overlapping stage output locations (e.g. hopefully with wildcards in deps/outs soon) — see https://discuss.dvc.org/t/managing-pipelines-operating-per-dataset-element/613/4 for a current alternative.
- [ ] DVC in Production setup (see https://github.com/iterative/dvc.org/issues/862#issuecomment-848396315)
- [ ] Add `dvc version` to your first DVC repo commit? Or another way to know what version(s) you've used (since file formats may change, especially between major versions).
Looking at open check boxes, most or all of these topics are addressed I think (the relevant ones at least).
> how to use DVC with DB

This would be a how-to, but is it still something we want to have official docs for? It doesn't really match DVC's approach.
> creating a pipeline in a 'debug' directory and then ... set different data sets as an input

@efiop is this some sort of bootstrapping method? Is it really something people do? What problem does it solve?
> never store user credentials in the DVC project config

We do stress the use of `--local` for sensitive configurations (especially in `remote modify`). Should be enough I believe.
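For anyone who wants a guard on top of the docs, here is a minimal sketch of a check that flags credential-like options in the project config text. The function name `find_sensitive_options` and the `SENSITIVE_KEYS` list are hypothetical and deliberately non-exhaustive; it is a plain line scan, not DVC's own config parser.

```python
import re

# Illustrative, non-exhaustive list of option names that usually hold secrets.
# This list is an assumption for the sketch; extend it for your remote type.
SENSITIVE_KEYS = {
    "password",
    "access_key_id",
    "secret_access_key",
    "session_token",
}

# Matches "name = value" option lines, with optional leading indentation.
OPTION_RE = re.compile(r"^\s*([A-Za-z0-9_]+)\s*=")


def find_sensitive_options(config_text):
    """Return (line_number, option_name) pairs for credential-like options."""
    hits = []
    for lineno, line in enumerate(config_text.splitlines(), start=1):
        match = OPTION_RE.match(line)
        if match and match.group(1).lower() in SENSITIVE_KEYS:
            hits.append((lineno, match.group(1)))
    return hits
```

This could be run on the contents of `.dvc/config` in a pre-commit hook or CI job; anything it reports would be better set with `--local`, which keeps the value in `.dvc/config.local` and out of Git.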
> Avoid dynamic names

We mention non-deterministic behavior in general in https://dvc.org/doc/command-reference/stage/add#avoiding-unexpected-behavior and avoiding ad hoc file naming for versioning is a core use case.
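To make the "dynamic names" point concrete, a small sketch of what a deterministic output name looks like inside a stage script. The helper `output_path` and the `.features.csv` suffix are hypothetical, chosen only for illustration.

```python
from pathlib import Path


def output_path(out_dir, dataset_name):
    """Build an output path that depends only on the stage's declared inputs."""
    # Same inputs give the same path on every run, so the file can be
    # listed as a stage output ahead of time.
    return Path(out_dir) / f"{dataset_name}.features.csv"


# Anti-pattern, shown for contrast only (not executed):
#   Path(out_dir) / f"features_{datetime.now():%Y%m%d_%H%M%S}.csv"
# The name changes on each run, so it cannot be declared as an output up front.
```

The idea is to let DVC and Git do the versioning of a fixed path, instead of encoding the version in the file name.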
> all of these topics are addressed

So we could close this ticket. That said we still don't have a "Best Practices" section or guide(s). Do we want to? Maybe Use Cases or the trails of the Get Started already cover this need (informing users of the main/recommended patterns for DVC project setup/usage).
WDYT @shcheklein @dberenbaum ? Thanks
I'm fine to close this. Not all items are covered though and yes, we could have done a good page that contains tips/faq for pipelines, for data, general project structure (e.g. never store user credentials in the DVC project config) ...