
cases: list of ideas

Open jorgeorpinel opened this issue 4 years ago • 8 comments

1. Data Management

  • [x] Data Versioning - https://dvc.org/doc/use-cases/versioning-data-and-model-files
  • [ ] "Organizing Team Datasets" per https://github.com/iterative/dvc.org/issues/3186
  • [x] Data Registries - https://dvc.org/doc/use-cases/data-registries
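For reference, a minimal CLI sketch of the flow these two pages cover; the repository URL and paths below are placeholders:

```
# Version a dataset in this project and upload it to remote storage
dvc add data/raw
git add data/raw.dvc data/.gitignore
git commit -m "Track raw dataset with DVC"
dvc push

# Reuse data published by another DVC repository (a "data registry")
dvc import https://github.com/example/dataset-registry images/train -o data/train

# Later, bring in upstream changes to the imported dataset
dvc update data/train.dvc
```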

2. Data Pipeline development

From https://github.com/iterative/dvc.org/issues/2544#issuecomment-857170399 below

model development may include data validation and preprocessing followed by model training and evaluation... compose this as a DAG where I can easily and efficiently run only the necessary stages... iteratively update data, add features, tune models, etc. (overlaps with 3.)
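To make the idea concrete, a minimal sketch of such a pipeline with DVC stages; the script, data, and parameter names below are made up:

```
# Define stages; DVC records them in dvc.yaml and connects them into a DAG
dvc stage add -n validate -d validate.py -d data/raw -o data/validated \
    "python validate.py"
dvc stage add -n preprocess -d preprocess.py -d data/validated -o data/features \
    "python preprocess.py"
dvc stage add -n train -d train.py -d data/features -p train.lr,train.epochs \
    -o model.pkl "python train.py"
dvc stage add -n evaluate -d evaluate.py -d model.pkl -M metrics.json \
    "python evaluate.py"

dvc dag     # inspect the pipeline graph
dvc repro   # after a change, only the affected stages are re-executed
```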

3. Experiment Management

From https://github.com/iterative/dvc.org/issues/2270#issuecomment-793269476

Preliminary ideas:

  • ~~Hyperparameter space exploration [Tuning/Optimization]? May be too low level.~~ There's a blog post about this now.
  • [x] Experiment ~~Bookkeeping~~ Tracking (with Git): Rapid iterations. UPDATE: https://github.com/iterative/dvc.org/pull/2782
  • Visualizing Data Science (experiments + params, metrics/plots) + Viewer
  • [ ] Experiment execution and orchestration (exp+machine+CML?)

    From https://github.com/iterative/dvc.org/pull/2782#pullrequestreview-781487421

Here we should sell against W&B, MLflow, etc.: rapid iterations, live metrics + other metrics + navigation
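For context, roughly the workflow this use case would need to sell, in CLI terms (the parameter names and the experiment name below are placeholders):

```
# Rapid iterations: each run is captured as an experiment, without Git overhead
dvc exp run -S train.lr=0.01
dvc exp run -S train.lr=0.001 -S train.epochs=30

# Compare parameters, metrics (and plots) across iterations
dvc exp show
dvc plots diff    # accepts experiment names/revisions to compare

# Bring the best run's changes into the workspace (to commit it), or share it
dvc exp apply exp-a1b2c        # hypothetical experiment name
dvc exp push origin exp-a1b2c  # push the experiment ref to the Git remote
```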

4. Production environments/ MLOps

From https://github.com/iterative/dvc.org/issues/2490#issuecomment-853561046

4.1 DVC in Production

  • Training remotely
  • Deploying models (CLI or API)
  • Keep pipelines, artifacts in sync between environments
  • Batch scoring a.k.a. "DVC for ETL" - see https://github.com/iterative/dvc.org/issues/2512#issuecomment-854999981 + Distributed/parallel computing

Good example of user perspective: https://discord.com/channels/485586884165107732/485596304961962003/872860674529845299
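A rough CLI sketch of the 4.1 ideas above; the repository URL, revision, and file names (including score.py) are placeholders:

```
# Keep pipelines and artifacts in sync between environments
git pull && dvc pull          # bring code, data and models into the training env
dvc repro && dvc push         # train remotely, publish the resulting artifacts

# Batch scoring a.k.a. "DVC for ETL": fetch the released model inside a scoring job
dvc get https://github.com/example/project model.pkl --rev v1.2.0 -o model.pkl
python score.py --model model.pkl --input batch.csv --output predictions.csv
```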

4.2 ML Model Registry

  • Model lifecycle (training, shadow, active, inactive)
  • Automated/Continuous training (remotely)
  • Discovery and reusability
  • Deploying models
  • Batch scoring example + Real-time inference
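One possible shape for the lifecycle part, sketched with plain Git tags as a purely hypothetical registration convention (the use case may well end up prescribing something else):

```
# Register a trained model version under a tag naming convention (hypothetical)
git tag -a model@v1.2.0 -m "register model v1.2.0"
# Promote that version to a lifecycle stage (training, shadow, active, ...)
git tag -a model@prod -m "promote model v1.2.0 to production" model@v1.2.0^{commit}
git push origin model@v1.2.0 model@prod

# Downstream jobs discover and fetch the active model by tag
dvc get https://github.com/example/project model.pkl --rev model@prod -o model.pkl
```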

4.3 Production Integrations

  • Databases (e.g. SQL dump versioning/preprocessing)
  • Spark (e.g. remote training)
  • Airflow (e.g. batch scoring)
  • Kafka (e.g. real-time predictions)
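For the database item, a minimal sketch of what SQL dump versioning could look like in a scheduled job; the connection details, table, and database names are placeholders:

```
# Snapshot a table, then version the dump with DVC and upload it to remote storage
pg_dump --host db.internal --username etl --table events warehouse > data/events.sql
dvc add data/events.sql
git add data/events.sql.dvc data/.gitignore
git commit -m "Snapshot events table"
dvc push
```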

4.4 End-to-end scenario with a combination from above, e.g.:

  • Importing data from Spark
  • Training remotely
  • Model Registry Ops
  • Batch scoring (Airflow integration)

jorgeorpinel avatar Jun 08 '21 18:06 jorgeorpinel

Do we need a development/pipelines-related use case? We have https://dvc.org/doc/use-cases/versioning-data-and-model-files, which addresses model development but focuses on versioning and not pipelines. My model development may include data validation and preprocessing followed by model training and evaluation, and I iteratively update data, add features, tune models, etc. Pipelines can help compose this as a DAG with distinct stages where I can easily and efficiently execute the pipeline and run only the necessary stages when I make changes.

dberenbaum avatar Jun 08 '21 21:06 dberenbaum

model development may include data validation and preprocessing followed by model training and evaluation... compose this as a DAG where I can easily and efficiently run only the necessary stages

Good catch! And in fact that probably goes even "before" experiment mgmt or production envs/ MLOps.

iteratively update data, add features, tune models, etc

This overlaps with experiment management, which is fine. But if it's too much, we can leave it for the Exps-related use case (and just mention it).

UPDATE: Added to description

jorgeorpinel avatar Jun 08 '21 21:06 jorgeorpinel

Airflow (e.g. batch scoring) ... End-to-end scenario

Cc @mnrozhkov I know you've worked quite a bit on this topic. So just pinging you here for visibility

P.S. Our docs' use cases are not enterprise-level so far, but rather high-level and short. If you'd be interested in drafting one around these topics using your existing material, please lmk!

jorgeorpinel avatar Aug 06 '21 04:08 jorgeorpinel

Guys, I'm giving this priority again per our current roadmap (now that #2587 is basically finished). I think Experiment Management is the most needed topic now, and it's along the lines of what @iesahin and I are working on (rel. #2548). But if anyone thinks another direction should have higher priority, please comment.

And if we agree on Exp Mgmt., what should be the spin? I.e., the user-perspective problem/solution and key concepts. I discussed this briefly with @shcheklein and we think it could be centered around running and managing rapid iterations in DS projects (without Git overhead), with the concepts of bookkeeping, hyperparameters, metrics, and visualization.

What do you think? Cc @dberenbaum @flippedcoder @jendefig @casperdcl @tapadipti @dmpetrov @pmrowla

jorgeorpinel avatar Aug 06 '21 04:08 jorgeorpinel

Bookkeeping + visualization seems the most relevant path to follow. Something along the lines of "push experiments to a central repository and see their comparative plots."

iesahin avatar Aug 07 '21 05:08 iesahin

Some ideas for 4 (re: production environments/ MLOps)

path from development to production could be better... as a mode of operation I would favor a model where runs (e.g. artifacts, metrics, params, etc.) are pushed to production from a development environment. I am arguing for a model like git with remotes... where runs are captured locally first and then if confirmed a run can be pushed to a remote server. A model like this just keeps things more tidy... authentication could also be directly supported to make it easier to deploy for production... For more production-oriented organizations ... for example production model monitoring

From https://megagon.ai/blog/whatmlflowsolvesanddoesntforus/
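For what it's worth, that "git with remotes" model maps fairly directly onto existing commands; a rough sketch, with placeholder remote and experiment names:

```
# Runs are captured locally first...
dvc exp run -S train.lr=0.01

# ...then a confirmed run is applied and pushed out
dvc exp apply exp-a1b2c          # hypothetical experiment name
dvc push -r prod-storage         # upload its artifacts to a shared DVC remote
dvc exp push origin exp-a1b2c    # share the experiment ref via the Git remote
```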

jorgeorpinel avatar Dec 07 '21 20:12 jorgeorpinel

Interesting diagram inspiration for 4.3 or 4.4


From https://medium.com/google-cloud/migrate-kedro-pipeline-on-vertex-ai-fa3f2c6f7aad

jorgeorpinel avatar Jan 04 '22 01:01 jorgeorpinel

4.3 Production Integrations: Databases (e.g. SQL dump versioning/preprocessing), Spark (e.g. remote training), Airflow (e.g. batch scoring), Kafka (e.g. real-time predictions)

  • Feast/feature stores: https://discord.com/channels/485586884165107732/563406153334128681/969249645073145896

dberenbaum avatar Apr 28 '22 15:04 dberenbaum