
MDIM/Explorer config harmonization

Open lucasrodes opened this issue 9 months ago • 10 comments

Currently, we have tooling for MDIMs and ETL-export Explorers in etl.multidim and etl.explorer. However, these two modules could share some logic. The current structure is not very suitable for this.

Generally, this tooling is used for handling config files, which ideally should be very close to one another.

Proposal

  • New module: etl.collections. In it, we have:
    • etl.collections.base (preliminary name): which contains common logic.
    • etl.collections.utils (preliminary name): which contains smaller util functions.
    • etl.collections.multidim: MDIM-specific logic
    • etl.collections.explorer: Explorer-specific logic.
    • etc.

Others:

  • We need classes for the config of MDIMs and Explorers (wrappers around the plain YML config).
  • We need to figure out a way to load the enriched config from a step. Currently, we rely on paths.load_mdim_config, which returns a plain dictionary. This function is in etl.helpers, which can't really import from etl.collections.multidim, since the latter already imports the former.
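
As a minimal sketch of what a config wrapper class could look like: the class name CollectionConfig and the fields title, dimensions and views are assumptions made for illustration, not the actual ETL API or schema.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class CollectionConfig:
    """Hypothetical wrapper around a plain config dictionary (e.g. parsed from YAML).

    The field names below are illustrative assumptions, not the real schema.
    """

    title: str
    dimensions: List[Dict[str, Any]] = field(default_factory=list)
    views: List[Dict[str, Any]] = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: Dict[str, Any]) -> "CollectionConfig":
        # Fail early on a missing required key instead of deep inside a step.
        if "title" not in raw:
            raise ValueError("config is missing required key 'title'")
        return cls(
            title=raw["title"],
            dimensions=list(raw.get("dimensions", [])),
            views=list(raw.get("views", [])),
        )

    def to_dict(self) -> Dict[str, Any]:
        # Convert back to a plain dictionary, e.g. for serialization.
        return {
            "title": self.title,
            "dimensions": self.dimensions,
            "views": self.views,
        }
```

A step could then call CollectionConfig.from_dict on whatever plain dictionary the loader returns. For the circular import, one common option is to defer the import of the wrapper class into the body of the loading function; whether that fits etl.helpers is an open question.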

Related work

  • https://github.com/owid/etl/pull/4030
  • https://github.com/owid/etl/pull/4035
  • https://github.com/owid/etl/pull/4106
  • https://github.com/owid/etl/pull/4095

lucasrodes avatar Feb 25 '25 17:02 lucasrodes

Partly addressing this in https://github.com/owid/etl/pull/4030

lucasrodes avatar Feb 25 '25 17:02 lucasrodes

Let us know when you plan to move files or a big restructure, please. There are some open PRs for mdims and rebasing them could get nasty.

Marigold avatar Feb 26 '25 07:02 Marigold

I've merged my changes from https://github.com/owid/etl/pull/4030, which set up the etl.collections module and moved some of the logic we had in etl.multidim into it.

I'm resolving conflicts now:

  • [x] #4000

lucasrodes avatar Feb 26 '25 10:02 lucasrodes

Hey @lucasrodes thanks a lot for starting this. For this specific issue, is there anything left to do, or do you want to consider it a tracking issue?

pabloarosado avatar Feb 27 '25 10:02 pabloarosado

Hey Pablo! This issue is rather high-level, so it probably belongs to the "tracking realm".

I'll keep an updated list of related PRs/issues in the description.

lucasrodes avatar Feb 27 '25 10:02 lucasrodes

Thanks @lucasrodes, I see some overlap with:

  • https://github.com/owid/etl/issues/3969
  • https://github.com/owid/etl/issues/3992

Feel free to integrate this into one of those already existing issues, to avoid too much dispersion.

pabloarosado avatar Feb 27 '25 10:02 pabloarosado

@pabloarosado The description of #3969 already points readers to #3992, so I'll go ahead and close it.

On https://github.com/owid/etl/issues/3992, I think it is more general than this one, since it also considers things like: update workflow, wizard app, csv-to-etl migration of explorers, etc. This issue is instead focussed on harmonizing how we handle the configs of ETL-based explorers and mdims in ETL, so they almost feel the same.

Note that this issue is actually mentioned in the description of #3992 (point 2).

I've edited the description of this issue a bit to make it more explicit.

lucasrodes avatar Feb 27 '25 11:02 lucasrodes

I've finished the first iteration of harmonizing the tooling for Explorers and MDIMs in https://github.com/owid/etl/pull/4035. COVID explorer and mdims now use this new logic.

Find below a more detailed description of the improvements from this work.

Model summary

I've abstracted config logic from Explorers and MDIMs, and put a model in place (very similar to what we have in owid.catalog.meta).

Diagram in etl.collections

flowchart LR
    %% Define nodes
    A[explorer]
    B[multidim]
    C[common]
    D[model]
    E[utils]

    %% Define edges
    A --> D
    A --> C
    A --> E

    B --> D
    B --> C
    B --> E

    C --> D

    D --> E

Summary of the module structure:

  • model
    Encapsulates the abstraction of the data model used by both explorer and multidim.
  • explorer & multidim
    Provide specialized tooling (e.g., logic, user-facing features) specific to Explorer and Multidimensional capabilities.
  • common
    Contains shared functionality and helper logic extracted from explorer and multidim so they can both utilize it.
  • utils
    Includes general-purpose utilities used by any module. Should remain independent (i.e., does not import from other modules) to prevent circular dependencies.
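
The "no circular dependencies" property of the diagram can be checked mechanically. The sketch below copies the edges from the flowchart above into a dictionary and uses the standard-library graphlib module to derive a valid import order; the names IMPORTS, import_order and has_cycle are illustrative only and do not exist in ETL.

```python
from graphlib import CycleError, TopologicalSorter

# Each module maps to the set of sibling modules it imports from,
# copied from the edges of the flowchart above.
IMPORTS = {
    "explorer": {"model", "common", "utils"},
    "multidim": {"model", "common", "utils"},
    "common": {"model"},
    "model": {"utils"},
    "utils": set(),
}


def import_order(imports):
    """Return the modules ordered so that each one comes after its dependencies."""
    # static_order() raises CycleError if the graph contains a cycle.
    return list(TopologicalSorter(imports).static_order())


def has_cycle(imports):
    """Return True if the import graph contains a circular dependency."""
    try:
        import_order(imports)
    except CycleError:
        return True
    return False


order = import_order(IMPORTS)
```

With these edges, utils comes first, then model, then common, followed by explorer and multidim in either order. Adding an import from utils back to any other module would make has_cycle return True, which is why utils should stay independent.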

Examples

MDIM

upsert_multidim_data_page encapsulates logic on validation, processing and upserting to DB. https://github.com/owid/etl/blob/ff822ac73ed6b9d86bdc3a61cd4716ac880312d8/etl/steps/export/multidim/covid/latest/covid.py#L37-L45

Explorer

create_dataset encapsulates logic on validation, processing and uploading to owid-content. In the future, we should probably have an upsert_explorer method. https://github.com/owid/etl/blob/cbaefcdaff4a7ee1c510d9740fc1f8f1073cddeb/etl/steps/export/explorers/covid/latest/covid.py#L33-L46

Future work

  • We need a schema for Explorers, to validate the YAML config, like we do with MDIMs.
    • The config of explorers is still specific to explorers (minor differences now); I think we should try to modify it and bring it closer to MDIMs.
    • At the moment the source of truth for the "schema" of an explorer view seems to live in a TypeScript file. It should have its own JSON file!
  • We need a place in the DB for explorers (connected to the point above about defining a schema).
  • Test these changes by migrating some explorers to use this model.
  • Add some docs as it becomes clearer.
  • Wizard templating.
  • Test other kinds of explorers: code/yaml combinations.

lucasrodes avatar Mar 04 '25 13:03 lucasrodes

Thanks @lucasrodes, this is looking really good! One idea to make the workflow of explorers/mdims a bit more straightforward (more similar to any other ETL data step) would be to adapt helpers.PathFinder to handle the creation of the explorer/mdim. So, instead of the user having to import things from multidim or explorer (e.g. the upsert_... function), PathFinder could already know which one to use depending on the type of step. For example, we could either have a paths.create_explorer and a paths.create_multidim method, or just a common paths.create_collection. I'm not strongly opinionated about this; if you prefer to have very different types of code for data steps, explorers and mdims, that's ok too. We can talk a bit more about this on Thursday, during shaping. Thanks for doing this work!
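
A rough sketch of the create_collection idea, dispatching on the type of step. Everything here is hypothetical: PathFinderSketch stands in for helpers.PathFinder, the two creator functions are placeholders for the real explorer and mdim logic, and the step type names are taken from the export/explorers and export/multidim step paths.

```python
from typing import Any, Callable, Dict


def _create_explorer(config: Dict[str, Any]) -> str:
    # Placeholder for the explorer-specific creation logic.
    return f"explorer:{config['name']}"


def _create_multidim(config: Dict[str, Any]) -> str:
    # Placeholder for the MDIM-specific creation logic.
    return f"multidim:{config['name']}"


# Maps a step type to the function that knows how to build that collection.
_CREATORS: Dict[str, Callable[[Dict[str, Any]], str]] = {
    "explorers": _create_explorer,
    "multidim": _create_multidim,
}


class PathFinderSketch:
    """Hypothetical stand-in for helpers.PathFinder, reduced to the dispatch."""

    def __init__(self, step_type: str) -> None:
        self.step_type = step_type

    def create_collection(self, config: Dict[str, Any]) -> str:
        # Pick the creator from the step type so the step code needs no
        # explicit import from the explorer or multidim modules.
        try:
            creator = _CREATORS[self.step_type]
        except KeyError:
            raise ValueError(f"unsupported step type: {self.step_type!r}") from None
        return creator(config)
```

With this shape, a step would only call paths.create_collection(config), and supporting a new collection type would mean adding one entry to the mapping.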

pabloarosado avatar Mar 04 '25 14:03 pabloarosado

Recent work in this space:

  • https://github.com/owid/etl/pull/4106
  • https://github.com/owid/etl/pull/4095

lucasrodes avatar Mar 12 '25 11:03 lucasrodes

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar May 11 '25 11:05 stale[bot]