MDIM/Explorer config harmonization
Currently, we have tooling for MDIMs and ETL-export Explorers in etl.multidim and etl.explorer. These two modules could share some logic, but the current structure does not make that easy.
Generally, this tooling is used for handling config files, which ideally should be very close to one another.
Proposal
- New module: etl.collections. In it, we have:
  - etl.collections.base (preliminary name): contains common logic.
  - etl.collections.utils (preliminary name): contains smaller util functions.
  - etl.collections.multidim: MDIM-specific logic.
  - etl.collections.explorer: Explorer-specific logic.
  - etc.
Others:
- We need classes for the config of MDIMs and Explorers (wrappers around the plain YML config).
- We need to figure out a way to load the enriched config from a step. Currently, we rely on paths.load_mdim_config, which returns a plain dictionary. This function is in etl.helpers, which can't really import from etl.collections.multidim, since the latter already imports the former.
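As a rough illustration of the first point, a config wrapper could be a small dataclass built from the plain dictionary. All names here (CollectionConfig, View, and the title, views, dimensions and indicators keys) are hypothetical and chosen only for this sketch; they are not the actual etl.collections API or the real config schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class View:
    """One view of a collection (hypothetical shape, not the real schema)."""

    dimensions: dict[str, str]
    indicators: dict[str, Any] = field(default_factory=dict)


@dataclass
class CollectionConfig:
    """Typed wrapper around the plain dict loaded from YAML (hypothetical)."""

    title: str
    views: list[View] = field(default_factory=list)

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "CollectionConfig":
        # Fail early on a missing key instead of deep inside a step.
        if "title" not in raw:
            raise ValueError("config is missing required key 'title'")
        views = [View(**v) for v in raw.get("views", [])]
        return cls(title=raw["title"], views=views)

    def to_dict(self) -> dict[str, Any]:
        # Round-trip back to a plain dict, e.g. for serialization.
        return {
            "title": self.title,
            "views": [
                {"dimensions": v.dimensions, "indicators": v.indicators}
                for v in self.views
            ],
        }


# Example input standing in for a parsed YAML file (made-up values).
raw = {
    "title": "COVID-19",
    "views": [{"dimensions": {"metric": "cases"}, "indicators": {"y": "cases"}}],
}
config = CollectionConfig.from_dict(raw)
```

A class like this could live in a module that does not import etl.helpers, which might be one way to sidestep the circular import described above.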
Related work
- https://github.com/owid/etl/pull/4030
- https://github.com/owid/etl/pull/4035
- https://github.com/owid/etl/pull/4106
- https://github.com/owid/etl/pull/4095
Partly addressing this in https://github.com/owid/etl/pull/4030
Please let us know when you plan to move files or do a big restructure. There are some open PRs for mdims and rebasing them could get nasty.
I've merged my changes from https://github.com/owid/etl/pull/4030, which put the etl.collections module in place and moved some of the logic we had in etl.multidim into it.
I'm resolving conflicts now:
- [x] #4000
Hey @lucasrodes, thanks a lot for starting this. For this specific issue, is there anything left to do, or do you want to consider it a tracking issue?
Hey Pablo! So this issue is rather high-level, so probably belongs to the "tracking realm".
I'll keep an updated list of related PRs/issues in the description.
Thanks @lucasrodes, I see some overlap with:
- https://github.com/owid/etl/issues/3969
- https://github.com/owid/etl/issues/3992
Feel free to integrate this into one of those already existing issues, to avoid too much dispersion.
@pabloarosado The description of #3969 already points readers to #3992, so I'll go ahead and close it.
On https://github.com/owid/etl/issues/3992, I think it is more general than this one, since it also considers things like: update workflow, wizard app, csv-to-etl migration of explorers, etc. This issue is instead focused on harmonizing how we handle configs of ETL-based explorers and mdims in ETL, so they almost feel the same.
Note that this issue is actually mentioned in the description of #3992 (point 2).
I've edited the description of this issue a bit to make it more explicit.
I've finished the first iteration of harmonizing the tooling for Explorers and MDIMs in https://github.com/owid/etl/pull/4035. COVID explorer and mdims now use this new logic.
Find below a more detailed description of the improvements from this work.
Model summary
I've abstracted config logic from Explorers and MDIMs, and put a model in place (very similar to what we have in owid.catalog.meta).
Diagram in etl.collections
flowchart LR
%% Define nodes
A[explorer]
B[multidim]
C[common]
D[model]
E[utils]
%% Define edges
A --> D
A --> C
A --> E
B --> D
B --> C
B --> E
C --> D
D --> E
Summary of the module structure:
- model: Encapsulates the abstraction of the data model used by both explorer and multidim.
- explorer & multidim: Provide specialized tooling (e.g., logic, user-facing features) specific to Explorer and Multidimensional capabilities.
- common: Contains shared functionality and helper logic extracted from explorer and multidim so they can both utilize it.
- utils: Includes general-purpose utilities used by any module. Should remain independent (i.e., does not import from other modules) to prevent circular dependencies.
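The dependency rule above can be checked mechanically. The sketch below copies the edges from the flowchart, reading each arrow as "imports from" (an assumption about the diagram's meaning), and looks for import cycles. The find_cycle helper is written for this illustration only and is not part of etl.

```python
# Edges copied from the flowchart; "A --> B" is read as "A imports from B".
DEPENDENCIES = {
    "explorer": {"model", "common", "utils"},
    "multidim": {"model", "common", "utils"},
    "common": {"model"},
    "model": {"utils"},
    "utils": set(),  # utils must not import from any sibling module
}


def find_cycle(graph):
    """Return one import cycle as a list of module names, or None if acyclic."""
    visiting = []  # modules on the current depth-first path
    done = set()  # modules whose dependencies are fully explored

    def visit(node):
        if node in done:
            return None
        if node in visiting:
            # Slice from the first occurrence to close the loop.
            return visiting[visiting.index(node):] + [node]
        visiting.append(node)
        for dep in sorted(graph[node]):
            cycle = visit(dep)
            if cycle:
                return cycle
        visiting.pop()
        done.add(node)
        return None

    for module in sorted(graph):
        cycle = visit(module)
        if cycle:
            return cycle
    return None


# The structure from the diagram has no cycles.
assert find_cycle(DEPENDENCIES) is None
```

If utils ever imported from explorer, for instance, find_cycle would return the offending chain of modules instead of None.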
Examples
MDIM
upsert_multidim_data_page encapsulates logic on validation, processing and upserting to DB.
https://github.com/owid/etl/blob/ff822ac73ed6b9d86bdc3a61cd4716ac880312d8/etl/steps/export/multidim/covid/latest/covid.py#L37-L45
Explorer
create_dataset encapsulates logic on validation, processing and uploading to owid-content. In the future, we should probably have an upsert_explorer method.
https://github.com/owid/etl/blob/cbaefcdaff4a7ee1c510d9740fc1f8f1073cddeb/etl/steps/export/explorers/covid/latest/covid.py#L33-L46
Future work
- We need a schema for Explorers, to validate the YAML config, like we do with MDIMs.
- The config of explorers is still specific to explorers (minor differences now); I think we should try to modify it and bring it closer to MDIMs.
- At the moment the source of truth for the "schema" of an explorer view seems to live in a TypeScript file. It should have its own JSON file!
- We need a place in the DB for explorers (connected to the point above about defining a schema).
- Test these changes by migrating some explorers to use this model.
- Add some docs as it becomes clearer.
- Wizard templating.
- Test other kinds of explorers: code/yaml combinations.
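To make the schema item concrete, here is a minimal sketch of what validating an explorer config could look like. In practice this would more likely be a JSON Schema file checked by a validator library; the hand-written EXPLORER_SCHEMA mapping, its keys, and the validate_config function below are hypothetical stand-ins that use only the standard library.

```python
from typing import Any

# Hypothetical stand-in for a real schema file: required key -> expected type.
EXPLORER_SCHEMA = {
    "title": str,
    "dimensions": list,
    "views": list,
}


def validate_config(config: dict[str, Any], schema: dict[str, type]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors = []
    for key, expected in schema.items():
        if key not in config:
            errors.append(f"missing required key: {key}")
        elif not isinstance(config[key], expected):
            errors.append(
                f"{key}: expected {expected.__name__}, got {type(config[key]).__name__}"
            )
    return errors


# A config with a wrongly typed title and no dimensions (made-up values).
problems = validate_config({"title": 1, "views": []}, EXPLORER_SCHEMA)
```

Collecting all problems in a list, rather than raising on the first one, lets a step report every issue in the YAML file at once.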
Thanks @lucasrodes, this is looking really good! One idea to make the workflow of explorers/mdims a bit more straightforward (more similar to any other ETL data step) would be to adapt helpers.PathFinder to handle the creation of the explorer/mdim. So, instead of the user having to import things from multidim or explorer (e.g. the upsert_... function), PathFinder could already know which one to use depending on the type of step. For example, we could either have a paths.create_explorer and a paths.create_multidim method, or just a common paths.create_collection. I'm not strongly opinionated about this; if you prefer to have very different kinds of code for data steps, explorers and mdims, that's ok too.
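A toy sketch of the common paths.create_collection idea, to show the dispatch only. PathFinderSketch, the two placeholder creator functions, and the "explorers" / "multidim" step-type strings are all assumptions made for this illustration; this is not how helpers.PathFinder is currently implemented.

```python
from dataclasses import dataclass
from typing import Any, Callable


def _create_explorer(config: dict[str, Any]) -> str:
    # Placeholder for the Explorer-specific logic.
    return f"explorer:{config['title']}"


def _create_multidim(config: dict[str, Any]) -> str:
    # Placeholder for the MDIM-specific logic.
    return f"multidim:{config['title']}"


# Step type -> function that knows how to build that kind of collection.
_CREATORS: dict[str, Callable[[dict[str, Any]], str]] = {
    "explorers": _create_explorer,
    "multidim": _create_multidim,
}


@dataclass
class PathFinderSketch:
    """Toy stand-in for helpers.PathFinder; it only knows its step type."""

    step_type: str

    def create_collection(self, config: dict[str, Any]) -> str:
        # Pick the creator from the step type so the step script needs no
        # explorer- or multidim-specific import.
        try:
            creator = _CREATORS[self.step_type]
        except KeyError:
            raise ValueError(f"unsupported step type: {self.step_type!r}") from None
        return creator(config)


paths = PathFinderSketch(step_type="multidim")
result = paths.create_collection({"title": "covid"})
```

With a registry like this, supporting a new kind of collection would mean adding one entry to the mapping rather than a new method on the path finder.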
We can talk a bit more about this on Thursday, during shaping. Thanks for doing this work!
Recent work in this space:
- https://github.com/owid/etl/pull/4106
- https://github.com/owid/etl/pull/4095
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.