
Tracking: roadmap for explorers and mdims

Open pabloarosado opened this issue 9 months ago • 2 comments

One-liner

Define our ETL workflow for Explorers and MDIMs while unifying tooling as much as possible.

(previous context: https://github.com/owid/etl/issues/3969)

Context: MDIM vs Explorers

We have several kinds of similar objects spread across etl and owid-content (see this spreadsheet ↗).

While we want to adopt more and more MDIM pages, we will still have explorers around. This is because both objects are, conceptually, different things:

  • MDIM: It is a data page, which, like any other data page, speaks about one specific indicator. The only difference is that, in the MDIM case, the indicator has multiple dimensions.
  • Explorer: Can host multiple indicators with different meanings.

Therefore, we need to improve the data workflow to support both products.

Goals

1. MDIMs and Explorers should come from ETL

Given the context explained above, and after various discussions, we agree that we should move towards having both explorers and MDIMs be ETL-based (export://explorers/ and export://multidim/, respectively).

NOTE: Ideally, the explorer config should live in a table in DB (similar to the multi_dim_data_pages table) instead of a tsv file in owid-content (but this is a separate issue).

  • Migrate explorers (one-off)
    • All explorers that live in owid-content should be generated automatically from ETL export://explorers steps.
      • https://github.com/owid/etl/pull/4071
    • All CSV-based explorers should be converted into indicator-based explorers.
      • https://github.com/owid/owid-issues/issues/1850
      • https://github.com/owid/etl/issues/4072
  • Explorers as MDIMs?
    • Some explorers may be converted into multidim pages when appropriate.
    • Are there specific explorers that would be low-hanging fruit to convert into MDIMs?

2. Standardize the tooling used in explorers and MDIMs

These two objects are very similar, and ideally, they should rely on standard tooling to minimize the maintenance burden. This implies some additional transition work in the coming months.

  • Are there functions already developed for MDIM pages that could be reused in existing indicator-based explorers?
    • https://github.com/owid/etl/issues/4032
    • https://github.com/owid/etl/pull/4035

3. Create a pleasant workflow experience for data scientists

  • Wizard: We should have a nice workflow in Wizard, where data scientists can easily create export steps (only explorers/mdims) from a generic template. Just as we can easily create data and snapshot steps from Wizard, we should be able to do the same for MDIMs and explorers.
    • #3980
  • Schema/docs: We should validate the schemas used for MDIMs and Explorers. One idea is to structure our config information in Python data classes. These same classes should also power our docs (as with data and snapshots).
    • #3976
    • #3979
    • #3977
  • Dimensions:
    • https://github.com/owid/etl/issues/4007
    • #4107
  • Update workflow
    • #3956
  • Other issues
    • #3981
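
To make the data-class idea under "Schema/docs" more concrete, here is a minimal sketch of what validating a config with Python data classes could look like. All class and field names (`Choice`, `Dimension`, `View`, `Config`) are hypothetical and chosen for illustration; this is not the actual ETL schema.

```python
from dataclasses import dataclass, field


@dataclass
class Choice:
    slug: str
    name: str


@dataclass
class Dimension:
    slug: str
    name: str
    choices: list[Choice]

    def __post_init__(self) -> None:
        # Reject duplicated choice slugs as soon as the object is built.
        slugs = [c.slug for c in self.choices]
        if len(slugs) != len(set(slugs)):
            raise ValueError(f"Duplicate choice slugs in dimension '{self.slug}'")


@dataclass
class View:
    dimensions: dict[str, str]  # dimension slug -> choice slug
    indicators: list[str]  # hypothetical indicator references


@dataclass
class Config:
    title: str
    dimensions: list[Dimension]
    views: list[View] = field(default_factory=list)

    def validate(self) -> None:
        # Every view must set exactly the declared dimensions,
        # and each value must be one of the declared choices.
        allowed = {d.slug: {c.slug for c in d.choices} for d in self.dimensions}
        for view in self.views:
            if set(view.dimensions) != set(allowed):
                raise ValueError(f"View {view.dimensions} does not match declared dimensions")
            for dim, choice in view.dimensions.items():
                if choice not in allowed[dim]:
                    raise ValueError(f"Unknown choice '{choice}' for dimension '{dim}'")
```

The same classes could, in principle, be introspected to generate schema docs, which is the second half of the idea above.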

pabloarosado, Feb 17 '25 15:02

I spent a good chunk of time browsing various explorers, and whoa... this isn't going to be easy. It feels like every explorer is unique, and there's no obvious way to have a single approach for everything. The only thing I can confidently say is that CSV-based explorers are bad (though that alone doesn’t justify spending time migrating them).

I'm still wrapping my head around everything, so take the following notes with a grain of salt.

1. MDIMs and Explorers Should Come from ETL

The main question is whether we'd allow editing explorers from Admin or not. If yes, we'd need either some kind of "override" in the Admin layer (either in owid-content or the DB) or a way to write changes back to ETL. (Remember that we did this for indicator metadata, and it's used very rarely.)

Explorers with many combinations, like minerals, are well suited for ETL, but more bespoke explorers, like migration, are much more complex. Then again, some people prefer YAML, while others prefer Python, and it's unclear whether we should enforce a single approach.
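
As a rough illustration of why combination-heavy explorers fit ETL well: when views are just the cross product of a few dimensions, they can be generated rather than written by hand. The sketch below assumes hypothetical dimension names and choices (`mineral`, `metric`); the real minerals explorer may be structured differently.

```python
from itertools import product


def expand_views(dimensions: dict[str, list[str]]) -> list[dict[str, str]]:
    """Return one view per combination of dimension choices."""
    names = list(dimensions)
    return [
        dict(zip(names, combo))
        for combo in product(*(dimensions[name] for name in names))
    ]


# Hypothetical dimensions, for illustration only.
dims = {
    "mineral": ["copper", "lithium"],
    "metric": ["production", "reserves"],
}
views = expand_views(dims)
```

A bespoke explorer, where each view needs its own indicators or display tweaks, does not reduce to a cross product like this, which is where the extra complexity comes from.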

2. Standardize the Tooling Used in Explorers and MDIMs

@lucasrodes has already done this with the COVID explorer and COVID MDIM. The explorer YAML representation is really close to MDIMs. I can imagine generating a similar config file that could power both MDIMs and (indicator-based) explorers. If we can make it work for COVID, where we’re already pretty close, then it should be doable for anything. But does this grand unification bring enough value?
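
A minimal sketch of the "one config, two outputs" idea, under heavy assumptions: the view structure, the field names, and both output shapes below are made up for illustration and do not reflect the actual MDIM or explorer formats.

```python
import csv
import io

# Hypothetical shared config: one entry per view.
SHARED_VIEWS = [
    {"dimensions": {"metric": "cases", "interval": "weekly"},
     "indicator": "example/path#weekly_cases"},
    {"dimensions": {"metric": "deaths", "interval": "weekly"},
     "indicator": "example/path#weekly_deaths"},
]


def to_mdim_config(title: str, views: list[dict]) -> dict:
    """Render the shared views as a nested, MDIM-like dictionary (illustrative shape)."""
    return {
        "title": title,
        "views": [
            {"dimensions": v["dimensions"], "indicators": {"y": v["indicator"]}}
            for v in views
        ],
    }


def to_explorer_grid(views: list[dict]) -> str:
    """Render the shared views as a flat, tab-separated grid (illustrative shape)."""
    dim_names = list(views[0]["dimensions"])
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow([*dim_names, "indicator"])
    for v in views:
        writer.writerow([*(v["dimensions"][d] for d in dim_names), v["indicator"]])
    return buf.getvalue()


mdim = to_mdim_config("Example", SHARED_VIEWS)
grid = to_explorer_grid(SHARED_VIEWS)
```

The point is only that both renderers read the same source of truth; whether maintaining two renderers is worth it is exactly the open question here.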

I guess we need a couple more MDIMs to better decide where to put our energy.

Appendix

Some explorers I found interesting:

  • Water and Sanitation – CSV-based explorer, could be worth migrating to indicator-based.
  • Monkeypox – CSV-based explorer, more bespoke. Could it be migrated to ETL, and would it be worth it?

Marigold, Feb 20 '25 09:02

Thanks for the summary, @Marigold! You touch on very valid points.

Just to disclose my bias up front, my dream is to migrate all explorers and have a standardized way of doing things in the MDIM/explorer space, as we have for data steps.

My take is that this might not provide much value in the short term, but it will in the long term. I'm especially concerned with the update flow, where I think we should assume that everything is ETL-powered. So I don't think this is super urgent, but it is a goal that would be great to reach in, say, 1-2 years' time.

In general, I think that deprecating CSV-based (and chart-based) explorers will help us maintain our infrastructure in the long run. When developing tools, it's annoying to have to account for all these edge cases that do not come from ETL.

1. MDIMs and Explorers Should Come from ETL

I think we should probably create an issue with all explorers and rank them somehow by type or complexity. Also, whenever attempting to "migrate" one, we should advertise it to avoid conflicts with other edits.

One risk here is that the data scientist in charge of an explorer might be used to their current pipeline, so we should make sure that the new indicator-based explorer is easy to understand and comes with appropriate tooling. I think it could make sense to do this after agreeing on some templating (as in MDIMs) in point 2 below.

2. Standardize the Tooling Used in Explorers and MDIMs

I am happy to look at the COVID explorer again and see how the MDIM tooling/approach can be applied there.

I think we may need some engineering work here to add some of the features that we now have on MDIMs (being able to reference indicators by catalogPath, display settings per view, etc.). Basically, it'd be nice to improve the explorer config API on the engineering side and align it with MDIMs a bit.

lucasrodes, Feb 20 '25 09:02

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot], Jul 18 '25 09:07