owid-grapher Use data values from parquet when available

Background

We now generate Parquet files from the ETL project, for both ETL-generated and backported data. Parquet is a columnar format that is well suited to reading large amounts of data quickly. The data is currently published in two ways: first, as a machine-readable remote catalog at https://catalog.ourworldindata.org/, and second, in the data-catalog repo, which can be cloned locally using Git LFS.

Overall goal

We would like to reduce or remove our reliance on the data_values table in MySQL, which we believe will make us more nimble in data management and in how we use the ETL to generate secondary data products.

Scope

This cycle

[ ] Update live-preview for grapher and explorers to read from Parquet files when available
[ ] #1685
[ ] Allow the use of a local set of Parquet files or a remote catalog
[ ] Ensure bake times remain acceptable
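
To illustrate the "local set of Parquet files or a remote catalog" item, here is a minimal sketch of how a reader might pick its source from environment settings. The variable names `CATALOG_LOCAL_DIR` and `CATALOG_REMOTE_URL`, the function name, and the `<catalogPath>.parquet` layout are all assumptions made for illustration; they are not taken from the grapher or ETL code.

```python
import os
from pathlib import Path

# Public catalog URL mentioned in the Background section.
DEFAULT_CATALOG_URL = "https://catalog.ourworldindata.org/"


def resolve_parquet_location(catalog_path: str, env=None) -> str:
    """Return a local file path or a remote URL for a table's Parquet file.

    Assumption: a table lives at `<catalog_path>.parquet` under the catalog
    root. The environment variable names below are hypothetical.
    """
    env = os.environ if env is None else env
    relative = catalog_path.strip("/") + ".parquet"

    # Prefer a local checkout (e.g. a Git LFS clone) when one is configured.
    local_dir = env.get("CATALOG_LOCAL_DIR")
    if local_dir:
        return str(Path(local_dir) / relative)

    # Otherwise fall back to the remote catalog.
    base_url = env.get("CATALOG_REMOTE_URL", DEFAULT_CATALOG_URL)
    return base_url.rstrip("/") + "/" + relative
```

With neither variable set, the function falls back to the public catalog URL, so a default setup would need no extra configuration.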

Optional extras

[x] Lazily cache the remote catalog to the local disk, to speed up subsequent bakes (if needed)
[x] #1697
[ ] Use the same catalogPath in backporting no matter what metadata changes have been made to a dataset (for discussion)
[ ] Remove a variable's data_values once it has been successfully backported (for discussion)
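
The lazy caching idea in the first item above can be sketched as follows: download a file only the first time it is requested, then serve it from disk on later bakes. This is an illustrative sketch and not the implementation in #1697; the function name and the injected `fetch` callable are hypothetical.

```python
from pathlib import Path
from typing import Callable


def cached_parquet_path(
    relative_path: str,
    cache_dir: Path,
    fetch: Callable[[str], bytes],
) -> Path:
    """Return a local path for `relative_path`, fetching it only on a cache miss.

    `fetch` is a hypothetical callable that downloads the file's bytes from
    the remote catalog; it is injected so the cache logic stays testable.
    """
    target = cache_dir / relative_path
    if target.exists():
        # Cache hit: no network access needed.
        return target

    target.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file at the final path.
    tmp = target.with_name(target.name + ".part")
    tmp.write_bytes(fetch(relative_path))
    tmp.replace(target)
    return target
```

Note that this sketch never invalidates cached files, so it only suits catalog paths whose contents do not change once published; otherwise a checksum or version check would be needed.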

Not this cycle

...

Open questions

Should the ETL bake static JSON files eagerly, or should they remain generated by the baker?

Sep 20 '22 12:09 larsyencken

The main PR that would close this out is still blocked by the Ubuntu upgrade on live, so here are some updates in the meantime.

Update live-preview for grapher and explorers to read from Parquet files when available

@larsyencken by explorers, did you mean reading Parquet files client-side, or something else? live-preview works fine.

Allow the use of a local set of Parquet files or a remote catalog

Possible through .env

Ensure bake times remain acceptable

Since only a handful of variables use Parquet, bake times haven't changed. (They might get slightly longer in the future, after we migrate everything to Parquet.)

Remove a variable's data_values once it has been successfully backported (for discussion)

The PR with that and backporting is ready, but I wouldn't rush it. First we remove ETL datasets from data_values, and then we discuss whether removing all backported datasets would be helpful.

Oct 13 '22 12:10 Marigold

It looks like we'll create data.json files in ETL instead of constructing them on demand from parquet files - see https://github.com/owid/etl/issues/759

Jan 10 '23 16:01 danyx23
