owid-grapher

Use data values from parquet when available

Open larsyencken opened this issue 2 years ago • 1 comments

Background

We now generate Parquet files from the ETL project, for both ETL-generated and backported data. Parquet is a columnar format that is well suited to reading large amounts of data quickly. The data is currently published in two ways: as a machine-readable remote catalog at https://catalog.ourworldindata.org/, and in the data-catalog repo, which can be cloned locally using Git LFS.

Overall goal

We would like to reduce or remove our reliance on the data_values table in MySQL, which we believe will make us more nimble in data management and in how we use the ETL to generate secondary data products.

Scope

This cycle

  • [ ] Update live-preview for grapher and explorers to read from Parquet files when available
  • [ ] #1685
  • [ ] Allow the use of a local set of Parquet files or a remote catalog
  • [ ] Ensure bake times remain acceptable
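The "when available" behaviour in the first item could be sketched roughly as below. This is only an illustration in Python, not the grapher's actual code: `load_data_values`, `read_parquet` and `read_mysql` are hypothetical names, and the row shape is an assumption.

```python
from typing import Callable, List, Optional, Tuple

# Assumed row shape for illustration: (entity_id, year, value).
Row = Tuple[int, int, float]


def load_data_values(
    variable_id: int,
    read_parquet: Callable[[int], Optional[List[Row]]],
    read_mysql: Callable[[int], List[Row]],
) -> Tuple[str, List[Row]]:
    """Prefer Parquet for a variable, falling back to the data_values table.

    `read_parquet` is expected to return None when no Parquet file exists
    for the variable; both readers are hypothetical and injected by the caller.
    """
    rows = read_parquet(variable_id)
    if rows is not None:
        return "parquet", rows
    # No Parquet file for this variable yet: use the existing MySQL path.
    return "mysql", read_mysql(variable_id)
```

Returning the source alongside the rows makes it easy to log how many variables still depend on data_values during a bake.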

Optional extras

  • [x] Lazily cache the remote catalog to the local disk, to speed up subsequent bakes (if needed)
  • [x] #1697
  • [ ] Use the same catalogPath in backporting no matter what metadata changes have been made to a dataset (for discussion)
  • [ ] Remove a variable's data_values once it has been successfully backported (for discussion)
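For the lazy caching item, one possible shape is sketched below. It assumes each catalog file can be addressed by a relative path and that the caller supplies a `download` function; `cached_file` and its parameters are hypothetical names, not part of any existing codebase.

```python
from pathlib import Path
from typing import Callable


def cached_file(
    rel_path: str,
    cache_dir: Path,
    download: Callable[[str], bytes],
) -> Path:
    """Return a local copy of a remote catalog file, downloading it at most once.

    `download` is a caller-supplied function (hypothetical) that takes the
    relative path and returns the file's bytes.
    """
    target = cache_dir / rel_path
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temporary name first so an interrupted download
        # never leaves a half-written file at the final location.
        tmp = target.with_name(target.name + ".tmp")
        tmp.write_bytes(download(rel_path))
        tmp.replace(target)
    return target
```

This sketch never invalidates the cache, so it only suits files whose contents do not change at a given path; anything else would need a checksum or timestamp check.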

Not this cycle

  • ...

Open questions

  • Should the ETL bake static JSON files eagerly, or should they remain generated by the baker?

larsyencken, Sep 20 '22 12:09

The main PR that would close this out is still blocked by the Ubuntu upgrade on live, so here are some updates at least.

Update live-preview for grapher and explorers to read from Parquet files when available

@larsyencken, by explorers did you mean reading Parquet files client-side, or something else? Live-preview works fine.

Allow the use of a local set of Parquet files or a remote catalog

Possible through .env
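As a rough illustration of that kind of switch (not the actual setting): the sketch below assumes a single variable, here called `CATALOG_PATH`, that holds either a local directory or a base URL. The variable name and the `resolve_parquet_location` helper are hypothetical.

```python
import os
from typing import Mapping, Optional

# Remote catalog URL taken from the issue description.
DEFAULT_REMOTE = "https://catalog.ourworldindata.org/"


def resolve_parquet_location(
    rel_path: str, env: Optional[Mapping[str, str]] = None
) -> str:
    """Map a catalog-relative file path to a local path or a remote URL.

    CATALOG_PATH is a hypothetical setting: a local directory such as a
    clone of the data-catalog repo, or an http(s) base URL. When it is
    unset, the public remote catalog is used.
    """
    env = os.environ if env is None else env
    base = env.get("CATALOG_PATH", DEFAULT_REMOTE)
    rel = rel_path.lstrip("/")
    if base.startswith(("http://", "https://")):
        return base.rstrip("/") + "/" + rel
    return os.path.join(base, rel)
```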

Ensure bake times remain acceptable

Since only a handful of variables use Parquet, bake times haven't changed. (They might get slightly longer in the future, after we migrate everything to Parquet.)

Remove a variable's data_values once it has been successfully backported (for discussion)

The PR with that and backporting is ready, but I wouldn't rush it. First we should remove ETL datasets from data_values, and then discuss whether removing all backported datasets would be helpful.

Marigold, Oct 13 '22 12:10

It looks like we'll create data.json files in the ETL instead of constructing them on demand from Parquet files; see https://github.com/owid/etl/issues/759
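To make the idea concrete, eagerly producing such a file from tabular rows might look something like the sketch below. The JSON layout (parallel `entities`, `years` and `values` arrays) and the row keys are assumptions for illustration; the linked issue is the place to check the real format.

```python
import json
from typing import Dict, Iterable, Union

Value = Union[int, float, str]


def rows_to_data_json(rows: Iterable[Dict[str, Value]]) -> str:
    """Serialise rows into a compact columnar JSON string.

    Each row is assumed to have "entity_id", "year" and "value" keys.
    The output layout is an assumed one, not a confirmed specification.
    """
    # Sort so the output is deterministic regardless of input order.
    ordered = sorted(rows, key=lambda r: (r["entity_id"], r["year"]))
    payload = {
        "entities": [r["entity_id"] for r in ordered],
        "years": [r["year"] for r in ordered],
        "values": [r["value"] for r in ordered],
    }
    return json.dumps(payload, separators=(",", ":"))
```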

danyx23, Jan 10 '23 16:01