owid-grapher
# Use data values from Parquet when available
## Background
We now generate Parquet files from the ETL project, for both ETL-generated and backported data. Parquet is a columnar format that's well suited to reading large amounts of data quickly. The data is currently published in two ways: first, as a machine-readable remote catalog at https://catalog.ourworldindata.org/, and second, to the data-catalog repo, which can be cloned locally using Git LFS.
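For illustration, fetching a table from the remote catalog amounts to constructing a URL under the catalog base. A minimal sketch, assuming a channel/namespace/version/dataset/table path scheme (the scheme here is an assumption for illustration, not the catalog's documented layout):

```python
# Sketch: build the URL of a table's Parquet file in the remote catalog.
# The channel/namespace/version/dataset/table path scheme below is an
# assumption, not the documented catalog layout.
CATALOG_BASE = "https://catalog.ourworldindata.org"


def table_parquet_url(
    channel: str, namespace: str, version: str, dataset: str, table: str
) -> str:
    return f"{CATALOG_BASE}/{channel}/{namespace}/{version}/{dataset}/{table}.parquet"


url = table_parquet_url("garden", "un", "2022-07-11", "un_wpp", "un_wpp")
```

A URL like this could then be handed to any Parquet reader that accepts remote paths.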
## Overall goal
We would like to reduce or remove our reliance on the `data_values` table in MySQL, which we believe will make us more nimble in data management and in how we use the ETL to generate secondary data products.
## Scope
### This cycle
- [ ] Update live-preview for grapher and explorers to read from Parquet files when available
- [ ] #1685
- [ ] Allow the use of a local set of Parquet files or a remote catalog
- [ ] Ensure bake times remain acceptable
### Optional extras
- [x] Lazily cache the remote catalog to the local disk, to speed up subsequent bakes (if needed)
- [x] #1697
- [ ] Use the same `catalogPath` in backporting no matter what metadata changes have been made to a dataset (for discussion)
- [ ] Remove a variable's `data_values` once it has been successfully backported (for discussion)
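The lazy-caching extra above could look roughly like this sketch: download a catalog file the first time it is requested, then serve every later request from local disk. The cache layout and function names are assumptions, not the shipped implementation; the `fetch` parameter is injected so the logic can be exercised without network access.

```python
import urllib.request
from pathlib import Path


def cached_fetch(url: str, cache_dir: Path, fetch=urllib.request.urlretrieve) -> Path:
    # Mirror the URL's path under cache_dir, e.g.
    # .../garden/demo/table.parquet -> cache_dir/garden/demo/table.parquet
    local = cache_dir / url.split("/", 3)[-1]
    if not local.exists():
        local.parent.mkdir(parents=True, exist_ok=True)
        fetch(url, str(local))  # download only on first access
    return local
```

Subsequent bakes then hit the disk cache instead of re-downloading the catalog.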
### Not this cycle
- ...
## Open questions
- Should the ETL bake static JSON files eagerly, or should they remain generated by the baker?
The main PR that would close this out is still blocked by the Ubuntu upgrade on live, so here are some updates in the meantime.
> Update live-preview for grapher and explorers to read from Parquet files when available
@larsyencken by explorers, did you mean reading Parquet files client-side, or something else? Live-preview works fine.
> Allow the use of a local set of Parquet files or a remote catalog
This is possible through `.env`.
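A sketch of how that toggle might be wired up (the `CATALOG_PATH` variable name is hypothetical, not the actual setting used by owid-grapher):

```python
import os

REMOTE_CATALOG = "https://catalog.ourworldindata.org"


def catalog_base() -> str:
    # If a local clone of the data-catalog repo is configured (e.g. via a
    # .env entry), read Parquet files from disk; otherwise fall back to the
    # public remote catalog. CATALOG_PATH is a hypothetical variable name.
    return os.environ.get("CATALOG_PATH") or REMOTE_CATALOG
```

Leaving the variable unset selects the remote catalog; pointing it at a local checkout selects the on-disk files.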
> Ensure bake times remain acceptable
Since only a handful of variables use Parquet, bake times haven't changed. (They might be slightly longer in the future, after we migrate everything to Parquet.)
> Remove a variable's `data_values` once it has been successfully backported (for discussion)
A PR with that and backporting is ready, but I wouldn't rush it. First we remove ETL datasets from `data_values`, and then discuss whether removing all backported datasets would be helpful.
It looks like we'll create data.json files in the ETL instead of constructing them on demand from Parquet files; see https://github.com/owid/etl/issues/759
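For context, producing such a file amounts to pivoting long-format rows into the parallel-array payload grapher's variable data endpoint serves. A minimal sketch, treating the exact key names as an assumption:

```python
import json


def to_data_json(rows: list[tuple[int, int, float]]) -> str:
    # rows are (entity_id, year, value) in long format; sort for stable
    # output, then emit parallel arrays. Key names are illustrative.
    rows = sorted(rows)
    return json.dumps({
        "entities": [e for e, _, _ in rows],
        "years": [y for _, y, _ in rows],
        "values": [v for _, _, v in rows],
    })


payload = to_data_json([(13, 2020, 1.5), (13, 2021, 1.7)])
```

Baking these eagerly in the ETL would let the baker copy files instead of querying `data_values`.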