# Update baking to read from parquet
If variables in MySQL have a non-null `catalogPath` field, fetch them with DuckDB from parquet files in the data catalog. This has been tested on real data, but there are no unit tests yet (I'm waiting for confirmation that my approach is solid). Please let me know about any style or structure issues.
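To make the approach concrete, here's a minimal sketch using the `duckdb` Node package (not the PR's actual code; the helper name, column names, and path handling are assumptions for illustration):

```typescript
import * as duckdb from "duckdb"

// One shared in-memory DuckDB instance
const db = new duckdb.Database(":memory:")

// Hypothetical helper: read one variable's values from the parquet file
// that its catalogPath points to. Column names are assumptions.
function fetchVariableFromParquet(
    parquetPath: string,
    variableShortName: string
): Promise<any[]> {
    return new Promise((resolve, reject) => {
        // read_parquet() accepts local paths, and remote https:// URLs
        // once the httpfs extension is loaded
        db.all(
            `SELECT year, entity_id, "${variableShortName}" AS value
             FROM read_parquet(?)`,
            parquetPath,
            (err, rows) => (err ? reject(err) : resolve(rows))
        )
    })
}
```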
## Issues
- [ ] The latest duckdb package needs glibc 2.29 to run, which is not available on Ubuntu 18.04
- [ ] ~~Cloudflare doesn't support range requests, so we have to make requests directly to S3 (which isn't distributed and costs us $)~~
## Performance
There are two computation-heavy parts of baking: baking variables and baking charts. Baking variables takes much less time than baking charts (please correct me if I'm wrong), and the workerpool we're using there doesn't improve it dramatically (async is fast, since fetching `data_values` from MySQL is the bottleneck). So even if variables baking gets slower, it's not a bottleneck. Only ~200 variables have a valid `catalogPath`, which is what I've used to measure performance:
- 1s - fetching values from the `data_values` table
- 1.2s - fetching from local parquet files
- 6s - fetching from remote parquet files (tested from a DigitalOcean server! local baking will be limited by bandwidth)
Using processes in workerpool is essential here: duckdb doesn't run in worker threads (we get segfaults), and it also seems to block async operations for some reason. I haven't tried duckdb-wasm, as it is apparently not recommended.
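For reference, roughly what that looks like with workerpool (the file name and exported function are made up for this sketch):

```typescript
// duckdbWorker.ts - hypothetical worker script run in a child process
import workerpool from "workerpool"
import * as duckdb from "duckdb"

const db = new duckdb.Database(":memory:")

function queryParquet(sql: string): Promise<any[]> {
    return new Promise((resolve, reject) =>
        db.all(sql, (err, rows) => (err ? reject(err) : resolve(rows)))
    )
}

// register the function so the main process can call it by name
workerpool.worker({ queryParquet })
```

```typescript
// main process: workerType "process" spawns child processes instead of
// worker threads, which is what avoids the duckdb segfaults
import workerpool from "workerpool"

const pool = workerpool.pool(__dirname + "/duckdbWorker.js", {
    workerType: "process",
})

pool.exec("queryParquet", [
    "SELECT * FROM read_parquet('catalog/some_table.parquet')",
]).then((rows) => console.log(rows.length))
```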
If downloading parquet files from the remote catalog is a bottleneck, we can clone data-catalog and read the data locally from there.
## Performance update
After adding all backported variables, baking variables from the local catalog takes ~120s (same as from MySQL) and ~1 hour from the remote catalog (bandwidth is the bottleneck; it would be faster if we fixed Cloudflare Workers). In summary, fetching data from parquet doesn't seem to be a problem.
(I've also tried running a separate server / API around duckdb in cluster mode, which was pretty fast too, but then I got workerpool working with processes.)
## Dev workflow and deploys from ETL
Since we won't be using `data_values` for storing data, what is ETL development going to look like? Right now you can run `etl --grapher` to upsert data into `data_values` and then check it out from the local admin. This will be replaced by parquet files from your local data catalog. You'll have to point `CATALOG_PATH` to your local ETL catalog, and if using docker, create a symlink in grapher and point it there.
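As a rough sketch of what that could look like (the default URL and helper are assumptions, not final config):

```typescript
// Settings sketch: CATALOG_PATH can point either at your local ETL catalog
// checkout or at the published remote catalog; duckdb's read_parquet()
// handles both. The default URL here is hypothetical.
export const CATALOG_PATH: string =
    process.env.CATALOG_PATH ?? "https://catalog.ourworldindata.org"

// e.g. "garden/owid/latest/covid/covid"
//   -> "<CATALOG_PATH>/garden/owid/latest/covid/covid.parquet"
export function catalogPathToParquetFile(catalogPath: string): string {
    return `${CATALOG_PATH}/${catalogPath}.parquet`
}
```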
We'll also need to switch from manual ETL deploys (i.e. running grapher) to automatic ones, because we need ETL to build a catalog with parquet files first. After every merge to ETL we'll republish our catalog and run `etl --grapher` to upsert everything to production (/staging).
An alternative would be to keep using a mix of `data_values` and parquet files, but that would become messy very soon.
I'm using my own script to test this out (since baking all variables takes a long time). Is there a way / any point in sharing it with others and committing it to the repo (e.g. as a "profiling script")?
Thanks a lot for the review @ikesau, I'm gonna add some tests now. I'll also update the PR description with how we want devs to use this with ETL.
@ikesau added some tests (there were none before). Do you think it is ready for approval? (I'm not gonna merge it until we upgrade Ubuntu anyway)
Just a note on Cloudflare range requests. I think you can support them via a Cloudflare Worker. The worker has to fetch the full content from the cache (i.e. the entire parquet file), then it can serve the requested range from it. https://community.cloudflare.com/t/range-requests-and-the-cache-api/263252
I was more worried about the cache size, but as long as our parquet file is below 512MB it seems it can be cached by Cloudflare: https://developers.cloudflare.com/cache/about/default-cache-behavior/#:~:text=Cloudflare%20cacheable%20file%20limits%3A,to%20request%20a%20limit%20increase.
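Something roughly like this, I think (an untested sketch of the approach from that thread; cache-key handling and multi-range requests are glossed over):

```typescript
// Cloudflare Worker sketch: serve Range requests for parquet files by
// caching the full object and slicing the requested byte range from it.
export default {
    async fetch(request: Request): Promise<Response> {
        const range = request.headers.get("Range")

        // Cache the *full* file under a range-less key
        const cacheKey = new Request(request.url)
        const cache = caches.default
        let full = await cache.match(cacheKey)
        if (!full) {
            full = await fetch(cacheKey)
            await cache.put(cacheKey, full.clone())
        }
        if (!range) return full

        // Only handle simple "bytes=start-end" ranges in this sketch
        const match = /^bytes=(\d+)-(\d+)?$/.exec(range)
        if (!match) return new Response("Bad Range", { status: 416 })

        const body = await full.arrayBuffer()
        const start = Number(match[1])
        const end = match[2] ? Number(match[2]) + 1 : body.byteLength
        return new Response(body.slice(start, end), {
            status: 206,
            headers: {
                "Content-Range": `bytes ${start}-${end - 1}/${body.byteLength}`,
                "Content-Type": "application/octet-stream",
            },
        })
    },
}
```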
Thanks for looking it up for me :). I was planning to look into service workers and range requests during cooldown.
I made a new issue for it in the engineering repo now.
Thinking about this one some more, on the operational side, it seems like live needs a local clone of data-catalog and to simply do a git pull before every deploy.
We can make a new issue for that.