# Update baking to read from parquet
If variables in MySQL have a non-null `catalogPath` field, fetch them with DuckDB from parquet files in the data catalog. This has been tested on real data, but there are no unit tests yet (I'm waiting for confirmation that my approach is solid). Please let me know about any style or structure issues.
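To make the approach concrete, here's a minimal sketch using the `duckdb` Node package (not the PR's actual code; the helper name, column names, and path handling are assumptions for illustration):

```typescript
import * as duckdb from "duckdb"

// One shared in-memory DuckDB instance
const db = new duckdb.Database(":memory:")

// Hypothetical helper: read one variable's values from the parquet file
// that its catalogPath points to. Column names are assumptions.
function fetchVariableFromParquet(
    parquetPath: string,
    variableShortName: string
): Promise<any[]> {
    return new Promise((resolve, reject) => {
        // read_parquet() accepts local paths, and remote https:// URLs
        // once the httpfs extension is loaded
        db.all(
            `SELECT year, entity_id, "${variableShortName}" AS value
             FROM read_parquet(?)`,
            parquetPath,
            (err, rows) => (err ? reject(err) : resolve(rows))
        )
    })
}
```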
## Issues
- [ ] The latest duckdb package needs glibc 2.29 to run, which is not available on Ubuntu 18.04
- [ ] ~~Cloudflare doesn't support range requests, so we have to make requests directly to S3 (which isn't distributed and costs us $)~~
## Performance
There are two computation-heavy parts of baking: baking variables and baking charts. Baking variables takes much less time than baking charts (please correct me if I'm wrong), and the workerpool we're using there doesn't improve it dramatically (async is fast, since fetching `data_values` from MySQL is the bottleneck). So even if variables baking gets slower, it's not a bottleneck. Only ~200 variables have a valid `catalogPath`, which is what I've used to measure performance:
- 1s - fetching values from the `data_values` table
- 1.2s - fetching from local parquet files
- 6s - fetching from remote parquet files (tested from a DigitalOcean server! local baking will be limited by bandwidth)
Using processes in workerpool is essential here: duckdb doesn't run in worker threads (we get segfaults), and it also seems to block async operations for some reason. I haven't tried duckdb-wasm, as it is apparently not recommended.
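For reference, roughly what that looks like with workerpool (the file name and exported function are made up for this sketch):

```typescript
// duckdbWorker.ts - hypothetical worker script run in a child process
import workerpool from "workerpool"
import * as duckdb from "duckdb"

const db = new duckdb.Database(":memory:")

function queryParquet(sql: string): Promise<any[]> {
    return new Promise((resolve, reject) =>
        db.all(sql, (err, rows) => (err ? reject(err) : resolve(rows)))
    )
}

// register the function so the main process can call it by name
workerpool.worker({ queryParquet })
```

```typescript
// main process: workerType "process" spawns child processes instead of
// worker threads, which is what avoids the duckdb segfaults
import workerpool from "workerpool"

const pool = workerpool.pool(__dirname + "/duckdbWorker.js", {
    workerType: "process",
})

pool.exec("queryParquet", [
    "SELECT * FROM read_parquet('catalog/some_table.parquet')",
]).then((rows) => console.log(rows.length))
```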
If downloading parquet files from the remote catalog is a bottleneck, we can clone data-catalog and read the data locally from there.
## Performance update
After adding all backported variables, baking variables from the local catalog takes ~120s (same as from MySQL) and ~1 hour from the remote catalog (bandwidth is the bottleneck; it would be faster if we fixed Cloudflare Workers). In summary, fetching data from parquet doesn't seem to be a problem.
(I've also tried running a separate server / API around duckdb in cluster mode, which was pretty fast too, but then I got workerpool working with processes.)
## Dev workflow and deploys from ETL
Since we won't be using `data_values` for storing data, what is ETL development going to look like? Right now you can run `etl --grapher` to upsert data into `data_values` and then check it out from the local admin. This will be replaced by parquet files from your local data catalog. You'll have to point `CATALOG_PATH` to your local ETL catalog, and if using docker, create a symlink in grapher and point it there.
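As a rough sketch of what that could look like (the default URL and helper are assumptions, not final config):

```typescript
// Settings sketch: CATALOG_PATH can point either at your local ETL catalog
// checkout or at the published remote catalog; duckdb's read_parquet()
// handles both. The default URL here is hypothetical.
export const CATALOG_PATH: string =
    process.env.CATALOG_PATH ?? "https://catalog.ourworldindata.org"

// e.g. "garden/owid/latest/covid/covid"
//   -> "<CATALOG_PATH>/garden/owid/latest/covid/covid.parquet"
export function catalogPathToParquetFile(catalogPath: string): string {
    return `${CATALOG_PATH}/${catalogPath}.parquet`
}
```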
We'll also need to switch from manual ETL deploys (i.e. running grapher) to automatic ones, because we need ETL to build a catalog with parquet files first. After every merge to ETL we'll republish our catalog and run `etl --grapher` to upsert everything to production (/staging).
An alternative would be to keep using a mix of `data_values` and parquet files, but that would become messy very soon.
I'm using my own script to test this out (since baking all variables takes a long time). Is there a way / any point in sharing it with others and committing it to the repo (e.g. as a "profiling script")?
Thanks a lot for the review @ikesau, I'm gonna add some tests now. I'll also update the PR description with how we want devs to use this with ETL.
@ikesau added some tests (there were none before). Do you think it is ready for approval? (I'm not gonna merge it until we upgrade Ubuntu anyway)
Just a note on Cloudflare range requests. I think you can support them via a Cloudflare Worker. The worker has to fetch the full content from the cache (i.e. the entire parquet file), then it can serve the requested range from it. https://community.cloudflare.com/t/range-requests-and-the-cache-api/263252
I was more worried about the cache size, but as long as our parquet file is below 512MB it seems it can be cached by Cloudflare: https://developers.cloudflare.com/cache/about/default-cache-behavior/#:~:text=Cloudflare%20cacheable%20file%20limits%3A,to%20request%20a%20limit%20increase.
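Something roughly like this, I think (an untested sketch of the approach from that thread; cache-key handling and multi-range requests are glossed over):

```typescript
// Cloudflare Worker sketch: serve Range requests for parquet files by
// caching the full object and slicing the requested byte range from it.
export default {
    async fetch(request: Request): Promise<Response> {
        const range = request.headers.get("Range")

        // Cache the *full* file under a range-less key
        const cacheKey = new Request(request.url)
        const cache = caches.default
        let full = await cache.match(cacheKey)
        if (!full) {
            full = await fetch(cacheKey)
            await cache.put(cacheKey, full.clone())
        }
        if (!range) return full

        // Only handle simple "bytes=start-end" ranges in this sketch
        const match = /^bytes=(\d+)-(\d+)?$/.exec(range)
        if (!match) return new Response("Bad Range", { status: 416 })

        const body = await full.arrayBuffer()
        const start = Number(match[1])
        const end = match[2] ? Number(match[2]) + 1 : body.byteLength
        return new Response(body.slice(start, end), {
            status: 206,
            headers: {
                "Content-Range": `bytes ${start}-${end - 1}/${body.byteLength}`,
                "Content-Type": "application/octet-stream",
            },
        })
    },
}
```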
Thanks for looking it up for me :). I was planning to look into service workers and range requests during cooldown.
I made a new issue for it in the engineering repo now.
Thinking about this one some more, on the operational side, it seems like live needs a local clone of data-catalog and to simply do a git pull before every deploy.
We can make a new issue for that.