etl
etl copied to clipboard
Upsert to database when running `--grapher` step only if metadata checksum differs
Problem
Some grapher steps can take annoyingly long, especially when on poor internet connection. For instance
etlr grapher://grapher/minerals/2024-07-15/minerals --grapher --force --only
can take anywhere between 20s and 2min depending on your connection (it has ~200 indicators, which is quite a lot). This is significantly slower than working with local MySQL.
Possible solutions
Use GRAPHER_FILTER
to only upsert indicators you care about
We already support GRAPHER_FILTER=indicator etlr ... --grapher
that only upserts indicators matching the filter. This is useful when tuning the chart for a single indicator.
Only upsert indicators if metadataChecksum
differs
This already works for uploading data & metadata JSON files to R2. If checksums match, we just skip the step. It could work similarly, but there are a couple of potential issues:
-
metadataChecksum
is computed from dict that contains IDs that we got from MySQL after inserting them (e.g. origins). We'd have to redefine it - First, we upsert metadata to MySQL and then fetch it back and create JSON file. We should simplify it and create JSON file without waiting to MySQL upserts.
- Right now, every run checks all links to variables, docs, etc. that would be skipped
- Any other side effects?
Switch to async
Grapher upserts are parallelized by threads on the upper level. This works reasonably well, but debugging sucks and it is not as parallel and efficient as possible. Switching to asyncio SQLAlchemy could help.