etl icon indicating copy to clipboard operation
etl copied to clipboard

Upsert to database when running `--grapher` step only if metadata checksum differs

Open Marigold opened this issue 6 months ago • 8 comments

Problem

Some grapher steps can take annoyingly long, especially when on poor internet connection. For instance

etlr grapher://grapher/minerals/2024-07-15/minerals --grapher --force --only

can take anywhere between 20s and 2min depending on your connection (it has ~200 indicators, which is quite a lot). This is significantly slower than working with local MySQL.

Possible solutions

Use GRAPHER_FILTER to only upsert indicators you care about

We already support GRAPHER_FILTER=indicator etlr ... --grapher that only upserts indicators matching the filter. This is useful when tuning the chart for a single indicator.

Only upsert indicators if metadataChecksum differs

This already works for uploading data & metadata JSON files to R2. If checksums match, we just skip the step. It could work similarly, but there are a couple of potential issues:

  • metadataChecksum is computed from dict that contains IDs that we got from MySQL after inserting them (e.g. origins). We'd have to redefine it
  • First, we upsert metadata to MySQL and then fetch it back and create JSON file. We should simplify it and create JSON file without waiting to MySQL upserts.
  • Right now, every run checks all links to variables, docs, etc. that would be skipped
  • Any other side effects?

Switch to async

Grapher upserts are parallelized by threads on the upper level. This works reasonably well, but debugging sucks and it is not as parallel and efficient as possible. Switching to asyncio SQLAlchemy could help.

Marigold avatar Aug 26 '24 08:08 Marigold