:hammer: engineering: refactor upserts to MySQL

Open Marigold opened this issue 5 months ago • 2 comments

Refactor upserts to MySQL from the grapher:// step. This lets us compare data and metadata checksums per indicator and skip upserts to the variables table in MySQL when the metadata hasn't changed (previously we only skipped uploading to R2). This should speed up cases where you only work on a single indicator.

If the checksums differ, this won't be any faster than our current solution. I'll try to tackle that in the next PR, where I switch from threads to asyncio.

Changes

  • Data & metadata checksums are now computed from the dataframe and the VariableMeta dataclass instead of from the final JSON that we upload to R2.
  • Use hash_any to calculate checksums instead of the md5 we usually use.
  • Minimize set_index and reset_index operations.
  • Start using catalogPath more.
  • Deprecate old code.
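
To illustrate the idea, here is a minimal sketch of the per-indicator checksum comparison. It is not the code from this PR: `VariableMetaSketch`, `checksum`, `plan_upsert`, the `stored` dict, and the example catalog path are all hypothetical stand-ins, and `checksum` uses `hashlib.blake2b` over JSON rather than the real `hash_any`.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass
class VariableMetaSketch:
    # Hypothetical stand-in for the VariableMeta dataclass.
    title: str
    unit: str = ""
    description: str = ""


def checksum(obj) -> str:
    # Hypothetical stand-in for hash_any: hash any JSON-serializable object.
    payload = json.dumps(obj, sort_keys=True, default=str).encode("utf-8")
    return hashlib.blake2b(payload, digest_size=16).hexdigest()


def plan_upsert(catalog_path: str, values: list, meta: VariableMetaSketch, stored: dict) -> set:
    """Return the parts ("data", "metadata") whose checksum changed.

    `stored` maps catalog path -> previously saved checksums and plays the
    role of the checksums kept alongside each indicator (an assumption).
    """
    new = {"data": checksum(values), "metadata": checksum(asdict(meta))}
    old = stored.get(catalog_path, {})
    changed = {part for part, digest in new.items() if old.get(part) != digest}
    stored[catalog_path] = new  # remember the checksums for the next run
    return changed


stored = {}
path = "grapher/example/2024-01-01/dataset/table#gdp"  # made-up catalogPath
meta = VariableMetaSketch(title="GDP", unit="USD")

first = plan_upsert(path, [1.0, 2.0, 3.0], meta, stored)   # nothing stored yet
second = plan_upsert(path, [1.0, 2.0, 3.0], meta, stored)  # nothing changed
meta.unit = "international-$"
third = plan_upsert(path, [1.0, 2.0, 3.0], meta, stored)   # only metadata changed
```

An empty result means both the R2 upload and the MySQL upsert can be skipped for that indicator; a result containing only `"metadata"` means the data upload can still be skipped.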

Marigold, Sep 03 '24 19:09