:hammer: engineering: refactor upserts to MySQL
Refactor upserts to MySQL from the `grapher://` step. This allows us to compare checksums of data & metadata by indicator and skip upserts to the `variables` table in MySQL if metadata doesn't change (previously we only skipped uploading to R2). This should speed up cases where you only work on a single indicator.
It's not going to be any faster than our current solution if checksums differ. I'll try tackling that in the next PR where I switch from threads to asyncio.
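To make the skip logic concrete, here is a minimal sketch of the per-indicator decision, not the code from this PR. The `Checksums` class, the `plan_upsert` function and the flag names are hypothetical, and so is the assumption that the previously stored checksums are available for comparison.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Checksums:
    """Hypothetical container for one indicator's checksums."""
    data: str
    metadata: str


def plan_upsert(new: Checksums, stored: Optional[Checksums]) -> Dict[str, bool]:
    """Decide which writes are needed for a single indicator."""
    if stored is None:
        # Indicator was never uploaded before: write everything.
        return {"upload_data": True, "upload_metadata": True, "upsert_variable": True}
    data_changed = new.data != stored.data
    metadata_changed = new.metadata != stored.metadata
    return {
        "upload_data": data_changed,          # R2 upload of data
        "upload_metadata": metadata_changed,  # R2 upload of metadata
        "upsert_variable": metadata_changed,  # MySQL upsert, skipped if metadata is unchanged
    }
```

When both checksums match, every flag is `False` and the indicator is skipped entirely. When they differ, the same work as before is done, which is why no speedup is expected in that case.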
Changes
- Data & metadata checksums are now computed from the dataframe and the `VariableMeta` dataclass instead of the final JSON that we upload to R2.
- Use `hash_any` to calculate checksums instead of the `md5` that we usually use.
- Minimize `set_index` and `reset_index` operations.
- Start using `catalogPath` more.
- Deprecate old code.
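As a rough illustration of hashing a metadata dataclass directly, rather than the JSON rendered for upload, the sketch below walks the object's fields and feeds them into a hasher. It is not the real `hash_any`: `checksum_any` and `IndicatorMeta` are hypothetical stand-ins, and the choice of SHA-256 is an assumption made only for this example.

```python
import hashlib
from dataclasses import asdict, dataclass, is_dataclass
from typing import Any, Optional


@dataclass
class IndicatorMeta:
    """Hypothetical stand-in for a metadata dataclass such as VariableMeta."""
    title: str
    unit: Optional[str] = None
    description: Optional[str] = None


def _feed(hasher: Any, obj: Any) -> None:
    """Recursively feed a plain Python object into the hasher."""
    if is_dataclass(obj) and not isinstance(obj, type):
        _feed(hasher, asdict(obj))  # dataclass instance -> dict of its fields
    elif isinstance(obj, dict):
        hasher.update(b"dict:")
        for key in sorted(obj):  # sort keys so insertion order does not matter
            _feed(hasher, key)
            _feed(hasher, obj[key])
    elif isinstance(obj, (list, tuple)):
        hasher.update(b"seq:")
        for item in obj:
            _feed(hasher, item)
    else:
        # Include the type name so that 1 and "1" hash differently.
        hasher.update(f"{type(obj).__name__}:{obj!r};".encode("utf-8"))


def checksum_any(obj: Any) -> str:
    """Hypothetical helper: deterministic checksum of a plain Python object."""
    hasher = hashlib.sha256()
    _feed(hasher, obj)
    return hasher.hexdigest()
```

Two equal metadata objects produce the same checksum, and changing any field changes it, so the comparison can happen before any JSON is built. Hashing an actual dataframe would need a dataframe-aware step, which is left out here.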