Partition the goldsky dagster assets

Open ravenac95 opened this issue 1 year ago • 1 comment

What is it?

Goldsky assets are currently loaded monolithically from raw files into temporary tables, then merged into the final merged form (see below).

If we partition hourly or daily, so that each set of files is continuously tracked as its own asset, we can retry asset materializations more reliably and also track which files to clean (once we determine a partition is safe to clean). From there, we can consistently attempt to re-merge the asset files into the fully merged form if there are failures. This can be done either by making the last two boxes into their own assets or by combining them into a single asset. Once the merged asset has completed, the files from the first asset can be safely deleted from our GCS bucket, and we can clean up the partitions in Dagster as well.
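As a rough illustration of the bookkeeping described above, here is a minimal sketch in plain Python. It is not the Dagster API and not the project's code: `PartitionTracker`, `hourly_partition_key`, the key format, and the `merge_fn` callback are all hypothetical names chosen for this example. It only models the idea that each hourly set of files is tracked on its own, that a failed merge leaves the partition pending for retry, and that files are released for deletion only after a successful merge.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


def hourly_partition_key(ts: datetime) -> str:
    """Map a timezone-aware timestamp to an hourly key (format is an assumption)."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%d-%H:00")


@dataclass
class PartitionTracker:
    """Hypothetical tracker: which raw files belong to which hourly partition."""

    # partition key -> raw file paths that landed in that hour
    files: dict[str, list[str]] = field(default_factory=dict)
    # partition keys whose files were merged successfully
    merged: set[str] = field(default_factory=set)

    def add_file(self, path: str, ts: datetime) -> str:
        key = hourly_partition_key(ts)
        self.files.setdefault(key, []).append(path)
        # A new file means the partition must be merged again.
        self.merged.discard(key)
        return key

    def pending(self) -> list[str]:
        """Partitions that still need a (re)merge, oldest first."""
        return sorted(k for k in self.files if k not in self.merged)

    def merge(self, key: str, merge_fn: Callable[[list[str]], None]) -> bool:
        """Try to merge one partition; on failure it stays pending for retry."""
        try:
            merge_fn(self.files[key])
        except Exception:
            return False
        self.merged.add(key)
        return True

    def clean(self, key: str) -> list[str]:
        """Return the files that are now safe to delete; refuse if not merged."""
        if key not in self.merged:
            raise ValueError(f"partition {key} has not been merged yet")
        self.merged.discard(key)
        return self.files.pop(key)
```

In a real Dagster setup, the partition keys and materialization state would presumably come from Dagster's own partitioned assets rather than from a hand-rolled tracker like this one; the sketch only shows why per-partition tracking makes retries and cleanup easier to reason about.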

ravenac95, Jun 18 '24 17:06

According to @ravenac95, this is a nice-to-have. It is not necessary until we start experiencing bugs in how we do pointer management in BigQuery. The change would add partition keys (one for every hour) that Dagster could understand.

The conditions where this could be useful:

  1. If we find a lot of missing data or accidentally delete data: partitions would give exact tracking of which files to reload.

ryscheng, Jul 02 '24 17:07