
Make pipeline incremental

Open davidgasquez opened this issue 1 year ago • 3 comments

The main idea is to rely on the latest portal data and run smaller incremental updates on CI. We should provide a --full-refresh flag, à la dbt, to rebuild the data from scratch.
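A minimal sketch of how such a flag could be wired up, assuming a Python entry point built on argparse. The `parse_args` and `choose_mode` names are hypothetical and not part of the portal's code; only the `--full-refresh` flag name comes from the proposal above.

```python
import argparse


def parse_args(argv=None):
    """Parse the (hypothetical) pipeline CLI arguments."""
    parser = argparse.ArgumentParser(description="Run the data pipeline.")
    parser.add_argument(
        "--full-refresh",
        action="store_true",
        help="Ignore previously published data and rebuild everything.",
    )
    return parser.parse_args(argv)


def choose_mode(full_refresh, previous_data_available):
    """Return 'full' or 'incremental' for this run.

    Falls back to a full build when there is no previous data to build on.
    """
    if full_refresh or not previous_data_available:
        return "full"
    return "incremental"


if __name__ == "__main__":
    args = parse_args()
    # Assumption: a real run would check whether the latest portal data exists.
    print(choose_mode(args.full_refresh, previous_data_available=True))
```

Falling back to a full build when no previous data is found keeps the first CI run working without any special casing.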

This is a big one!

davidgasquez avatar Jan 05 '24 18:01 davidgasquez

The ideal approach I can think of would be to rely on Dagster partitions and sensors.

  1. Read the data from IPFS (or the GitHub Actions cache!)
  2. Run Dagster sensors to check which partitions are missing.
  3. Run code for missing partitions and rematerialize datasets.
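The three steps above boil down to a set difference between expected and already-materialized partitions. The sketch below shows that logic in plain Python rather than with Dagster's partition and sensor APIs; the daily partitioning scheme and every function name here are assumptions for illustration.

```python
from datetime import date, timedelta


def daily_partition_keys(start, end):
    """Return ISO date keys for every day from start to end, inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def missing_partitions(expected, materialized):
    """Return expected keys that were not materialized, in expected order."""
    done = set(materialized)
    return [key for key in expected if key not in done]


def run_incremental(expected, materialized, compute):
    """Call compute(key) only for missing partitions; return the new results."""
    return {key: compute(key) for key in missing_partitions(expected, materialized)}
```

For example, with four expected days and two already present in the previous data, only the other two would be recomputed:

```python
keys = daily_partition_keys(date(2024, 1, 1), date(2024, 1, 4))
new = run_incremental(keys, ["2024-01-01", "2024-01-03"], compute=lambda k: f"data for {k}")
```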

Perhaps there is a much easier approach we can use while we figure out all the Dagster stuff.

davidgasquez avatar Jan 05 '24 18:01 davidgasquez

I'm thinking about relying on external assets: treat the previous run's outputs as external assets and compute the diff using sensors?

davidgasquez avatar Jan 09 '24 09:01 davidgasquez

We could also attach to the previous database and use it as the current state, then run the sensors and materialize the remaining partitions.
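A rough sketch of the attach idea, using Python's built-in sqlite3 module purely as a stand-in for whatever database engine the portal uses. The `records` table, its `id` column, and the `missing_ids` helper are all hypothetical.

```python
import sqlite3


def missing_ids(current, previous_path, upstream_ids):
    """Attach the previous database and return upstream ids it does not contain.

    Assumes the previous database has a `records` table with an `id` column.
    """
    current.execute("ATTACH DATABASE ? AS previous", (previous_path,))
    try:
        rows = current.execute("SELECT id FROM previous.records").fetchall()
    finally:
        # Detach so the previous file is released once its state has been read.
        current.execute("DETACH DATABASE previous")
    seen = {row[0] for row in rows}
    return [i for i in upstream_ids if i not in seen]
```

The returned ids are the ones the current run would still need to process; everything else could be copied over from the attached database instead of being recomputed.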

davidgasquez avatar Jan 19 '24 17:01 davidgasquez