gitcoin-grants-data-portal
Make pipeline incremental
The main idea is to rely on the latest portal data and run smaller incremental updates on CI. We should provide a `--full-refresh` flag, à la dbt, to rebuild the data from scratch.
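A minimal sketch of how such a flag could be wired up, assuming a plain Python entry point. The names `parse_args` and `starting_state` are hypothetical and not part of the portal's code; the only thing taken from the issue is the `--full-refresh` flag itself.

```python
import argparse


def parse_args(argv=None):
    """Parse CLI arguments for a hypothetical pipeline entry point."""
    parser = argparse.ArgumentParser(description="Run the data pipeline")
    parser.add_argument(
        "--full-refresh",
        action="store_true",
        help="Ignore previously published data and rebuild everything",
    )
    return parser.parse_args(argv)


def starting_state(previous_state, full_refresh):
    """Return the state the run builds on.

    With --full-refresh we start from nothing; otherwise we start from a
    copy of whatever the previous run published.
    """
    return {} if full_refresh else dict(previous_state)
```

With this shape, CI would call the entry point without the flag by default, and a full rebuild only happens when someone asks for it explicitly.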
This is a big one!
The ideal approach I can think of would be to rely on Dagster partitions and sensors.
- Read the data from IPFS (or the GitHub Actions cache!)
- Run Dagster sensors to check which partitions are missing.
- Run code for missing partitions and rematerialize datasets.
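The steps above can be sketched in plain Python, independent of Dagster's actual partition and sensor APIs. This assumes daily partitions keyed by ISO date, which the issue does not specify; `daily_partition_keys`, `missing_partitions`, and `run_incremental` are hypothetical helpers for illustration only.

```python
from datetime import date, timedelta


def daily_partition_keys(start, end):
    """Return ISO date keys for every day in [start, end], inclusive."""
    days = (end - start).days
    return [(start + timedelta(days=i)).isoformat() for i in range(days + 1)]


def missing_partitions(expected, materialized):
    """Return the expected keys that are not materialized yet, in order."""
    done = set(materialized)
    return [key for key in expected if key not in done]


def run_incremental(expected, state, compute):
    """Compute only the missing partitions and store them in `state`.

    `state` maps partition key -> data loaded from the previous run.
    `compute` is whatever function builds the data for a single partition.
    """
    for key in missing_partitions(expected, state):
        state[key] = compute(key)
    return state
```

In a Dagster setup, the check in `missing_partitions` is roughly the job a sensor would do, and `compute` corresponds to materializing one partition of an asset.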
Perhaps there is a much easier approach we can use while we figure out all the Dagster stuff.
I'm thinking about relying on external assets: treat the previous run's outputs as external assets and compute the diff using sensors?
We could also attach to the previous database and use it as the current state, then run the sensors and materialize the remaining partitions.
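A rough sketch of the "attach to the previous database" idea, using the standard-library `sqlite3` module purely as a stand-in since the issue does not name the database engine. The table name `materializations` and the column `partition_key` are assumptions made for illustration.

```python
import sqlite3


def previous_partition_keys(conn, previous_path, table="materializations"):
    """Attach the previous run's database and read which partitions it holds.

    `table` is a hypothetical, trusted table name (it is interpolated into
    the SQL, so it must never come from user input).
    """
    conn.execute("ATTACH DATABASE ? AS previous", (previous_path,))
    try:
        rows = conn.execute(
            f"SELECT DISTINCT partition_key FROM previous.{table}"
        ).fetchall()
    finally:
        # Detach so the previous database file is released after reading.
        conn.execute("DETACH DATABASE previous")
    return {row[0] for row in rows}
```

The returned set is the "current state"; anything expected but absent from it is what the run would still need to compute.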