pdr-backend
[Lake][Jobs][Incremental Pipeline] Use checkpoints and job metadata rather than inferring job progress through data.
Problem/motivation
I also wrote this out somewhere, but basically: we're not keeping a record of our last_run_ts. I mentioned in the past that we'll eventually need to track start/end timestamps for the GQL/ETL workflows...
|  | Run 1 | Run 2 | Run 3 |
|---|---|---|---|
| Time | 1:00 | 2:00 | 3:00 |
To initially solve this and move us forward (like the OHLCV data factory), I wrote logic that simply queries this data from our tables in order to identify the last_run_checkpoint.
To solve this in the past, I had proposed simply mutating my_ppss.yaml so that the lake st_ts is rewritten with the last_run_timestamp, and the pipeline enforces being incremental rather than trying to do it through data-inference. But this kind of breaks the pattern for how the yaml file is used (the engine modifying it).
Further, it doesn't provide a way to track the ETL/workflow runs so that they can be rolled back and operated in a systematic way.
Proposed solution (a)
Start tracking job metadata inside a jobs table in DuckDB:
- id
- job_name
- job_start_ts
- job_end_ts
- input metadata (filename, table, rows, ppss st_ts, ppss end_ts, slot, whatever)
- output metadata (filename, table, rows, ppss st_ts, ppss end_ts, slot, whatever)
and use this data to understand how to resume/rollback/operate jobs on the lake.
Proposed solution (b)
Just modify ppss.yaml's lake.st_ts when the job ends.
If the user runs a CLI command to roll back the pipeline, lake.st_ts should also be updated to reflect the state of the lake.
This is a KISS solution that lets us keep the SLA small.
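Solution (b) could be as small as the sketch below. The `lake.st_ts` key layout and the `update_lake_st_ts` helper name are assumptions for illustration; the demo writes a throwaway yaml rather than a real my_ppss.yaml.

```python
import os
import tempfile

import yaml  # PyYAML


def update_lake_st_ts(path: str, last_run_timestamp: str) -> None:
    """Rewrite lake.st_ts in a ppss-style yaml so the next run resumes from it.

    Hypothetical helper; assumes the file has a top-level `lake` mapping.
    """
    with open(path) as f:
        cfg = yaml.safe_load(f)
    cfg["lake"]["st_ts"] = last_run_timestamp
    with open(path, "w") as f:
        yaml.safe_dump(cfg, f)


# demo against a throwaway file standing in for my_ppss.yaml
fd, path = tempfile.mkstemp(suffix=".yaml")
with os.fdopen(fd, "w") as f:
    yaml.safe_dump({"lake": {"st_ts": "2024-01-01_00:00", "fin_ts": "now"}}, f)

update_lake_st_ts(path, "2024-01-02_00:00")
with open(path) as f:
    new_st_ts = yaml.safe_load(f)["lake"]["st_ts"]
os.remove(path)
print(new_st_ts)
```

Note the downside called out below: a naive dump like this also reorders keys and drops the user's comments, which is part of why engine-written yaml feels wrong.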
Current solution (c)
Use the min & max from pdr_predictions as the checkpoint for where the data has been written to.
All other tables should be fetching and updating from this marker.
DoD:
Save ETL & workflow metadata to operate the lake.
Tasks
- [ ] Stop using data-inference to start/resume gqldf/etl jobs
- [ ] Implement another way to manage incremental runs that is easy to operate from the CLI.
I have concerns about solution (b): editing the ppss.yaml file from the script is not expected behaviour.
Tracking this in issue #1299 and closing this one.