pdr-backend
[Lake][Jobs][Incremental Pipeline] Use checkpoints and job metadata rather than inferring job progress through data.
Problem/motivation
I also wrote this out somewhere, but basically: we're not keeping a record of our last_run_ts. I mentioned in the past that we'll eventually need to track start/end timestamps for the GQL/ETL workflows...
|  | Run 1 | Run 2 | Run 3 |
|---|---|---|---|
| Time | 1:00 | 2:00 | 3:00 |
To initially solve this and move us forward (like the OHLCV data factory), I wrote logic that simply queries this data from our tables in order to identify the last_run_checkpoint.
To solve this in the past, I had proposed simply mutating my_ppss.yaml so that the lake st_ts is rewritten with the last_run_timestamp, and the pipeline enforces being incremental rather than trying to do it through data-inference. But this kind of breaks the pattern for how the yaml file is used (the engine modifying it).
Further, it doesn't provide a way to track the ETL/workflow runs so that they can be rolled back and operated in a systematic way.
Proposed solution (a)
Start tracking job metadata inside a jobs table in DuckDB:
- id
- job_name
- job_start_ts
- job_end_ts
- input metadata (filename, table, rows, ppss st_ts, ppss end_ts, slot, whatever)
- output metadata (filename, table, rows, ppss st_ts, ppss end_ts, slot, whatever)
and use this data to understand how to resume/rollback/operate jobs on the lake.
Proposed solution (b)
Just modify ppss.yaml's lake.st_ts when the job ends.
If the user runs a CLI command to roll back the pipeline, lake.st_ts should also be updated to reflect the state of the lake.
This is a KISS solution that lets us keep the SLA small.
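Solution (b) could be as small as the sketch below. The `lake.st_ts` key layout and the `update_lake_st_ts` helper name are assumptions for illustration; the demo writes a throwaway yaml rather than a real my_ppss.yaml.

```python
import os
import tempfile

import yaml  # PyYAML


def update_lake_st_ts(path: str, last_run_timestamp: str) -> None:
    """Rewrite lake.st_ts in a ppss-style yaml so the next run resumes from it.

    Hypothetical helper; assumes the file has a top-level `lake` mapping.
    """
    with open(path) as f:
        cfg = yaml.safe_load(f)
    cfg["lake"]["st_ts"] = last_run_timestamp
    with open(path, "w") as f:
        yaml.safe_dump(cfg, f)


# demo against a throwaway file standing in for my_ppss.yaml
fd, path = tempfile.mkstemp(suffix=".yaml")
with os.fdopen(fd, "w") as f:
    yaml.safe_dump({"lake": {"st_ts": "2024-01-01_00:00", "fin_ts": "now"}}, f)

update_lake_st_ts(path, "2024-01-02_00:00")
with open(path) as f:
    new_st_ts = yaml.safe_load(f)["lake"]["st_ts"]
os.remove(path)
print(new_st_ts)
```

Note the downside called out below: a naive dump like this also reorders keys and drops the user's comments, which is part of why engine-written yaml feels wrong.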
Current solution (c)
Use the min & max from pdr_predictions as the checkpoint for where the data has been written to.
All other tables should be fetching and updating from this marker.
DoD:
Save ETL & workflow metadata to operate the lake.
Tasks
- [ ] Stop using data-inference to start/resume gqldf/etl jobs
- [ ] Implement another way to manage incremental runs that is easy to operate from the CLI.
I have concerns about solution (b): editing the ppss.yaml file from the script is not expected behaviour.
Tracking this in issue #1299 and closing this one.