[WIP] allow to fork and split pipelines
Background
split pipeline - create a pipeline with a new name and move the indicated sources/resources/tables to it. Extracted and normalized files belonging to the resources' tables are moved; schemas and state are split and moved.
fork pipeline - as above, but create a copy (or hard link) of the files; schemas and state are split and copied.
To fully implement "bad data handling" (#780) we need to be able to split a pipeline on the bad-data resource/table and load it to a separate dataset or destination.
For reverse ETL (custom destinations) we want to send some resources twice (or more) to several destinations.
A full fork may also be used as a backup (#944).
the process
A pipeline may be split after the extract or normalize steps. We should not allow splitting/forking pipelines that have partially processed packages (a guard sketch follows the list):
- all packages in the working directory must be split/forked
- split/fork is not allowed while there are still packages in the load phase (in both the "original" and the "other" pipeline)
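A minimal sketch of that guard; `has_pending_load_packages` and `SplitPipelineError` are hypothetical names, not part of today's dlt API:

```python
# hypothetical sketch: refuse to split/fork while packages are mid-load
class SplitPipelineError(Exception):
    pass

def assert_can_split(original, other) -> None:
    # packages already handed to the load phase cannot be divided
    # consistently between two pipelines, so refuse in that case
    for pipeline in (original, other):
        if pipeline.has_pending_load_packages():  # hypothetical check
            raise SplitPipelineError(
                f"pipeline {pipeline.pipeline_name} still has packages in the load phase"
            )
```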
state and schema splitting
The "original" pipeline should keep the full state and schema to be able to restore it. Tables for which resources do not exist will not be created.
The "other" pipeline should receive only source state and state belonging to the split/forked resources.
Package state may be cloned (we need to make sure that the new refresh modes work correctly, though).
note 1: pipeline state mutates only during the extract step, so it is safe to override source/resource state on split/fork when state already exists.
note 2: schemas mutate during both the extract and normalize steps. If the pipeline was cloned after the extract step, we just initialize the schemas in the "other" pipeline. If it was cloned after the normalize step, we can overwrite them (source-wise, of course).
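A sketch of note 2's rule; the `cloned_after` flag and `clone_schemas` helper are illustrative, and schemas are treated here as a plain name-to-Schema mapping with a `clone()` method (an assumption about the Schema object):

```python
def clone_schemas(original_schemas, other_schemas, cloned_after: str, sources: set) -> None:
    # hypothetical sketch: how schemas move depends on the step after which
    # the pipeline was cloned
    for name in sources:
        if cloned_after == "extract":
            # schemas may still mutate during normalize: only seed missing ones
            other_schemas.setdefault(name, original_schemas[name].clone())
        elif cloned_after == "normalize":
            # schemas are settled for this run: overwrite source-wise
            other_schemas[name] = original_schemas[name].clone()
```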
user interface
Add split and fork methods on Pipeline. They should accept the "other" pipeline name and, optionally: destination/staging and dataset name. A usage sketch follows.
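Hypothetically, usage could look like the following; the `resources` keyword and the return values are assumptions on top of the proposal, and neither method exists yet:

```python
import dlt

@dlt.resource
def events():
    yield [{"id": 1, "type": "push"}, {"id": 2, "type": "watch"}]

pipeline = dlt.pipeline("github_events", destination="duckdb", dataset_name="events")
pipeline.extract(events)

# hypothetical: move the resource (packages, state, schema) into a new pipeline
other = pipeline.split(
    "github_events_other",        # "other" pipeline name
    resources=["events"],         # assumed way to indicate what moves
    destination="bigquery",       # optional, per the proposal
    dataset_name="events_other",  # optional, per the proposal
)

# hypothetical: same selection, but files/state/schema are copied (or hard-linked),
# e.g. for the backup use case (#944)
backup = pipeline.fork("github_events_backup")
```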