
local dlt pipeline cli runner

Open · rudolfix opened this issue 1 year ago · 3 comments

Background: We are looking for a convenient way to execute dlt pipelines from the command line, ideally with minimal or no additional code.

There are two options to investigate (not mutually exclusive, we just need to start somewhere):

  1. a command to run any source or resource from a specified module (no additional user code needed)
  2. a command to run an instance of a pipeline defined in a user pipeline script (some user code required: import a source/resource, pass parameters and instantiate the source, instantiate a pipeline with a given name, etc.)

In case of (1), the user would specify the module and the name of the source(s), plus the parameters needed to instantiate them (we can use the fire lib to generate CLI interfaces automatically for source/resource functions: https://github.com/google/python-fire). The command would create a dlt pipeline instance, attach a destination and dataset to it, import the desired source, instantiate it with the passed parameters, and then run it (see below).
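A minimal stdlib sketch of the dynamic-import part of option (1). This is illustrative only: `run_from_module` is a hypothetical name, and the returned dict stands in for what would really be `dlt.pipeline(...).run(source)`; the issue suggests python-fire for the actual flag parsing.

```python
# Sketch of option (1): import a source factory from a user-specified
# module by name, instantiate it with CLI-passed parameters, and "run" it.
# run_from_module is a hypothetical helper, not an existing dlt API.
import importlib


def run_from_module(module_name, source_name, destination, dataset_name, **source_kwargs):
    module = importlib.import_module(module_name)
    source_factory = getattr(module, source_name)
    # instantiate the source with the parameters passed on the command line
    source = source_factory(**source_kwargs)
    # in real dlt this would roughly be:
    #   pipeline = dlt.pipeline(destination=destination, dataset_name=dataset_name)
    #   pipeline.run(source)
    return {"destination": destination, "dataset": dataset_name, "source": source}
```

A thin fire wrapper (`fire.Fire(run_from_module)`) would then turn the source function's keyword arguments into CLI flags automatically.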

In case of (2), the user would write a pipeline script in which the source(s) and pipeline are instantiated, then pass the script name along with the pipeline and source names to the runner, which executes them. In this case the destination/dataset etc. would be overridden for the actual run, making it possible to switch from a dev destination to a production one.
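The override mechanics of option (2) can be sketched like this. `Pipeline` below is a plain stand-in dataclass, not dlt's real `Pipeline`, and `run_named_pipeline` is a hypothetical runner helper:

```python
# Sketch of option (2): the user's script instantiates named pipelines;
# the runner looks one up by name and overrides destination/dataset for
# this run, so a dev pipeline can be pointed at production unchanged.
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Pipeline:  # stand-in for dlt's Pipeline, for illustration only
    name: str
    destination: str
    dataset_name: str


def run_named_pipeline(pipelines, name, destination=None, dataset_name=None):
    pipeline = pipelines[name]
    # apply CLI overrides without mutating the script's original object
    if destination is not None:
        pipeline = replace(pipeline, destination=destination)
    if dataset_name is not None:
        pipeline = replace(pipeline, dataset_name=dataset_name)
    return pipeline  # a real runner would call pipeline.run(...) here
```

The point of the copy-with-override is that the script keeps one canonical pipeline definition while the CLI decides where a given run actually loads.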

Both (1) and (2) share a few features that we already have in our Airflow helper (https://github.com/dlt-hub/dlt/blob/master/dlt/helpers/airflow_helper.py#L39):

  • option to select/deselect the resources being loaded
  • option to force a full load (set all resources to replace)
  • option to set up buffer sizes and parallelism
  • option to retry the extract/load/normalize stages
  • option to abort if any job fails
  • option to load lineage data together with the dataset
  • probably many others
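The common options above could map onto CLI flags roughly as follows. The flag names are illustrative, not dlt's actual interface:

```python
# Hedged sketch: one possible argparse surface for the shared runner
# options listed above. All flag names and defaults are assumptions.
import argparse


def build_parser():
    p = argparse.ArgumentParser("dlt-run")
    p.add_argument("--resources", nargs="*",
                   help="select resources to load (others are deselected)")
    p.add_argument("--full-refresh", action="store_true",
                   help="force a full load: set all resources to replace")
    p.add_argument("--buffer-size", type=int, default=5000,
                   help="in-memory buffer size for extract/normalize")
    p.add_argument("--retries", type=int, default=3,
                   help="retry count for the extract/load/normalize stages")
    p.add_argument("--abort-on-failed-job", action="store_true",
                   help="abort the run if any load job fails")
    return p
```

A real implementation would translate these into the corresponding pipeline/runtime settings, much like the Airflow helper does.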

An option to backfill could be available for resources that use the Incremental class for incremental loading and are aware of external schedulers. In that case a start and end value could be passed from the CLI (not only dates but also timestamps or integers, whatever is used as the incremental cursor).
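Since the incremental cursor may be a date, a timestamp, or an integer, the CLI would need to parse the start/end values into the right type. A small sketch, with `parse_cursor` and `backfill_window` as hypothetical helpers (real dlt resources would receive these as the incremental's initial and end values):

```python
# Sketch of parsing CLI-supplied backfill bounds: try integer, then
# ISO date/timestamp, then fall back to the raw string.
from datetime import datetime


def parse_cursor(value):
    try:
        return int(value)
    except ValueError:
        pass
    try:
        return datetime.fromisoformat(value)
    except ValueError:
        return value  # leave opaque cursor values untouched


def backfill_window(start, end):
    # the pair bounding one backfill run of an incremental resource
    return (parse_cursor(start), parse_cursor(end))
```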

rudolfix avatar Dec 01 '23 17:12 rudolfix

Nice! I'm definitely for (2), as this is something I'm already doing. A few things that lean me towards it:

  • It's easier to understand how the code flows through a defined pipeline object.
  • There are (often?) some custom/extra steps that would be included in a pipeline definition (e.g. loading some specific creds), so a source/resource alone won't really be enough IMO.
  • I feel it's somewhat dangerous to be able to run from any source to any destination from the CLI. I would rather have a strict pipeline and expose only that to the CLI.

mehd-io avatar Dec 04 '23 08:12 mehd-io

Hi @rudolfix, and thanks for considering our feedback from Slack.

Like @mehd-io, I'm already using your second scenario with a custom CLI wrapper based on Click, so I'd support that option as well.

I also agree with @mehd-io's concerns about the first option. Furthermore, my own concern is that creating a CLI runner for a source/resource might lead to confusion, especially for those less familiar with the tool. In this approach, a pipeline is created behind the scenes, but on the surface it might blur the distinction between a pipeline and a source/resource, as the latter might also function as a pipeline in practice, given the option to run it as such.

Just my 2 cents!

geo909 avatar Dec 04 '23 10:12 geo909

@sultaniman please read the code in the dlt.cli namespace so you'll see that a lot of things are already done. Runner example: https://dlthub.com/devel/examples/chess_production/

rudolfix avatar Feb 22 '24 10:02 rudolfix