use a dummy operator at the start of parallel pipelines
Description
Use a `DummyOperator` instead of the first source as the starting task, so that all sources, including the first one, run in parallel.
Related Issues
- Fixes #2196
Additional Context
The first source can take a long time to run. Parallelizing it along with the other sources can make pipelines faster.
Deploy Preview for dlt-hub-docs canceled.
| Name | Link |
|---|---|
| Latest commit | 7fc82a3fce1d19f9f2b8fda0edcfbb0095f661cd |
| Latest deploy log | https://app.netlify.com/sites/dlt-hub-docs/deploys/677e7bfdb52f8300086fd4d2 |
@alucryd there's a reason to run the first task and then all others in parallel: it creates the initial schema in the database and the standard dlt tables. All tasks share the same dataset.
If you still want to work on this PR, then let's add a new option to `add_run`, i.e. `dummy_task_first`, and if it is set to True, do what you do right now. I do not want to change the existing behavior: too many deployments that may rely on it are in production.
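To make the two layouts concrete, here is a minimal sketch in plain Python, not the actual dlt Airflow helper code. It assumes the proposed option is named `dummy_task_first`; the function `plan_parallel_edges` and the task name `dummy_start` are hypothetical and exist only for illustration.

```python
from typing import List, Tuple

Edge = Tuple[str, str]  # (upstream task, downstream task)


def plan_parallel_edges(sources: List[str], dummy_task_first: bool = False) -> List[Edge]:
    """Return the dependency edges for a parallel layout.

    Hypothetical helper for illustration only; it is not part of dlt or Airflow.
    """
    if not sources:
        return []
    if dummy_task_first:
        # Proposed opt-in behavior: a no-op start task fans out to every
        # source, so the first source is parallelized as well.
        return [("dummy_start", name) for name in sources]
    # Existing default: the first source runs alone (creating the schema and
    # the standard dlt tables), then the remaining sources run in parallel.
    first, rest = sources[0], sources[1:]
    return [(first, name) for name in rest]
```

With the default, `plan_parallel_edges(["a", "b", "c"])` makes `b` and `c` wait on `a`; with `dummy_task_first=True`, all three depend only on the no-op start task, so nothing guarantees that the schema exists before the parallel tasks begin.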
I see, thanks for the heads up, I don't have the full picture yet but I'm getting there. I ran this change in production and didn't run into any issue with a completely new datasource so I wrongly assumed it would be harmless.
I assume it would be too much work to split the schema and table creations and only run that in the first task?
In any case I'll add the proposed option and default it to false so it doesn't impact anyone.
@alucryd yeah, we could think of some "preparatory" task, but IMO in that case it is better to just create a callback that receives the DAG from the airflow helper and can modify it... we already have `on_before_run`; we could also add `on_dag_created`, where you get this tree of tasks.
But that's a separate ticket, I'd say, if you'd like to try to add it.
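A rough sketch of how such a callback could work, in plain Python rather than against the real helper. Only the name `on_dag_created` comes from this thread; its signature, the `TaskTree` shape, and the functions `build_tasks` and `add_prepare_task` are assumptions made up for illustration.

```python
from typing import Callable, Dict, List, Optional

# Assumed shape of the "tree of tasks": task name -> names of its upstream tasks.
TaskTree = Dict[str, List[str]]


def build_tasks(
    sources: List[str],
    on_dag_created: Optional[Callable[[TaskTree], None]] = None,
) -> TaskTree:
    """Build the default layout, then let the caller modify it (hypothetical)."""
    if not sources:
        return {}
    first, rest = sources[0], sources[1:]
    tree: TaskTree = {first: []}
    for name in rest:
        tree[name] = [first]  # every other source waits on the first one
    if on_dag_created is not None:
        on_dag_created(tree)  # the callback may rewire the tree in place
    return tree


def add_prepare_task(tree: TaskTree) -> None:
    """Example callback: put a "prepare" task in front of every root task."""
    roots = [name for name, upstream in tree.items() if not upstream]
    tree["prepare"] = []
    for name in roots:
        tree[name].append("prepare")
```

Calling `build_tasks(["a", "b"], on_dag_created=add_prepare_task)` yields a tree in which `a` depends on `prepare` and `b` still depends on `a`. The default layout stays untouched for anyone who does not pass a callback.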
@alucryd do you plan to continue on this?
Hi, @rudolfix @alucryd I would like to give this a shot, if it is alright :)
@prakharcode I will close this PR for no activity, if you'd like to continue it or provide a new one, please re-open this or create a new PR. Thanks :)