dlt
Creates a single source in extract for all resource instances passed as a list
Description
We discovered a peculiar problem with `rest_api` when users passed a list of resources to the `run` function that contained both a resource and a transformer:
```python
import dlt

page = 1

@dlt.resource(name="pages")
def gen_pages():
    global page  # `nonlocal` in the original test, which ran inside a function
    while True:
        yield {"page": page}
        if page == 10:
            return
        page += 1

@dlt.transformer(name="subpages")
def get_subpages(page_item):
    yield from [
        {
            "page": page_item["page"],
            "subpage": subpage,
        }
        for subpage in range(1, 11)
    ]

pipeline = dlt.pipeline("test_resource_transformer_standalone", destination="duckdb")
# here we must combine resources and transformers using the same instance
info = pipeline.run([gen_pages, gen_pages | get_subpages])
```
In the case above, only the last page was passed to the transformer (see the commits for tests and details).

The root cause is that each resource in the list was packaged in a separate source and extracted separately. That prevented any DAG optimizations, and `gen_pages` was extracted twice.

Here we change the behavior so that a single dlt source is used to extract all the resources in the list.
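The effect of the change can be illustrated without dlt internals. Below is a minimal sketch using a plain stateful generator: evaluating the generator once per list entry (the old behavior of separate sources) runs its side effects twice, while sharing one evaluation among consumers (the new single-source behavior) runs it once. The `call_count` counter and the use of `itertools.tee` are illustrative assumptions, not dlt APIs.

```python
import itertools

call_count = 0

def gen_pages():
    # stateful generator standing in for the `pages` resource above
    global call_count
    call_count += 1
    for page in range(1, 11):
        yield {"page": page}

# old behavior: each list entry became its own source,
# so the generator was evaluated once per entry
pages_a = list(gen_pages())
pages_b = list(gen_pages())
assert call_count == 2  # extracted twice

# new behavior: one source, one evaluation, shared among consumers
call_count = 0
shared, for_transformer = itertools.tee(gen_pages())
pages = list(shared)
subpages = [
    {"page": item["page"], "subpage": sub}
    for item in for_transformer
    for sub in range(1, 11)
]
assert call_count == 1  # extracted once
```

With a single source, the transformer sees every page rather than only the last one, because the resource instance is evaluated exactly once and its output is fanned out within the same extraction.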
@sh-rp I'll add this to our release notes as one of the changes.

@sh-rp also, you are partially right about the parallelism! We have tests that pass a list of many resources with the same names, and those tests are failing. We'd need to package them in separate sources and execute them one by one.