dlt
Creates a single source in extract for all resource instances passed as a list
Description
We discovered a peculiar problem with `rest_api` when users passed a list of resources to the `run` function that contained both a resource and a transformer:
```python
import dlt

page = 1

@dlt.resource(name="pages")
def gen_pages():
    global page  # `nonlocal` in the original test, which ran inside a function
    while True:
        yield {"page": page}
        if page == 10:
            return
        page += 1

@dlt.transformer(name="subpages")
def get_subpages(page_item):
    yield from [
        {
            "page": page_item["page"],
            "subpage": subpage,
        }
        for subpage in range(1, 11)
    ]

pipeline = dlt.pipeline("test_resource_transformer_standalone", destination="duckdb")
# here we must combine resources and transformers using the same instance
info = pipeline.run([gen_pages, gen_pages | get_subpages])
```
In the case above, only the last page was passed to the transformer (see the commits for tests and details).

The root cause is that each resource in the list was packaged in a separate source and extracted separately. That prevented any DAG optimizations, and `gen_pages` was extracted twice.

Here we change the behavior so that a single dlt source is used to extract all the resources in the list.
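The effect of the change can be illustrated without dlt internals. Below is a minimal sketch using a plain stateful generator: evaluating the generator once per list entry (the old behavior of separate sources) runs its side effects twice, while sharing one evaluation among consumers (the new single-source behavior) runs it once. The `call_count` counter and the use of `itertools.tee` are illustrative assumptions, not dlt APIs.

```python
import itertools

call_count = 0

def gen_pages():
    # stateful generator standing in for the `pages` resource above
    global call_count
    call_count += 1
    for page in range(1, 11):
        yield {"page": page}

# old behavior: each list entry became its own source,
# so the generator was evaluated once per entry
pages_a = list(gen_pages())
pages_b = list(gen_pages())
assert call_count == 2  # extracted twice

# new behavior: one source, one evaluation, shared among consumers
call_count = 0
shared, for_transformer = itertools.tee(gen_pages())
pages = list(shared)
subpages = [
    {"page": item["page"], "subpage": sub}
    for item in for_transformer
    for sub in range(1, 11)
]
assert call_count == 1  # extracted once
```

With a single source, the transformer sees every page rather than only the last one, because the resource instance is evaluated exactly once and its output is fanned out within the same extraction.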
@sh-rp I'll add this to our release notes as one of the changes.

@sh-rp also, you are partially right about the parallelism! We have tests that pass a list of many resources with the same names, and those tests are failing. We'd need to package them in separate sources and execute them one by one.