
Limits are not applied to non-selected Resources

Open FridayPush opened this issue 1 year ago • 1 comments

dlt version

0.5.1

Describe the problem

tl;dr: a rest_api resource with selected: False whose downstream transformations yield tables does not receive a limit applied at the top level of the source.

While developing a pipeline with the rest_api verified source, I found that source.add_limit(#) does not apply as expected. The API endpoint I consume returns many subjects in a single JSON response. Instead of materializing a table from the top-level object, I use a transform function that pulls out various nested objects and arrays and yields records for separate tables. For reference, the object looks something like:

Support_Ticket_base_object:
-- Customer Details
-- Ticket Summary details
-- Subarray of Tags
-- Subarray of Messages
-- Subarray of actions taken by support on the ticket

Because I was building the base ticket record manually and didn't want a flattened/JSON field for tags etc., I marked the top-level resource as 'selected: False'.
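To make the shape of that transform concrete, here is a minimal pure-Python sketch of splitting one ticket into rows for separate tables. The field names (customer, tags, messages) and the child table names are hypothetical and are not taken from the real API; in the actual pipeline each (table, row) pair would be emitted through dlt.mark.with_table_name instead of a plain tuple.

```python
from typing import Any, Dict, Iterator, Tuple

Row = Dict[str, Any]


def split_ticket(ticket: Row) -> Iterator[Tuple[str, Row]]:
    """Yield (table_name, row) pairs for one ticket object.

    All field and table names here are assumptions for illustration.
    """
    ticket_id = ticket["id"]
    customer = ticket.get("customer") or {}

    # Base row: keep only scalar fields and lift the nested customer id.
    base = {k: v for k, v in ticket.items() if not isinstance(v, (dict, list))}
    base["customer_id"] = customer.get("id")
    yield "tickets", base

    # Each sub-array becomes its own table, keyed back to the ticket.
    for field, table in (("tags", "ticket_tags"), ("messages", "ticket_messages")):
        for item in ticket.get(field) or []:
            yield table, {"ticket_id": ticket_id, **item}


sample = {
    "id": 1,
    "subject": "Refund",
    "customer": {"id": 7, "email": "a@example.com"},
    "tags": [{"id": 10, "name": "billing"}],
    "messages": [{"id": 100, "body": "hi"}, {"id": 101, "body": "ok"}],
}
rows = list(split_ticket(sample))
```

With this layout the top-level object never needs its own flattened table, which is why the parent resource is left unselected.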

Expected behavior

The resource should receive the limit applied to the source, and downstream transformations should receive only the limited number of batches from their parent resource. In this scenario, with source.add_limit(4) and a main resource that returns batches of 100, I would expect the transform to receive 4 batches.
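The expectation can be modelled without dlt at all. The sketch below uses plain generators and itertools.islice as a stand-in for a limit on the parent; it is not how dlt implements add_limit, and the function names and the total of 1000 records are assumptions made for the example.

```python
from itertools import islice
from typing import Dict, Iterator, List

Batch = List[Dict[str, int]]


def tickets_resource(total: int = 1000, page_size: int = 100) -> Iterator[Batch]:
    """Stand-in for the paginated parent resource: yields one batch per page."""
    for start in range(0, total, page_size):
        yield [{"id": i} for i in range(start, min(start + page_size, total))]


def process_tickets(batches: Iterator[Batch]) -> Iterator[Batch]:
    """Stand-in for the transformer: it only sees what the parent emits."""
    for batch in batches:
        yield [{"ticket_id": t["id"]} for t in batch]


# A limit of 4 on the parent caps it at 4 batches, so the transformer
# downstream also processes exactly 4 batches.
limited_parent = islice(tickets_resource(), 4)
received = list(process_tickets(limited_parent))

batch_count = len(received)
record_count = sum(len(batch) for batch in received)
```

Here batch_count is 4 and record_count is 400, regardless of whether the parent's own output is kept as a table.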

Steps to reproduce

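import dlt
from dlt.sources.helpers.rest_client.paginators import JSONResponseCursorPaginator
# Assumed import path for the rest_api verified source copied into the project
from rest_api import rest_api_source

# auth and pipeline are defined elsewhere (not shown)
source = rest_api_source({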
    "client": {
        "base_url": "https://mydomain.gorgias.com/api/",
        "auth": auth,
        "paginator": JSONResponseCursorPaginator(cursor_path="meta.next_cursor"),
    },
    "resource_defaults": {
        "primary_key": "id",
        "write_disposition": "merge",
        "endpoint": {
            "params": {
                "limit": 100,
            },
        },
    },
    "resources": [
        {
            "name": "teams",
            "endpoint": {
                "path": "teams",
            },
            "selected": False,
        },
        {
            "name": "tickets",
            "endpoint": {
                "path": "tickets",
                "params": {
                    "order_by": "updated_datetime:desc",
                    "limit": 100,
                    "view_id": 123
                }
            },
            "selected": False,
        }
    ],
}, max_table_nesting=0)

@dlt.transformer
def process_tickets(ticket_items):
    # ...code that takes tickets and plucks fields, and lifts the 'id' field of
    # nested dicts to the top level, like user{"id":....} to user_id
    # resulting_dict_list is built by the elided code above
    yield dlt.mark.with_table_name(resulting_dict_list, "tickets")

load_info = pipeline.run(source.add_limit(4))

Operating system

macOS

Runtime environment

Local

Python version

3.11

dlt data source

rest_api

dlt destination

DuckDB

Other deployment details

No response

Additional information

Workaround: materialize the table with selected: True while in development.
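One way to keep that workaround from leaking into production is to drive the flag from a single switch. This is only a sketch: the helper name and the dev_mode parameter are hypothetical, and the dictionary simply mirrors the "tickets" entry from the reproduction (endpoint params omitted for brevity).

```python
from typing import Any, Dict


def tickets_resource_config(dev_mode: bool) -> Dict[str, Any]:
    """Build the "tickets" resource entry for the rest_api config.

    In development the parent table is materialized (selected: True) so the
    source-level limit takes effect; otherwise it stays unselected.
    """
    return {
        "name": "tickets",
        "endpoint": {"path": "tickets"},
        "selected": dev_mode,
    }


dev_config = tickets_resource_config(dev_mode=True)
prod_config = tickets_resource_config(dev_mode=False)
```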

FridayPush avatar Jul 16 '24 15:07 FridayPush