Limits are not applied to non-selected Resources
dlt version
0.5.1
Describe the problem
tl;dr: a rest_api resource marked selected: False, whose downstream transformations yield the tables, does not receive the limit when add_limit is applied at the top level of the source.
While developing a pipeline with the rest_api verified source, I encountered a situation where source.add_limit(#) does not apply as expected. The API endpoint I am consuming returns many subjects in a single JSON response. Instead of materializing a table from the top-level object, I have a transform function that pulls out various nested objects and arrays and yields records for separate tables. For reference, the object looks something like:
Support_Ticket_base_object:
-- Customer Details
-- Ticket Summary details
-- Subarray of Tags
-- Subarray of Messages
-- Subarray of actions taken by support on the ticket
Because I was controlling the base ticket object manually and didn't want flattened/JSON fields for tags etc., I marked the top-level resource as 'selected: False'.
Expected behavior
The resource should receive the limit applied to the source, and downstream transformations should receive only the limited number of batches from their parent resource. In this scenario, with add_limit(4) and the main resource returning batches of 100, I would expect the transform to receive 4 batches.
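To make the expectation concrete, here is a minimal plain-Python sketch of the semantics I have in mind. It does not use dlt at all: `tickets_pages`, `limited`, and this `process_tickets` are hypothetical stand-ins for the paginated parent resource, the limit, and the transformer, and the page counts are made up for illustration.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def tickets_pages(total_pages: int = 10, page_size: int = 100) -> Iterator[List[dict]]:
    """Stand-in for the paginated parent resource: yields one batch per page."""
    for page in range(total_pages):
        start = page * page_size
        yield [{"id": start + i} for i in range(page_size)]


def limited(pages: Iterable[List[dict]], max_items: int) -> Iterator[List[dict]]:
    """Stop the parent after max_items yields (one yield is one page here)."""
    return islice(pages, max_items)


def process_tickets(pages: Iterable[List[dict]]) -> Iterator[List[dict]]:
    """Stand-in for the transformer: it only sees what the parent yields."""
    for page in pages:
        yield [{"ticket_id": item["id"]} for item in page]


# The limit sits on the parent, so the transformer receives 4 batches of 100.
batches = list(process_tickets(limited(tickets_pages(), 4)))
print(len(batches), sum(len(b) for b in batches))  # prints: 4 400
```

In other words, I would expect the limit to cut off the parent regardless of whether the parent itself is selected for loading.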
Steps to reproduce
import dlt
from dlt.sources.helpers.rest_client.paginators import JSONResponseCursorPaginator
from rest_api import rest_api_source  # rest_api verified source

source = rest_api_source({
    "client": {
        "base_url": "https://mydomain.gorgias.com/api/",
        "auth": auth,
        "paginator": JSONResponseCursorPaginator(cursor_path="meta.next_cursor"),
    },
    "resource_defaults": {
        "primary_key": "id",
        "write_disposition": "merge",
        "endpoint": {
            "params": {
                "limit": 100,
            },
        },
    },
    "resources": [
        {
            "name": "teams",
            "endpoint": {
                "path": "teams",
            },
            "selected": False,
        },
        {
            "name": "tickets",
            "endpoint": {
                "path": "tickets",
                "params": {
                    "order_by": "updated_datetime:desc",
                    "limit": 100,
                    "view_id": 123
                }
            },
            "selected": False,
        }
    ],
}, max_table_nesting=0)

@dlt.transformer
def process_tickets(ticket_items):
    # ...code that takes tickets and plucks fields, and moves the 'id' field of
    # nested dicts to the top level, e.g. user{"id":....} becomes user_id
    yield dlt.mark.with_table_name(resulting_dict_list, "tickets")

load_info = pipeline.run(source.add_limit(4))
Operating system
macOS
Runtime environment
Local
Python version
3.11
dlt data source
rest_api
dlt destination
DuckDB
Other deployment details
No response
Additional information
The workaround is to materialize the table by setting selected: True on the top-level resource while in development.
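One way to apply this workaround without editing the config by hand is sketched below. It is plain dictionary manipulation on a rest_api-style config and makes no dlt calls. The `with_dev_selection` helper and the `dev_mode` flag are hypothetical names introduced for illustration, and the config is a trimmed version of the one in the reproduction.

```python
from copy import deepcopy
from typing import Any, Dict


def with_dev_selection(config: Dict[str, Any], dev_mode: bool) -> Dict[str, Any]:
    """Return a copy of the config; in dev mode every resource is selected."""
    result = deepcopy(config)
    if dev_mode:
        for resource in result.get("resources", []):
            # Resources may also be given as plain strings; only dicts carry flags.
            if isinstance(resource, dict):
                resource["selected"] = True
    return result


# Trimmed-down version of the config from the reproduction above.
config = {
    "resources": [
        {"name": "teams", "endpoint": {"path": "teams"}, "selected": False},
        {"name": "tickets", "endpoint": {"path": "tickets"}, "selected": False},
    ],
}

dev_config = with_dev_selection(config, dev_mode=True)    # tables materialized
prod_config = with_dev_selection(config, dev_mode=False)  # unchanged copy
```

The resulting dictionary would then be passed to rest_api_source in place of the original. The downside is that the top-level tables are created in the destination during development.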