
feat(rest_api): support `.add_limit()` for derived endpoints

zilto opened this issue 3 months ago

When using a REST API source that has endpoint dependencies, such as repos/dlt-hub/dlt/actions/workflows/{resources.workflows.id}/runs, you can't use .add_limit() on the derived endpoint.

resources = [
    {
        "name": "workflows",
        "endpoint": {
            "path": f"repos/{owner}/{repo}/actions/workflows",
            "data_selector": "workflows",
        },
    },
    {
        "name": "runs",
        "endpoint": {
            "path": f"repos/{owner}/{repo}/actions/workflows/{{resources.workflows.id}}/runs",
            "data_selector": "workflow_runs"
        }
    },
]
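
Calling .add_limit() on the derived resource, for example like this (a hypothetical call site, since the issue omits it; config is assumed to be a full REST API config containing a client section plus the resources list above):

from dlt.sources.rest_api import rest_api_resources

# hypothetical call site: try to limit only the derived "runs" endpoint
workflows, runs = rest_api_resources(config)
runs.add_limit(10)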

You will get

...|dlt|resource.py|add_limit:404|Setting add_limit to a transformer runs has no effect. Set the limit on the top level resource.

❗ It's possible that there is no actual bug and the warning is a false positive. For example, .add_limit() may be properly set on the parent workflows while also being set on the child runs, which is what triggers the warning.

Problem

The user-facing function rest_api_resources() returns resources that are already piped together. This is done internally by create_resources(), which uses the data_from kwarg of @dlt.resource() to instantiate transformers.

Then, calling .add_limit() on the returned dlt.resource doesn't correctly update its relationship with the other resources' pipes.
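
To make the mechanics concrete, here is a minimal sketch (hand-built resources rather than the rest_api internals, so the names are illustrative) that reproduces the same warning:

import dlt

@dlt.resource
def workflows():
    yield from [{"id": 1}, {"id": 2}]

# the rest_api source builds dependent endpoints as transformers fed
# from the parent's pipe, conceptually like this
@dlt.transformer(data_from=workflows)
def runs(workflow):
    yield {"workflow_id": workflow["id"]}

# the limit step lands on the transformer, which has no generator to close,
# so dlt only logs "Setting add_limit to a transformer runs has no effect."
runs.add_limit(10)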

Todo

  • Figure out how .add_limit() mutates resources in place and how it's propagated to transformer relationships (this should be accessible via a DltResource.pipe attribute)
  • Update and improve the warning. The message "Setting add_limit to a transformer runs has no effect" is not useful because the user never set a limit on the transformer directly. The warning message should better describe the solution

Related

  • similar issue with the built-in file readers source: https://github.com/dlt-hub/dlt/issues/2858

zilto avatar Sep 17 '25 18:09 zilto

We can't limit a transformer directly because it has no generator element, so we cannot close it. What's possible:

  • propagate the add_limit step to the upstream resource via the connected pipe (see the sketch after this list). This is easy, but it will limit the parent resource, so it may feel a little non-intuitive; also, you cannot limit an unbounded transformer this way
  • limit the transformer but close the parent resource. This needs to be tested thoroughly to be sure that we drain all pending items so that the extracted data is consistent (all the data in the extract pipe should be processed till the end)
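
As a point of reference, the first option roughly matches what the warning message already recommends: set the limit on the parent resource, and the dependent endpoint only ever sees the limited items. A minimal user-side sketch, assuming the config from the issue description:

import dlt
from dlt.sources.rest_api import rest_api_source

pipeline = dlt.pipeline(pipeline_name="github_actions", destination="duckdb")

# `config` is assumed to be the full REST API config from the issue description
source = rest_api_source(config)
# limiting the parent also bounds `runs`, because `runs` is fed from `workflows`' pipe
source.resources["workflows"].add_limit(2)
pipeline.run(source)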

rudolfix avatar Sep 17 '25 20:09 rudolfix

propagate the add_limit step to the upstream resource via the connected pipe. This is easy, but it will limit the parent resource, so it may feel a little non-intuitive; also, you cannot limit an unbounded transformer this way

I believe this is the intended / intuitive behavior in my case with rest_api_source or rest_api_resources (and for the linked issue about file readers). I just want to be able to do the following to load 10 rows from every table.

github_action_source = rest_api_source(...)

pipeline.run(github_action_source.add_limit(10))

This currently doesn't work. Maybe this ticket is better framed as ".add_limit() on sources does not behave as expected when transformers are involved".

limit the transformer but close the parent resource. This needs to be tested thoroughly to be sure that we drain all pending items so that the extracted data is consistent (all the data in the extract pipe should be processed till the end)

I understand what you're saying as

github_action_source = rest_api_source(...)

# has 50 rows
workflows_resource = github_action_source.resources["workflows"] 
# has 80 000 rows
runs_resource = github_action_source.resources["runs"]

# currently, it's impossible to limit `runs` without limiting `workflows`
runs_resource.add_limit(10)

zilto avatar Sep 18 '25 20:09 zilto

Another avenue, which may be simpler (or not), would be for .add_limit() to apply to the Paginator instead.
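
A hedged sketch of what that could look like with the existing building blocks (PageNumberPaginator and its total_path / maximum_page parameters are taken from dlt's REST client paginators, treated here as assumptions; this is not a proposal for how .add_limit() would be wired):

from dlt.sources.helpers.rest_client.paginators import PageNumberPaginator

# hypothetical: cap the pages fetched for the derived endpoint at the paginator
# level instead of limiting the resource/transformer pipe with .add_limit()
runs_config = {
    "name": "runs",
    "endpoint": {
        # `owner` and `repo` as in the config from the issue description
        "path": f"repos/{owner}/{repo}/actions/workflows/{{resources.workflows.id}}/runs",
        "data_selector": "workflow_runs",
        # assumed parameters: no total count in the response, stop after 2 pages
        "paginator": PageNumberPaginator(total_path=None, maximum_page=2),
    },
}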

zilto avatar Sep 23 '25 18:09 zilto