feat(rest_api): support `.add_limit()` for derived endpoints
When using a REST API source that has endpoint dependencies, such as `repos/dlt-hub/dlt/actions/workflows/{resources.workflows.id}/runs`, you can't use `.add_limit()`:
```python
resources = [
    {
        "name": "workflows",
        "endpoint": {
            "path": f"repos/{owner}/{repo}/actions/workflows",
            "data_selector": "workflows",
        },
    },
    {
        "name": "runs",
        "endpoint": {
            "path": f"repos/{owner}/{repo}/actions/workflows/{{resources.workflows.id}}/runs",
            "data_selector": "workflow_runs",
        },
    },
]
```
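For context, here is a minimal sketch of how the limit ends up being applied. The client config, pipeline name, and destination are placeholders, and applying `.add_limit()` to every returned resource is an assumption about the usage that triggers the warning:

```python
import dlt
from dlt.sources.rest_api import rest_api_resources

github_resources = rest_api_resources({
    "client": {"base_url": "https://api.github.com/"},  # placeholder client config
    "resources": resources,  # the list defined above
})

pipeline = dlt.pipeline(pipeline_name="github_actions", destination="duckdb")
# Calling .add_limit() on the derived "runs" resource (a transformer) emits the warning below.
pipeline.run([r.add_limit(10) for r in github_resources])
```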
You will get:
```
...|dlt|resource.py|add_limit:404|Setting add_limit to a transformer runs has no effect. Set the limit on the top level resource.
```
❗ It's possible that there is no issue / bug and the warning is a false positive. For example, `.add_limit()` is properly set on the parent `workflows`, but it is also set on the child `workflows_runs`, which triggers the warning.
## Problem
The user-facing function `rest_api_resources()` returns resources that are already piped together. This is done internally by `create_resources()`, which uses the `data_from` kwarg of `@dlt.resource()` to instantiate transformers.
Then, calling `.add_limit()` on the returned `dlt.resource` doesn't successfully update the relationship with the other resources' pipes.
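As a rough analogy of that wiring, here is a minimal sketch using the public `@dlt.transformer` decorator (not the actual `create_resources()` internals; the resource bodies are made up):

```python
import dlt

@dlt.resource
def workflows():
    # stands in for the "workflows" endpoint
    yield from [{"id": 1}, {"id": 2}]

@dlt.transformer(data_from=workflows)
def runs(workflow):
    # stands in for the derived "runs" endpoint
    yield from [{"workflow_id": workflow["id"], "run": n} for n in range(3)]

# Limiting the transformer only logs the warning; the parent pipe feeding it is untouched.
runs.add_limit(10)
```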
## Todo
- Figure out how `.add_limit()` mutates resources in place and how it's propagated to transformer relationships (this should be accessible in a `DltResource.pipe` attribute; see the inspection sketch after this list).
- Update and improve the warning. The message `Setting add_limit to a transformer runs` is not useful given the user didn't set these things. The warning message should better describe the solution.
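For reference, a hypothetical inspection snippet; the attribute names (`_pipe`, `parent`, `name`) are assumptions based on the issue text and dlt internals and may differ between versions:

```python
from dlt.sources.rest_api import rest_api_source

config = {
    "client": {"base_url": "https://api.github.com/"},
    "resources": resources,  # the list defined at the top of the issue
}
source = rest_api_source(config)
runs = source.resources["runs"]

pipe = runs._pipe          # assumption: the Pipe object backing the resource
parent_pipe = pipe.parent  # assumption: the parent pipe feeding the transformer
print(pipe.name, "<-", parent_pipe.name if parent_pipe else None)
```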
## Related
- Similar issue with the built-in file readers source: https://github.com/dlt-hub/dlt/issues/2858
We can't limit the transformer directly because it has no generator element, so we cannot close it. What's possible (see the plain-generator sketch after this list):

- Propagate the `add_limit` step to the upstream resource via the connected pipe. This is easy, but it will limit the parent resource, so it may feel a little non-intuitive. Also, you cannot limit an unbounded transformer.
- Limit the transformer but close the parent resource. This needs to be pretty well tested to be sure that we drain all pending items so the extracted data is consistent (all the data in the extract pipe should be processed till the end).
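To illustrate the difference between the two options, here is a plain-Python generator analogy (not dlt code; the names and numbers are made up):

```python
import itertools

def parent():
    # stands in for the "workflows" resource
    yield from range(1000)

def transformer(items):
    # stands in for the derived "runs" transformer
    for i in items:
        yield i * 2

# Option 1: limit the parent; the transformer stops naturally once the parent is exhausted.
option_1 = list(transformer(itertools.islice(parent(), 10)))

# Option 2: limit the transformer, then close the parent explicitly; the risk is items
# still "in flight" between parent and transformer that must be drained for consistency.
gen = parent()
option_2 = list(itertools.islice(transformer(gen), 10))
gen.close()
```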
> Propagate the `add_limit` step to the upstream resource via the connected pipe. This is easy, but it will limit the parent resource, so it may feel a little non-intuitive. Also, you cannot limit an unbounded transformer.
I believe this is the intended / intuitive behavior in my case with `rest_api_source` or `rest_api_resources` (and the linked issue about file readers). I just want to be able to do the following to load 10 rows of all tables:
```python
github_action_source = rest_api_source(...)
pipeline.run(github_action_source().add_limit(10))
```
This currently doesn't work. Maybe this ticket is better framed as "`.add_limit()` on sources is not behaving as expected when transformers are involved".
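In the meantime, a possible workaround (an assumption based on the false-positive note above, i.e. that a limit set on the parent also bounds the derived resource) is to limit only `workflows`:

```python
# Hypothetical workaround: limit only the parent resource so the derived "runs"
# transformer is bounded by what the parent yields.
github_action_source = rest_api_source(...)  # same config as above
github_action_source.resources["workflows"].add_limit(10)
pipeline.run(github_action_source)
```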
> Limit the transformer but close the parent resource. This needs to be pretty well tested to be sure that we drain all pending items so the extracted data is consistent (all the data in the extract pipe should be processed till the end).
I understand what you're saying as:
```python
github_action_source = rest_api_source(...)()

# has 50 rows
workflows_resource = github_action_source.resources["workflows"]
# has 80 000 rows
runs_resource = github_action_source.resources["runs"]

# currently, it's impossible to limit `runs` without limiting `workflows`
runs_resource.add_limit(10)
```
Another avenue, which may be simpler (or not), would be for `.add_limit()` to apply to the `Paginator`.
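For example, with an offset-style paginator the limit could translate into a cap on how much is fetched. A sketch under the assumption that the endpoint uses dlt's offset paginator and its `maximum_offset` parameter (the GitHub endpoint may actually use a different pagination scheme):

```python
# Hypothetical: what `.add_limit(10)` could translate to at the paginator level.
{
    "name": "runs",
    "endpoint": {
        "path": f"repos/{owner}/{repo}/actions/workflows/{{resources.workflows.id}}/runs",
        "data_selector": "workflow_runs",
        "paginator": {              # assumption: offset pagination with a hard cap
            "type": "offset",
            "limit": 10,            # page size
            "maximum_offset": 10,   # stop after ~10 items in total
        },
    },
}
```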