Feature: Add resource history support to access parent data
Feature description
Currently, when building resource/transformer pipelines in dlt, referencing parent data requires explicitly passing it down through every child transformer. This often leads to parent fields being unnecessarily propagated into downstream records, which complicates data consistency and pipeline logic.
The proposed feature introduces `history`, a special parameter available in resources/transformers that provides access to any yielded data point from a parent resource/transformer, as long as that parent has been defined with `keep_history=True`.
This enables direct access to parent records without having to explicitly pass them down the pipeline.
Are you a dlt user?
Yes, I'm already a dlt user.
Use case
Consider the following situation:
```python
@dlt.resource(keep_history=True)
def users():
    yield {"id": 1, "name": "Alice"}
    yield {"id": 2, "name": "Bob"}

@dlt.transformer(data_from=users)
def posts(user):
    yield {"post_id": 1, "text": "Hello"}
    yield {"post_id": 2, "text": "World"}

@dlt.transformer(data_from=posts)
def comments(post, history: dlt.History = None):
    # Suppose a request suddenly needs the user id because of poor API design.
    # Instead of propagating user_id through the JSON and adding data
    # complexity, we can retrieve it from the history.
    user = history[users]
    yield {"comment_id": 10, "post_id": post["post_id"], "user_id": user["id"]}
```
Without history, the user object would need to be passed through posts → comments, cluttering the downstream schema with redundant fields.
With history, transformers can reference upstream records directly, improving data consistency and schema cleanliness.
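For contrast, here is a sketch of what the same pipeline looks like without `history` (plain functions, dlt decorators omitted): the user id has to be threaded through `posts` purely so that `comments` can see it later.

```python
# Sketch: the pass-through pattern that `history` is meant to replace.
# Plain generators stand in for dlt resources/transformers.

def users():
    yield {"id": 1, "name": "Alice"}

def posts(user):
    # user_id is carried along only so that comments can read it later
    yield {"post_id": 1, "text": "Hello", "user_id": user["id"]}

def comments(post):
    yield {"comment_id": 10,
           "post_id": post["post_id"],
           "user_id": post["user_id"]}  # redundant pass-through field
```

The redundant `user_id` field on `posts` records is exactly the schema clutter described above.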
Proposed solution
- Introduce a `history` object injected into transformers when declared as a parameter: `def transformer_fn(item, history: dlt.History = None): ...`
- The definition style of `history` is the same as the `meta` feature.
- `history` allows access to parent resources/transformers by either:
  - resource reference: `history[users]`
  - string key/name: `history["users"]`
- Access is only available if the parent was defined with `keep_history=True`. Otherwise, an `InvalidHistoryAccess` exception is raised.
- `EMPTY_HISTORY` is provided when no history is available.
- Attempting to access a non-existent parent raises `InvalidHistoryAccess`.
- The feature is optional and does not break existing transformers.
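To make the proposed semantics concrete, here is a minimal, hypothetical sketch of what such a `History` mapping could look like. The class body, `InvalidHistoryAccess`, and `EMPTY_HISTORY` are illustrative assumptions based on the bullet points above, not dlt's actual implementation.

```python
# Illustrative sketch only -- not dlt's real API.

class InvalidHistoryAccess(KeyError):
    """Raised when a parent kept no history or does not exist."""


class History:
    def __init__(self, items=None):
        # items maps resource_name -> last yielded item
        self._items = dict(items or {})

    def __getitem__(self, key):
        # Accept either a resource object (anything with a `name`
        # attribute) or a plain string key.
        name = getattr(key, "name", key)
        try:
            return self._items[name]
        except KeyError:
            raise InvalidHistoryAccess(
                f"No history kept for parent {name!r}. "
                "Was it defined with keep_history=True?"
            ) from None

    def record(self, name, item):
        # Called by the pipe runner after a keep_history parent yields.
        self._items[name] = item


# Sentinel used when no ancestor kept history.
EMPTY_HISTORY = History()
```

With this sketch, `history["users"]` returns the last record yielded by `users`, while any lookup on `EMPTY_HISTORY` raises `InvalidHistoryAccess`.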
Related issues
No response
Hi @ArneDePeuter! Thanks for the thoughtful issue and pull request. I reviewed the PR and the code quality is great. Thanks for adding tests too! This is a significant feature that we'll need to discuss with the engineering team.
Problem
This is relevant when using dlt.resource and dlt.transformer for processing data before Load step. Currently, dlt is limiting because the yield statement of the resource/transformer couples two things:
- the data written to the destination table
- the data passed to the immediate children (i.e., next transformer)
This is a problem I've also encountered when doing RAG ingestion with dlt and having to manage references between docs, chunks, embeddings, etc.
Suggested solution
Create a `History` key-value mapping `{resource_name: item}` (could be a record or a batch) for the duration of the Pipe being processed. This allows a downstream `@dlt.transformer` to access an ancestor's history.
Conceptually, we're effectively going from pipe processing to a DAG. This has significant implications for guarantees and optimization. An alternative API of the example use case would be
```python
@dlt.resource
def users(): ...

@dlt.transformer(data_from=[users])
def posts(user): ...

@dlt.transformer(data_from=[users, posts])
def comments(user, posts): ...
```
(This is exactly how Apache Hamilton works)
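The multi-input style above can be emulated with a tiny topological resolver. This is only a toy sketch of the DAG idea; the runner and its wiring-by-parameter-name convention are invented for illustration and are not how dlt or Hamilton are implemented.

```python
# Toy sketch: DAG-style resolution where each node's parameter names
# declare its upstream dependencies (all names are illustrative).
import inspect

def users():
    return [{"id": 1, "name": "Alice"}]

def posts(users):
    return [{"post_id": 1, "user_id": users[0]["id"]}]

def comments(users, posts):
    # Both ancestors are direct inputs -- no hidden history needed,
    # at the cost of a fully DAG-shaped execution model.
    return [{"comment_id": 10,
             "post_id": posts[0]["post_id"],
             "user_id": users[0]["id"]}]

def run_dag(nodes):
    """Call each node, injecting previously computed results whose
    names match the node's parameter names (assumes topological order)."""
    results = {}
    for fn in nodes:
        params = inspect.signature(fn).parameters
        kwargs = {p: results[p] for p in params}
        results[fn.__name__] = fn(**kwargs)
    return results

out = run_dag([users, posts, comments])
```

The key difference from pipe processing is that `comments` receives `users` directly as an input edge rather than looking it up from ambient state.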
Comments / caveats
The motivating use case is a bit simple and would probably be handled decently by dlt's automated normalization. @ArneDePeuter do you have any other scenario where this feature is useful? You can give a short description or mock-up code
State generally adds a lot of complexity and makes things harder to debug. We need to have very clear specifications of what is stored in History and for how long.
- `History` can grow large if a resource/transformer returns a batch or a pyarrow table.
- Right now, we can have `ResourceA -> TransformB` and `ResourceA -> TransformC` in parallel. Adding a `History` reference to `A` means:
  - that `B` and `C` need to maintain their own copy of the `History`, which furthers the memory problem,
  - OR that `B` and `C` are blocking until both complete (don't know if this is how it currently works).
  - In either case, `pipeline.run()` is always limited by the slowest transform.
Next steps
Let's keep the discussion about the feature here before discussing implementation directly on the PR.
Hi @zilto, Thank you for the quick and nice response!
Conceptually, we're effectively going from pipe processing to a DAG. This has significant implications for guarantees and optimization. ... (This is exactly how Apache Hamilton works)
I agree the motivating API is DAG-like. My intent with history is to avoid changing dlt’s core tree scheduling/execution model while still addressing the coupling between “what is written” and “what is passed downstream”.
Why `history` (minimal surface change)
- `history` is a scoped, read-only key→value mapping of ancestor outputs on the current path (e.g., `{resource_name: last_item_or_handle}`).
- It does not introduce new scheduling edges or cross-branch coordination; execution remains tree-based. This keeps guarantees/optimizations intact and avoids a breaking shift to a full DAG API.
- It solves the common pain where downstream transforms need context from ancestors without polluting payloads or refactoring signatures.
@ArneDePeuter do you have any other scenario where this feature is useful?
I haven’t used dlt closely in a while, but I clearly recall this being a personal struggle when I did. Because I’m not working with it day-to-day right now it’s harder for me to pull a fresh real-world scenario (I hope that’s okay 🙂), but I do remember running into exactly the same issue you described.
We need to have very clear specifications of what is stored in History and for how long.
Good point! I agree that memory management rules are a nice addition.
On the parallelism concern: since execution is tree-structured and resource names are unique, sibling transformers (`TransformB`, `TransformC`) can safely share the same parent history reference without duplication or blocking, because each sibling operates on its own separate branch.
I’m open to further feedback and discussion, I like where this is heading and looking forward to the team’s thoughts.
Nuance on branching/parallelism
- Parallel siblings are safe: since resource names are unique, sibling transformers can share the same parent history reference without duplication or blocking.
- Fan-out cases (one upstream item → multiple downstream items) require each branch to maintain a consistent view of its upstream state. Two possible approaches:
  - Option A (dict-based): fork by copying the mapping at collision points. This makes lookups simple, but increases memory usage.
  - Option B (tree-based): represent history as a chain of nodes; forking just creates a child node for the override. This avoids dict copies, but by default lookups walk up the parent chain. (Optionally, a lightweight lookup table can be added for O(1) access at the cost of extra memory per branch.)

Both approaches preserve branch-scope consistency. This is an implementation detail rather than feature design, but I wanted to surface the trade-offs I noticed.
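Option B can be sketched as a parent-linked chain of nodes, where forking a branch creates a single new node instead of copying a dict. The class and method names below are illustrative only:

```python
# Illustrative sketch of Option B: history as a parent-linked chain.

class HistoryNode:
    __slots__ = ("name", "item", "parent")

    def __init__(self, name, item, parent=None):
        self.name = name
        self.item = item
        self.parent = parent

    def fork(self, name, item):
        # O(1) branch: the new node shadows `name` for this branch only.
        return HistoryNode(name, item, parent=self)

    def lookup(self, name):
        # Walk up the chain; the nearest (most recent) binding wins.
        node = self
        while node is not None:
            if node.name == name:
                return node.item
            node = node.parent
        raise KeyError(name)


root = HistoryNode("users", {"id": 1})
branch_a = root.fork("posts", {"post_id": 1})
branch_b = root.fork("posts", {"post_id": 2})
# Each branch sees its own upstream view; the root node is shared,
# never copied.
```

Here `branch_a` and `branch_b` resolve `"posts"` to different items while both still see the shared `"users"` entry, which is exactly the branch-scope consistency described above.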