
Feature: Add resource history support to access parent data

Open · ArneDePeuter opened this issue 3 months ago · 3 comments

Feature description

Currently, when building resource/transformer pipelines in dlt, referencing parent data requires explicitly passing it down through every child transformer. This often leads to parent fields being unnecessarily propagated into downstream records, which complicates data consistency and pipeline logic.

The proposed feature introduces history, a special parameter available in resources/transformers, that provides access to any yielded data point from a parent resource/transformer, as long as that parent has been defined with keep_history=True.

This enables direct access to parent records without having to explicitly pass them down the pipeline.

Are you a dlt user?

Yes, I'm already a dlt user.

Use case

Consider the following situation:

@dlt.resource(keep_history=True)
def users():
    yield {"id": 1, "name": "Alice"}
    yield {"id": 2, "name": "Bob"}

@dlt.transformer(data_from=users)
def posts(user):
    yield {"post_id": 1, "text": "Hello"}
    yield {"post_id": 2, "text": "World"}

@dlt.transformer(data_from=posts)
def comments(post, history: dlt.History = None):
    # Suppose a request suddenly needs the user id because of poor API design.
    # Instead of propagating user_id through the JSON and adding data complexity,
    # we can retrieve it from the history.
    user = history[users]
    yield {"comment_id": 10, "post_id": post["post_id"], "user_id": user["id"]}

Without history, the user object would need to be passed through posts → comments, cluttering the downstream schema with redundant fields.

With history, transformers can reference upstream records directly, improving data consistency and schema cleanliness.

Proposed solution

  • Introduce a history object injected into transformers when declared as a parameter:

    def transformer_fn(item, history: dlt.History = None): ...
    
    • history is declared the same way as the existing meta parameter
  • history allows access to parent resources/transformers by either:

    • resource reference: history[users]
    • string key/name: history["users"]
  • Access is only available if the parent was defined with keep_history=True. Otherwise, an InvalidHistoryAccess exception is raised.

  • EMPTY_HISTORY is provided when no history is available.

  • Attempting to access a non-existent parent raises InvalidHistoryAccess.

  • The feature is optional and does not break existing transformers.
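The access semantics above can be sketched as follows. This is a minimal illustration of the proposal, not existing dlt code: the `History`, `InvalidHistoryAccess`, and `EMPTY_HISTORY` names come from this issue and do not exist in dlt today.

```python
class InvalidHistoryAccess(Exception):
    """Raised when accessing a parent that was not kept or does not exist."""


class History:
    """Read-only mapping of {resource_name: last_yielded_item}."""

    def __init__(self, items=None):
        self._items = dict(items or {})

    def __getitem__(self, key):
        # Accept either a resource object (anything with a .name) or a plain string.
        name = getattr(key, "name", key)
        try:
            return self._items[name]
        except KeyError:
            raise InvalidHistoryAccess(
                f"no history for {name!r}; was the parent defined with keep_history=True?"
            ) from None


# Sentinel provided when no ancestor kept history.
EMPTY_HISTORY = History()
```

Making every failure mode raise the same `InvalidHistoryAccess` keeps the contract simple: a lookup either returns a kept record or raises, whether the parent forgot `keep_history=True` or never existed.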

Related issues

No response

— ArneDePeuter, Oct 01 '25 13:10

Hi @ArneDePeuter! Thanks for the thoughtful issue and pull request. I reviewed the PR and the code quality is great. Thanks for adding tests too! This is a significant feature that we'll need to discuss with the engineering team.

Problem

This is relevant when using dlt.resource and dlt.transformer for processing data before the Load step. Currently, dlt is limiting because the yield statement of a resource/transformer couples two things:

  1. the data written to the destination table
  2. the data passed to the immediate children (i.e., next transformer)

This is a problem I've also encountered when doing RAG ingestion with dlt and having to manage references between docs, chunks, embeddings, etc.

Suggested solution

Create a History key-value mapping {resource_name: item} (the item could be a record or a batch) for the duration of the Pipe being processed. This allows a downstream @dlt.transformer to access its ancestors' history.

Conceptually, we're effectively going from pipe processing to a DAG. This has significant implications for guarantees and optimization. An alternative API for the example use case would be:

@dlt.resource
def users(): ...

@dlt.transformer(data_from=[users])
def posts(user): ...

@dlt.transformer(data_from=[users, posts])
def comments(user, posts): ...

(This is exactly how Apache Hamilton works)
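The wiring style above can be illustrated in plain Python: each function's parameter names refer to the functions it depends on, and a small resolver walks the graph. This is a toy sketch of the dependency-by-parameter-name idea, not Hamilton's actual implementation.

```python
import inspect


def resolve(name, funcs, cache=None):
    """Toy DAG resolver: parameter names are looked up as other functions."""
    cache = {} if cache is None else cache
    if name in cache:
        return cache[name]
    fn = funcs[name]
    # Recursively compute each dependency before calling this node.
    args = {p: resolve(p, funcs, cache) for p in inspect.signature(fn).parameters}
    cache[name] = fn(**args)
    return cache[name]


def users():
    return [{"id": 1, "name": "Alice"}]


def posts(users):
    return [{"post_id": 1, "user_id": u["id"]} for u in users]


def comments(users, posts):
    # Both ancestors are available directly; no history object is needed.
    return [{"comment_id": 10, "post_id": p["post_id"]} for p in posts]


funcs = {f.__name__: f for f in (users, posts, comments)}
result = resolve("comments", funcs)
```

The contrast with the `history` proposal is that here the dependencies are scheduling edges: `comments` cannot run until both `users` and `posts` have produced results.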

Comments / caveats

The motivating use case is a bit simple and would probably be handled decently by dlt's automated normalization. @ArneDePeuter do you have any other scenario where this feature is useful? You can give a short description or mock-up code.

State generally adds a lot of complexity and makes things harder to debug. We need to have very clear specifications of what is stored in History and for how long.

  • History can grow large if a resource/transformer yields a batch or a pyarrow table.
  • Right now, we can have ResourceA -> TransformB and ResourceA -> TransformC running in parallel. Adding a History reference to A means either
    1. that B and C need to maintain their own copy of the History, which worsens the memory problem,
    2. OR that B and C block until both complete (I don't know if this is how it currently works).
    • In either case, pipeline.run() is always limited by the slowest transform.

Next steps

Let's keep the discussion about the feature here before discussing implementation directly on the PR.

— zilto, Oct 01 '25 14:10

Hi @zilto, Thank you for the quick and nice response!


Conceptually, we're effectively going from pipe processing to a DAG. This has significant implications for guarantees and optimization. ... (This is exactly how Apache Hamilton works)

I agree the motivating API is DAG-like. My intent with history is to avoid changing dlt’s core tree scheduling/execution model while still addressing the coupling between “what is written” and “what is passed downstream”.

Why history (minimal surface change)

  • history is a scoped, read-only key→value mapping of ancestor outputs on the current path (e.g., {resource_name: last_item_or_handle}).
  • It does not introduce new scheduling edges or cross-branch coordination; execution remains tree-based. This keeps guarantees/optimizations intact and avoids a breaking shift to a full DAG API.
  • It solves the common pain where downstream transforms need context from ancestors without polluting payloads or refactoring signatures.
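Injection could mirror how the existing `meta` parameter is detected: inspect the transformer's signature once and pass `history` only to functions that declare it. A sketch under that assumption (the actual wiring inside dlt's pipe machinery would differ):

```python
import inspect


def wants_history(fn) -> bool:
    """True if the transformer declares a 'history' parameter."""
    return "history" in inspect.signature(fn).parameters


def call_transformer(fn, item, history):
    # Only pass history to transformers that opted in; everything else is
    # called exactly as before, so the feature is backward compatible.
    if wants_history(fn):
        return fn(item, history=history)
    return fn(item)


def legacy(item):
    return item


def aware(item, history=None):
    return (item, history)
```

Because `wants_history` is a pure signature check, existing transformers never see the new argument and nothing in their behavior changes.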

@ArneDePeuter do you have any other scenario where this feature is useful?

I haven’t used dlt closely in a while, but I clearly recall this being a personal struggle when I did. Because I’m not working with it day-to-day right now it’s harder for me to pull a fresh real-world scenario (I hope that’s okay 🙂), but I do remember running into exactly the same issue you described.


We need to have very clear specifications of what is stored in History and for how long.

Good point! I agree that memory management rules are a nice addition.


On the parallelism concern: since execution is tree-structured and resource names are unique, sibling transformers (TransformB, TransformC) can safely share the same parent history reference without duplication or blocking, because each sibling has its own separate branch.


I’m open to further feedback and discussion; I like where this is heading and I'm looking forward to the team’s thoughts.

— ArneDePeuter, Oct 01 '25 14:10

Nuance on branching/parallelism

  • Parallel siblings are safe: since resource names are unique, sibling transformers can share the same parent history reference without duplication or blocking.

  • Fan-out cases (one upstream item → multiple downstream items) require each branch to maintain a consistent view of its upstream state. Two possible approaches:

    • Option A (dict-based): fork by copying the mapping at collision points. This makes lookups simple, but increases memory usage.
    • Option B (tree-based): represent history as a chain of nodes; forking just creates a child node for the override. This avoids dict copies, but by default lookups walk up the parent chain. (Optionally, a lightweight lookup table can be added for O(1) access at the cost of extra memory per branch.)

Both approaches preserve branch-scope consistency. This is an implementation detail rather than feature design, but I wanted to surface the trade-offs I noticed.
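Option B could look like the following sketch: forking creates a child node in O(1) with no dict copy, and lookups walk up the parent chain, so each branch sees its own nearest override. Hypothetical names, not proposed dlt code.

```python
from typing import Optional


class HistoryNode:
    """Linked history: forking is O(1); lookup walks up the parent chain."""

    __slots__ = ("name", "item", "parent")

    def __init__(self, name, item, parent: Optional["HistoryNode"] = None):
        self.name, self.item, self.parent = name, item, parent

    def fork(self, name, item) -> "HistoryNode":
        # No dict copy: the child just points back at this node.
        return HistoryNode(name, item, parent=self)

    def lookup(self, name):
        node = self
        while node is not None:
            if node.name == name:
                return node.item  # nearest entry on this branch wins
            node = node.parent
        raise KeyError(name)


# Two downstream branches share the same root without copying it.
root = HistoryNode("users", {"id": 1})
branch_a = root.fork("posts", {"post_id": 1})
branch_b = root.fork("posts", {"post_id": 2})
```

The trade-off is exactly as described: lookup cost grows with chain depth, which a per-branch lookup table could flatten to O(1) at the cost of extra memory.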

— ArneDePeuter, Oct 01 '25 20:10