feat(relation): clarify `dlt.Relation` lifecycle
Summary
Currently, `dlt.Relation` is coupled to `pipeline.run()` in a way that makes the two execution paths below non-equivalent:
```python
@dlt.transformation
def foo(dataset: dlt.Dataset):
    yield dataset.table("bar")

# retrieve relation from transformation, then execute and get data
result_rel: dlt.Relation = list(foo(dataset))[0]
result_rel.df()

# execute transformation via pipeline.run(), then get data
pipeline.run([foo])
result2_rel = pipeline.dataset().table("foo")
result2_rel.df()
```
This behavior is inconsistent with `@dlt.resource`, where calling `list(resource)` gathers records in memory and adds the `_dlt_id` and `_dlt_load_id` fields.
This happens because `dlt.Pipeline.run()` goes through Extract, Normalize, Load, and the Normalize step adds the `_dlt_id` and `_dlt_load_id` columns.
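To make the discussion concrete, here is a minimal, purely illustrative sketch of what the Normalize step conceptually does to each record (the function name is hypothetical; the real normalizer does much more, e.g. identifier normalization):

```python
import uuid

def add_dlt_columns(rows, load_id):
    # Illustrative stand-in for Normalize: stamp each record with a
    # unique row id and the id of the load that produced it.
    for row in rows:
        yield {**row, "_dlt_id": uuid.uuid4().hex, "_dlt_load_id": load_id}

records = [{"value": 1}, {"value": 2}]
normalized = list(add_dlt_columns(records, load_id="1700000000.123"))
# every record now carries _dlt_id and _dlt_load_id
```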
Motivation
We want to be able to define `@dlt.transformation` interactively without requiring the heavy `dlt.Pipeline.run()` machinery on each execution.
The column `_dlt_id` exists by default on all tables produced by dlt and uniquely identifies records. This is useful for transformations when:
- identifying rows for data quality checks
- implementing incremental strategies
- performing smart joins
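As a toy sketch of the incremental use case (all names here are illustrative, not dlt APIs): a stable per-row `_dlt_id` lets you cheaply detect which rows are new relative to a previous snapshot.

```python
import uuid

def make_rows(n):
    # fabricate rows that already carry a unique _dlt_id
    return [{"_dlt_id": uuid.uuid4().hex, "value": i} for i in range(n)]

previous = make_rows(3)
current = previous + make_rows(2)  # two new rows arrived since last run

seen = {r["_dlt_id"] for r in previous}
new_rows = [r for r in current if r["_dlt_id"] not in seen]
# only the two newly arrived rows remain in new_rows
```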
Intended behavior
- `list(transformation)[0]` should include the `_dlt_id` and `_dlt_load_id`, just like `list(resource)` does
  - Currently, transformation outputs are all treated like root tables; they have their own `_dlt_id` and `_dlt_load_id` with no parent.
  - This could change in the future (e.g., if we generate dimensions and facts), but is good for now
- Executing a `dlt.Relation` via `.df()`, `.fetchall()`, etc. shouldn't add `_dlt_id` nor `_dlt_load_id`. Relations are a more generic mechanism than transformations
- Methods and features related to `dlt.Dataset` shouldn't mutate queries to include / generate `_dlt_id` nor `_dlt_load_id`
Potential solution
The internals of `@dlt.transformation` (e.g., `make_transformation_function`) produce a `dlt.Relation`. The code should be changed so that `dlt.Relation` generates the `_dlt_id` and `_dlt_load_id` columns itself, instead of this being done in the Normalize step of `pipeline.run()`.
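One possible shape for this (a sketch only; the helper name is hypothetical and the SQL is duckdb-flavoured) is a query rewrite that wraps the relation's query in a subquery adding the two columns, mirroring what the normalizer does today:

```python
def with_dlt_columns(query: str, load_id: str) -> str:
    # Wrap the relation's query in a subquery that adds _dlt_id and
    # _dlt_load_id; uuid() is duckdb's UUID-generating function.
    return (
        "SELECT *, uuid() AS _dlt_id, "
        f"'{load_id}' AS _dlt_load_id FROM ({query}) AS _q"
    )

rewritten = with_dlt_columns("SELECT * FROM bar", "1700000000.123")
print(rewritten)
```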
question: can I have a list of the things that happen at Normalization for `@dlt.transformation`?
A little background:
- listing a regular resource does not add `_dlt_id` and does not normalize any identifiers to fit the destination, etc. All of these things happen in the normalizer. The resource yields data as seen in the data source, keeping the namespaces and original structure.
- the `transformation` resource does the same thing. It does not do any destination-specific query rewrites (like normalizing identifiers and quoting, but also adding `_dlt_id`, which is destination specific). What it yields is a query that obtains the source data; the normalizer is responsible for rewriting it (wrapping it in a subquery) in a manner that does the above.
Still, I think what you need makes a lot of sense: EL is not T. In your fast-prototyping workflow we can assume you are always implicitly bound to a duckdb-like destination, and we could augment the relation you get via `list(transformation)` or `transformation.relation` with `_dlt_id`, `_dlt_load_id`, and the other things the normalizer would do.
note: since we use a separate Resource class for transformations, we can implement a custom `__iter__` on it that does exactly what we want
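A minimal sketch of that `__iter__` idea (class name and internals hypothetical, not the actual dlt resource class): iterating the transformation resource executes the underlying query and stamps each row on the way out.

```python
import uuid

class TransformationResource:
    # Toy stand-in for a transformation resource class.
    def __init__(self, fetch_rows, load_id):
        self._fetch_rows = fetch_rows  # callable returning the relation's rows
        self._load_id = load_id

    def __iter__(self):
        # augment each row with the columns the normalizer would add
        for row in self._fetch_rows():
            yield {**row, "_dlt_id": uuid.uuid4().hex, "_dlt_load_id": self._load_id}

res = TransformationResource(lambda: [{"x": 1}, {"x": 2}], "load-1")
rows = list(res)  # each row now carries _dlt_id and _dlt_load_id
```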
@zilto pls tell me if this is correct thinking? If so, then I think this improvement is pretty doable.
@rudolfix That all makes sense. I was also thinking we could modify the "transformation resource spec" class.
> listing the regular resource does not add `_dlt_id`

I got something wrong then; will go back and explore.
At this stage, the main outcome I'm looking for is documentation for developers.
Source of truth should be code + tests, but it's useful to have a "reference manual" for high-level behavior and assumptions: for example, the lifecycle of extract -> normalize -> load, when credentials are resolved, etc.