feat(relation): clarify `dlt.Relation` lifecycle
Summary
Currently, `dlt.Relation` is coupled to `pipeline.run()` in a way that makes the two execution paths below non-equivalent:
```python
@dlt.transformation
def foo(dataset: dlt.Dataset):
    yield dataset.table("bar")

# retrieve relation from transformation, then execute and get data
result_rel: dlt.Relation = list(foo(dataset))[0]
result_rel.df()

# execute transformation via pipeline.run(), then get data
pipeline.run([foo])
result2_rel = pipeline.dataset().table("foo")
result2_rel.df()
```
This behavior is inconsistent with `@dlt.resource`, where calling `list(resource)` gathers records in memory and adds the `_dlt_id` and `_dlt_load_id` fields.
This happens because `dlt.Pipeline.run()` goes through Extract, Normalize, Load, and the Normalize step adds the `_dlt_id` and `_dlt_load_id` columns.
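To make the discussion concrete, here is a minimal, purely illustrative sketch of what the Normalize step conceptually does to each record (the function name is hypothetical; the real normalizer does much more, e.g. identifier normalization):

```python
import uuid

def add_dlt_columns(rows, load_id):
    # Illustrative stand-in for Normalize: stamp each record with a
    # unique row id and the id of the load that produced it.
    for row in rows:
        yield {**row, "_dlt_id": uuid.uuid4().hex, "_dlt_load_id": load_id}

records = [{"value": 1}, {"value": 2}]
normalized = list(add_dlt_columns(records, load_id="1700000000.123"))
# every record now carries _dlt_id and _dlt_load_id
```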
Motivation
We want to be able to define `@dlt.transformation` interactively without requiring the heavy `dlt.Pipeline.run()` machinery on each execution.
The column `_dlt_id` exists by default on all tables produced by dlt and uniquely identifies records. This is useful for transformations when:
- identifying rows for data quality checks
- implementing incremental strategies
- performing smart joins
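As a toy sketch of the incremental use case (all names here are illustrative, not dlt APIs): a stable per-row `_dlt_id` lets you cheaply detect which rows are new relative to a previous snapshot.

```python
import uuid

def make_rows(n):
    # fabricate rows that already carry a unique _dlt_id
    return [{"_dlt_id": uuid.uuid4().hex, "value": i} for i in range(n)]

previous = make_rows(3)
current = previous + make_rows(2)  # two new rows arrived since last run

seen = {r["_dlt_id"] for r in previous}
new_rows = [r for r in current if r["_dlt_id"] not in seen]
# only the two newly arrived rows remain in new_rows
```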
Intended behavior
- `list(transformation)[0]` should include the `_dlt_id` and `_dlt_load_id`, just like `list(resource)` does
  - Currently, transformation outputs are all treated like root tables; they have their own `_dlt_id` and `_dlt_load_id` with no parent.
  - This could change in the future (e.g., if we generate dimensions and facts), but is good for now
- Executing a `dlt.Relation` via `.df()`, `.fetchall()`, etc. shouldn't add `_dlt_id` nor `_dlt_load_id`. Relations are a more generic mechanism than transformations
- Methods and features related to `dlt.Dataset` shouldn't mutate queries to include / generate `_dlt_id` nor `_dlt_load_id`
Potential solution
The internals of `@dlt.transformation` (e.g., `make_transformation_function`) produce a `dlt.Relation`. The code should be changed so that `dlt.Relation` generates the `_dlt_id` and `_dlt_load_id` columns itself, instead of this being done in the Normalize step of `pipeline.run()`.
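One possible shape for this (a sketch only; the helper name is hypothetical and the SQL is duckdb-flavoured) is a query rewrite that wraps the relation's query in a subquery adding the two columns, mirroring what the normalizer does today:

```python
def with_dlt_columns(query: str, load_id: str) -> str:
    # Wrap the relation's query in a subquery that adds _dlt_id and
    # _dlt_load_id; uuid() is duckdb's UUID-generating function.
    return (
        "SELECT *, uuid() AS _dlt_id, "
        f"'{load_id}' AS _dlt_load_id FROM ({query}) AS _q"
    )

rewritten = with_dlt_columns("SELECT * FROM bar", "1700000000.123")
print(rewritten)
```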
question: can I have a list of the things that happen at Normalization for `@dlt.transformation`?
A little background:
- listing a regular resource does not add `_dlt_id` and does not normalize any identifiers to fit the destination, etc. All of these things happen in the normalizer. The resource yields data as seen in the data source, keeping the namespaces and original structure.
- the `transformation` resource does the same thing. It does not do any destination-specific query rewrites (like normalizing identifiers and quoting, but also adding `_dlt_id`, which is destination specific). What it yields is a query that obtains the source data; the normalizer is responsible for rewriting it (wrapping it in a subquery) in a manner that does the above.
Still, I think what you need makes a lot of sense: EL is not T. In your fast-prototyping workflow we can assume you are always implicitly bound to a duckdb-like destination, and we could augment the relation you get via `list(transformation)` or `transformation.relation` with `_dlt_id`, `_dlt_load_id`, and the other things the normalizer would do.
note: since we use a separate Resource class for transformations, we can implement a custom `__iter__` on it that does exactly what we want
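A minimal sketch of that `__iter__` idea (class name and internals hypothetical, not the actual dlt resource class): iterating the transformation resource executes the underlying query and stamps each row on the way out.

```python
import uuid

class TransformationResource:
    # Toy stand-in for a transformation resource class.
    def __init__(self, fetch_rows, load_id):
        self._fetch_rows = fetch_rows  # callable returning the relation's rows
        self._load_id = load_id

    def __iter__(self):
        # augment each row with the columns the normalizer would add
        for row in self._fetch_rows():
            yield {**row, "_dlt_id": uuid.uuid4().hex, "_dlt_load_id": self._load_id}

res = TransformationResource(lambda: [{"x": 1}, {"x": 2}], "load-1")
rows = list(res)  # each row now carries _dlt_id and _dlt_load_id
```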
@zilto pls tell me if this is correct thinking? If so, then I think this improvement is pretty doable.
@rudolfix That all makes sense. I was also thinking we could modify the "transformation resource spec" class.
> listing the regular resource does not add `_dlt_id`

I got something wrong then; will go back and explore.
At this stage, the main outcome I'm looking for is documentation for developers.
Source of truth should be code + tests, but it's useful to have a "reference manual" for high-level behavior and assumptions: for example, the lifecycle of extract -> normalize -> load, when credentials are resolved, etc.