dlt
dlt copied to clipboard
allow to model table and column references in dlt schema
Background
dlt
does not allow to correctly model relationships between tables where columns of certain table refer to column of other table (TABLE REFERENCE).
Currently we support only NESTED REFERENCES created with parent
table-level hint. It is used to link tables generated by unnesting json documents. We do this by matching "_dlt_id" (unique
hint) and _parent_dlt_id
(foreign key on child table) and parent
table hint. We also introduced a special kind of foreign_key
called root_key
which always references top level table.
We need to correct NESTED REFERENCES to completely decouple it from a regular TABLE REFERENCE so both can live along each other but also could be merged into one (bring your own _dlt_id)
changes: NESTED REFERENCE will be modeled like this:
- we keep
parent
table hint - we add
row_key
column hint - ( a form ofprimary_key
) that identifies row of data (currently_dlt_id
) - we add
parent_key
column key - (a form offoreign_key
) (currently_parent_dlt_id
) -
root_key
is implemented as it should - we replace
foreign_key
withparent_key
.foreign_key
will be removed fromdlt
(breaking) - existing schemas will be migrated as described in the requirements.
TABLE REFERENCE - we add a new type of table hint which is a list of:
- referenced_table - name of the referenced table
- columns (list of columns in table holding the hint)
- referenced_columns (list of columns in references tables)
For properly normalized schema we'll be able to infer parent-child relationships, cardinality etc. (by looking at primary key on both sides)
Requirements
-
- [x] apply
row_key
when creating_dlt_id
columns in normalizer (and everywhere else: we do this for arrow in extract). keepunique
though
- [x] apply
-
- [x] apply
parent_key
instead offoreign_key
for_dlt_parent_id
- [x] apply
-
- [x] fix all SQL merge code that relies on
unique
hint to matchroot_key
to_dlt_id
. this code will take first unique column and think it is_dlt_id
so this is wrong
- [x] fix all SQL merge code that relies on
-
- [ ] (optional) - bring your own "row_key". if column definition contains this hint on a column use this column instead of
_dlt_id
.
- [ ] (optional) - bring your own "row_key". if column definition contains this hint on a column use this column instead of
-
- [ ] add new hint
reference
to create column references. we should add it to resource decorator, apply and make hints
- [ ] add new hint
-
- [ ] table references must be normalized, present in
compute_schema
and merged when merging tables
- [ ] table references must be normalized, present in
-
- [ ] extract foreign key constraints from SQLAlchemy (sql_database) and emit them with the tables. Skip in minimum reflection level
migrations
-
- [x] add / replace hints when loading schemas. new schema engine and migration needed. make sure to instantiate naming convention to detect "_dlt_id" and "_parent_dlt_id". you may combine it with #1673 to migrate schemas only once
-
- [x] replace
foreign_key
withparent_key
on"_parent_dlt_id
columns. removeforeign_key
from any other columns
- [x] replace
-
- [x] any explicit usage of
foreign_key
will result in an error
- [x] any explicit usage of
Implementation details
Here's proposed reference
hint schema:
class TableReference(TypedDict):
columns: Sequence[str]
referenced_table: str
referenced_columns: Sequence[str]
basic validation must make sure that number of columns and referenced_columns is the same. we are not validating if foreign table and columns exist in dlt core. downstream systems should do that. they should also assume that columns are paired in order of appearing in their respective lists.
bring your own key may be tricky:
- dynamic propagation via hint, not via column name
- find and unify all places where we add dlt id to arrow tables what is cool about it: we can use the same column for primary and row key
Please split remaining work in 2 PRs:
PR 1. implementation of reference
table hint (5-7)
PR 2. implement "bring your own dlt id" (4) (may be implemented in #1317 which modifies the same code)