dlt icon indicating copy to clipboard operation
dlt copied to clipboard

allow to model table and column references in dlt schema

Open rudolfix opened this issue 6 months ago • 0 comments

Background dlt does not allow to correctly model relationships between tables where columns of certain table refer to column of other table (TABLE REFERENCE).

Currently we support only NESTED REFERENCES created with parent table-level hint. It is used to link tables generated by unnesting json documents. We do this by matching "_dlt_id" (unique hint) and _parent_dlt_id (foreign key on child table) and parent table hint. We also introduced a special kind of foreign_key called root_key which always references top level table.

We need to correct NESTED REFERENCES to completely decouple it from a regular TABLE REFERENCE so both can live along each other but also could be merged into one (bring your own _dlt_id)

changes: NESTED REFERENCE will be modeled like this:

  • we keep parent table hint
  • we add row_key column hint - ( a form of primary_key) that identifies row of data (currently _dlt_id)
  • we add parent_key column key - (a form of foreign_key) (currently _parent_dlt_id)
  • root_key is implemented as it should
  • we replace foreign_key with parent_key. foreign_key will be removed from dlt (breaking)
  • existing schemas will be migrated as described in the requirements.

TABLE REFERENCE - we add a new type of table hint which is a list of:

  • referenced_table - name of the referenced table
  • columns (list of columns in table holding the hint)
  • referenced_columns (list of columns in references tables)

For properly normalized schema we'll be able to infer parent-child relationships, cardinality etc. (by looking at primary key on both sides)

Requirements

    • [x] apply row_key when creating _dlt_id columns in normalizer (and everywhere else: we do this for arrow in extract). keep unique though
    • [x] apply parent_key instead of foreign_key for _dlt_parent_id
    • [x] fix all SQL merge code that relies on unique hint to match root_key to _dlt_id. this code will take first unique column and think it is _dlt_id so this is wrong
    • [ ] (optional) - bring your own "row_key". if column definition contains this hint on a column use this column instead of _dlt_id.
    • [ ] add new hint reference to create column references. we should add it to resource decorator, apply and make hints
    • [ ] table references must be normalized, present in compute_schema and merged when merging tables
    • [ ] extract foreign key constraints from SQLAlchemy (sql_database) and emit them with the tables. Skip in minimum reflection level

migrations

    • [x] add / replace hints when loading schemas. new schema engine and migration needed. make sure to instantiate naming convention to detect "_dlt_id" and "_parent_dlt_id". you may combine it with #1673 to migrate schemas only once
    • [x] replace foreign_key with parent_key on "_parent_dlt_id columns. remove foreign_key from any other columns
    • [x] any explicit usage of foreign_key will result in an error

Implementation details Here's proposed reference hint schema:

class TableReference(TypedDict):
    columns: Sequence[str]
    referenced_table: str
    referenced_columns: Sequence[str]

basic validation must make sure that number of columns and referenced_columns is the same. we are not validating if foreign table and columns exist in dlt core. downstream systems should do that. they should also assume that columns are paired in order of appearing in their respective lists.

bring your own key may be tricky:

  1. dynamic propagation via hint, not via column name
  2. find and unify all places where we add dlt id to arrow tables what is cool about it: we can use the same column for primary and row key

Please split remaining work in 2 PRs: PR 1. implementation of reference table hint (5-7) PR 2. implement "bring your own dlt id" (4) (may be implemented in #1317 which modifies the same code)

rudolfix avatar Aug 21 '24 21:08 rudolfix