Simplify schema modification of child tables
Feature description
You should be able to use apply_hints on child tables just like you can on the parent tables. Right now there's no way to modify the schema of child tables without explicitly doing so in a schema.yml
Are you a dlt user?
Yes, I'm already a dlt user.
Use case
When child tables are generated all of the field types are text . This makes them essentially unusable for analysis/aggregation without further transformations. I also can't partition them.
Slack conversation: https://dlthub-community.slack.com/archives/C04DQA7JJN6/p1721944874168369
Proposed solution
No response
Related issues
No response
What might work is to add a path parameter of type List[str] to apply_table_hints which would enable support for setting hints on subtables. The DltResource (or rather the DltResourceHints class) would have to be able to hold multiple TTableSchemas which then will be merged into the full schema at the end of the extraction or during normalization.
Example:
@dlt.resource(table_name="items")
def my_resource():
yield [{
"id": 1,
"sub_items": [
{"id": 11},
{"id": 12}
]
}]
# applies primary key hint on main table
my_resource.apply_hints(primary_key="id")
# applies primary key hin on sub_items table
my_resource.apply_hints(path=["sub_items"], primary_key="id")
# ever further nested subtable
my_resource.apply_hints(path=["sub_items", "subsub_items"], primary_key="id")
Interface proposal (based on path idea above).
- extend the
ResourceHintsto includenested_hintsdictionary that represent nested hints. Note thatDltResourcederives from hints so what I show below
resource = my_resource()
resource.nested_hints["sub_items"] = make_hints(primary_key="id")
resource.nested_hints["sub_items"]["subsub_items"] = make_hints(primary_key="id")
# works like filesystem.walk, where path is a tuple ie ("sub_items", "subsub_items") that may be converted into path
# the operation is recursive
for path, hint in resource.nested_hints.items():
...
- extend the @dlt.resource to accept
nested_hintsie as dictionary {path: hints} or asResourceHintsinstance
Some implementation details:
- There's just one resource here - the root one. All the others are hints that share data with the root. Mind that only root table(s) created may have resource name attached (again: there's just one resource name here)
- There are a few places where we
compute_table_schema. Now we need to do this but also we need to walk over nested hints and generate a table chain that we'll later add to schema - Mind that table schemas are dynamically created. And nested tables must know the name of parent table. You have two options here: or you store the parent hint and compute the parent table name recursively or you pass the previously generated parent table schema to
compute_table_schema(where I think the latter is way more efficient) - You still may allow users to specify
table_nameon the nested hint. If you do so, you'll need to modify the normalizer so it maps paths to those names. IMO this is for another ticket and bigger overhaul of the schema - prevent following to be set on nested table:
parent_table_name: TTableHintTemplate[str] = None,
incremental: TIncrementalConfig = None,
I believe this would also solve the issue described here: https://dlthub-community.slack.com/archives/C04DQA7JJN6/p1732199697666119
Background:
I have a data source that has this data in rows:
{
"id": 1,
"links": [
{
"link": "https://linkedin.com/in/some-person-1234",
"text": "Some Person (Co Founder)"
}
],
"event": {
"id": 3715858079,
"type": "meeting",
"title": "Meeting title",
"allDay": false,
"startTime": "2023-05-09T16:00:00Z",
"endTime": "2023-05-09T17:30:00Z",
"attendees": [
{
"emailAddress": "[email protected]",
"person": {
"id": 147862259,
"firstName": "Will",
"lastName": "Smith",
"primaryEmailAddress": "[email protected]",
"type": "external"
}
},
{
"emailAddress": "[email protected]",
"person": {
"id": 147862258,
"firstName": "Joscha",
"lastName": "Feth",
"primaryEmailAddress": "[email protected]",
"type": "internal"
}
},
{
"emailAddress": "[email protected]",
"person": null
}
]
}
}
What I would like to get from this is:
- a main table containing the row ID as the primary key
- a
eventstable, containing all meetings, using the event.id as the primary key and referencing the row ID as the parent - an
attendeestable, with the custom fields and a link to an existing person in thepersonstable that I provide - a
linkstable, containing all links using a synthesized primary key and referencing thw row ID as the parent
E.g. in the above example, I'd end up with:
main table:
| ID | ... |
|---|---|
| 1 | ... |
links table:
| _synth_ID | parent ID | link | text |
|---|---|---|---|
| 1234 | 1 | https://linkedin.com/in/some-person-1234 | Some Person (Co Founder) |
events table:
| ID | type | title | all_day | ... |
|---|---|---|---|---|
| 3715858079 | meeting | Meeting title | FALSE | ... |
attendees table:
| _synth_ID | parent ID | email_address | person_id |
|---|---|---|---|
| 1234 | 3715858079 | [email protected] | 147862259 |
| 4567 | 3715858079 | [email protected] | 147862258 |
| 8910 | 3715858079 | [email protected] | NULL |
persons table:
| ID | first_name | last_name | primary_email_address | type |
|---|---|---|---|---|
| 147862259 | Will | Smith | [email protected] | external |
| 147862258 | Joscha | Feth | [email protected] | internal |
Marking nested structures with hints about how to split them would work for this I believe.
This feature should be documented:
-
We need a good example (in Examples). my take is: let's go for something advanced: here's a test
test_merge_on_keys_in_schema: this test models the nested json into a set of tables with natural primary keys (not dlt generated ones). it uses some tricks to modify the nested tables. now we should be able to use our nested hints to just decorate resource properly. you are also free to change this test first and run it! the goal of the example is to show how you can custom model your nested tables. -
in Resource: https://dlthub.com/docs/general-usage/resource#define-schema. Please add a subsection on how to add hints to nested table. Keep in mind that people see this early in their user journey so:
- mention that a resource can create nested tables if the input data is nested
- that sometimes you want to change a data type of nested column ie to set it data type to date time. provide a code snippet
- In the same document we introduce
apply_hints. add a similar chapter there and show how to use apply hints to set hints for nested tables
Please implement this :)