pyarrow/pandas: add load id and dlt id in the extract phase and unify the behavior
Background
By default we do not add `load_id` and `dlt_id` to arrow tables. This must be configured explicitly and happens in the normalizer. As a consequence, we need to decompress and rewrite parquet files, which takes a lot of resources.
In this ticket we move this behavior to the extract phase. This goes against the general architecture, but I do not see any other way to do it without rewriting files.
We also unify the behavior, making the relational normalizer follow `ItemsNormalizerConfiguration`.
Implementation
We split this ticket into several PRs.
PR 1.
- [x] add `load_id` in the extract phase
- [x] make sure we do not clash with normalize, which also adds `load_id` (can we remove it from there?)
- [x] we (probably) do not need the logic that adds the columns when writing a file; we can just add them to the existing table
- [x] `ItemsNormalizerConfiguration` must be taken into account. this is probably a breaking change because we need to move it from `normalize` to `extract`, so old settings will stop working. or maybe you'll find a clever solution here :)
PR 2.
Fully unify arrow and relational normalizer. This will also prepare dlt to generate nested (not `json`) data types in the future.
- [ ] add `dlt_id` generation. mind that we have a few ways to generate `dlt_id`, which are found in `relational.py`. the functions that decide on the type of the key that is used are static, so you can extract them
- [ ] when adding `_dlt_id` we must follow table settings and generate `_dlt_id` according to hints (ie. for SCD2, look how `relational.py` generates different hashes). also, we have a fast method to generate content hashes: `add_row_hash_to_table`
- [ ] observe "bring your own hash". if there's a column with a unique hint, do not add a (random) `_dlt_id`. if we have an SCD2-type hash (please see the SCD2 documentation on how to add it) we also skip it
- [ ] observe max nesting level. explode lists and flatten structs, observing the same rules that generate nested types in `relational.py`. see: https://chatgpt.com/share/e/66f06858-dfc8-8012-aff3-223105ddd11b and #1793. the default behavior should be no unnesting for arrow (so it is backward compatible)
- [ ] when we add new columns from pyarrow we should also infer hints like for any new columns. currently schema settings will be ignored (see `_infer_column`, but it must be modified to just infer hints). this, for example, happens in `_compute_table` (extract)
Ideally we'd add `_dlt_id` already in the extract phase and also infer columns properly. un-nesting may happen in normalize (so we have a rewrite there).