dlt
normalize_py_arrow_item now replaces load id column with the right one
Description
This PR addresses the first task in #2493, which says:
fix the arrow normalizer:
normalize_py_arrow_item will not replace the _dlt_load_id column if it exists. If requested, we must replace the existing column with the right load id. This is actually a bug. The Python dict normalizer overwrites load ids, and this is the desired behavior.
Now, the arrow normalizer replaces the load id even if the column exists. An additional test is added to test_arrow_sources.py.
Related Issues
- Relates to #2493
More context
Please read the conversation below.
Deploy Preview for dlt-hub-docs canceled.
| Name | Link |
|---|---|
| Latest commit | 81a531ca2056b430c2f6aedb8106cd5760b48a43 |
| Latest deploy log | https://app.netlify.com/sites/dlt-hub-docs/deploys/6813b9308eb146000872971f |
@rudolfix the following changes are incorporated:
- load id logic completely moved to extract
- add_constant_column in dlt/common/libs/pyarrow.py adjusted to use DictionaryArray
- add_constant_column is then used in the load id adding logic in the extractor
Question: Another reason given for why this makes sense is that we may handle _dlt_id the same way; in fact, it is already handled in the normalize step (we add it when streaming data so we do not overload memory). Should the _dlt_id logic also be moved to the extractor? If this is going in the right direction and the load id is now correctly created at extract, is it expected that _dlt_id is still created at normalize?
@sh-rp test improved with feedback