dlt normalize_py_arrow_item now replaces load id column with the right one

Open anuunchin opened this issue 7 months ago • 2 comments

Description

This PR addresses the first task in #2493, which says:

fix the arrow normalizer: normalize_py_arrow_item will not replace _dlt_load_id column if it exists. if requested, we must replace the existing column with right load id. this is actually a bug. python dict normalizer overwrites load ids and this is the desired behavior

Now, the arrow normalizer replaces the load id even if the column exists. An additional test is added to test_arrow_sources.py.

Related Issues

Relates to #2493

More context

Please read the conversation below.

Apr 15 '25 13:04 anuunchin

Deploy Preview for dlt-hub-docs canceled.

Name	Link
Latest commit	81a531ca2056b430c2f6aedb8106cd5760b48a43
Latest deploy log	https://app.netlify.com/sites/dlt-hub-docs/deploys/6813b9308eb146000872971f

Apr 15 '25 13:04 netlify[bot]

@rudolfix the following changes are incorporated:

load id logic completely moved to extract
add_constant_column in dlt/common/libs/pyarrow.py adjusted to use DictionaryArray
add_constant_column is then used in the load id adding logic in the extractor

Question: considering the other reason why this makes sense (we may handle _dlt_id the same way; in fact, it is already handled in the normalize step, where we add it when streaming data so we do not overload the memory), should the _dlt_id logic also be moved to the extractor? If this is going in the right direction and the load id is correctly created at extract, is it normal that _dlt_id is created at normalize?

@sh-rp test improved with feedback

Apr 16 '25 12:04 anuunchin
