
normalize_py_arrow_item now replaces load id column with the right one

Open anuunchin opened this issue 7 months ago • 2 comments

Description

This PR addresses the first task in #2493, which says:

fix the arrow normalizer: normalize_py_arrow_item will not replace _dlt_load_id column if it exists. if requested, we must replace the existing column with right load id. this is actually a bug. python dict normalizer overwrites load ids and this is the desired behavior

Now, the arrow normalizer replaces the load id even if the column exists. An additional test is added to test_arrow_sources.py.

Related Issues

  • Relates to #2493

More context

Please read the conversation below.

anuunchin avatar Apr 15 '25 13:04 anuunchin

Deploy Preview for dlt-hub-docs canceled.

Name Link
Latest commit 81a531ca2056b430c2f6aedb8106cd5760b48a43
Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/6813b9308eb146000872971f

netlify[bot] avatar Apr 15 '25 13:04 netlify[bot]

@rudolfix the following changes are incorporated:

  • load id logic completely moved to extract
  • add_constant_column in dlt/common/libs/pyarrow.py adjusted to use DictionaryArray
  • add_constant_column is then used in the load id adding logic in the extractor

Question: considering another reason given for why this makes sense ("we may handle _dlt_id the same way. in fact it is already handled in the normalize step (we add it when streaming data so we do not overload the memory)"), should the _dlt_id logic also be moved to the extractor? If this is going in the right direction and the load id is correctly created at extract, is it expected that _dlt_id is still created at normalize?

@sh-rp test improved with feedback

anuunchin avatar Apr 16 '25 12:04 anuunchin