
Fix/1571 Incremental: Optionally load or ignore records with cursor_path missing or None value

Open willi-mueller opened this issue 7 months ago • 4 comments

Description

This PR lets users specify what happens when the value at the incremental cursor path is None or the field is missing from a row. It also unifies the handling of null values in pandas/arrow with that of Python objects.

Consider the following example data where created_at is the incremental cursor path:

```py
data_1 = [
  {"a": 1, "created_at": 1},
  {"a": 2, "created_at": None},
]

data_2 = [
  {"a": 1, "created_at": 1},
  {"a": 2},
]
```

The options are:

  1. incremental(..., on_cursor_value_missing="raise"). This raises IncrementalCursorPathHasValueNone for data_1 and IncrementalCursorPathMissing for data_2.
  2. incremental(..., on_cursor_value_missing="include"). This loads all rows of both data_1 and data_2.
  3. incremental(..., on_cursor_value_missing="exclude"). This loads only the first row of both data_1 and data_2.
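
The three options can be illustrated with a small pure-Python sketch. This is not the dlt implementation: the function `apply_missing_policy` and both exception classes are hypothetical stand-ins, and only top-level keys are treated as cursor paths. It only mirrors the behavior described in the list above.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


class CursorValueNoneError(Exception):
    """Hypothetical stand-in for IncrementalCursorPathHasValueNone."""


class CursorPathMissingError(Exception):
    """Hypothetical stand-in for IncrementalCursorPathMissing."""


def apply_missing_policy(
    rows: List[Row], cursor_path: str, on_cursor_value_missing: str
) -> List[Row]:
    """Return the rows that would be kept under the given policy (illustration only)."""
    if on_cursor_value_missing not in ("raise", "include", "exclude"):
        raise ValueError(f"unknown policy: {on_cursor_value_missing!r}")

    kept: List[Row] = []
    for row in rows:
        if cursor_path not in row:
            error: Exception = CursorPathMissingError(f"{cursor_path} missing in {row}")
        elif row[cursor_path] is None:
            error = CursorValueNoneError(f"{cursor_path} is None in {row}")
        else:
            # The cursor value is present, so the row is always kept.
            kept.append(row)
            continue

        if on_cursor_value_missing == "raise":
            raise error
        if on_cursor_value_missing == "include":
            kept.append(row)
        # "exclude": the row is dropped silently.
    return kept


data_1 = [{"a": 1, "created_at": 1}, {"a": 2, "created_at": None}]
data_2 = [{"a": 1, "created_at": 1}, {"a": 2}]

included = apply_missing_policy(data_1, "created_at", "include")
excluded = apply_missing_policy(data_2, "created_at", "exclude")
```

Here `included` contains both rows of data_1 and `excluded` contains only the first row of data_2, while the "raise" policy fails on the second row of either data set.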

This PR also adds documentation on how to incrementally load data that has None at the cursor path.

All outlined features are implemented and tested for all 4 data formats (object, pandas, arrow-table, arrow-batch). However, JSON path cursors are still supported only for JSON objects, not for arrow and pandas.

Done in collaboration with @francescomucio

TODO

  • [x] docs: explain new parameter
  • [x] docs: explain how we can leverage add_map() to add default values
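
As a rough idea of what the add_map() item refers to, the sketch below shows a per-row function that fills in a default cursor value. The helper name `with_default_cursor` and the default value 0 are assumptions made for illustration. In dlt such a function would be passed to a resource's add_map(), but that wiring, including where the step runs relative to the incremental filter, is not shown here.

```python
from typing import Any, Callable, Dict

Row = Dict[str, Any]


def with_default_cursor(cursor_path: str, default: Any) -> Callable[[Row], Row]:
    """Build a map function that sets `default` when the cursor is missing or None.

    Hypothetical helper: only top-level keys are handled, not JSON paths.
    """

    def _fill(row: Row) -> Row:
        if row.get(cursor_path) is None:
            # Return a copy so the original row is left untouched.
            return {**row, cursor_path: default}
        return row

    return _fill


# The default 0 is an arbitrary placeholder chosen for this example.
fill_created_at = with_default_cursor("created_at", 0)

rows = [
    {"a": 1, "created_at": 1},
    {"a": 2, "created_at": None},
    {"a": 3},
]
filled = [fill_created_at(row) for row in rows]
```

After this step every row carries a cursor value, so none of them would hit the missing-or-None handling described above.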

Related Issues

  • Resolves https://github.com/dlt-hub/dlt/issues/1571

willi-mueller avatar Jul 10 '24 11:07 willi-mueller