Fix/1571 Incremental: Optionally load or ignore records with cursor_path missing or None value
Description
This PR allows users to specify what happens when the value at the incremental cursor path is `None` or the field is missing in a row. It also unifies the handling of null values in pandas/arrow with that of Python objects.
Consider the following example data, where `created_at` is the incremental cursor path:
```py
data_1 = [
{"a": 1, "created_at": 1},
{"a": 2, "created_at": None},
]
data_2 = [
{"a": 1, "created_at": 1},
{"a": 2},
]
```
The options are:
- `incremental(..., on_cursor_value_missing="raise")`. This will raise `IncrementalCursorPathHasValueNone` for `data_1` and `IncrementalCursorPathMissing` for `data_2`.
- `incremental(..., on_cursor_value_missing="include")`. This will load all rows for both `data_1` and `data_2`.
- `incremental(..., on_cursor_value_missing="exclude")`. This will load only the first row for both `data_1` and `data_2`.
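The three options can be illustrated with a small pure-Python sketch. It is not dlt's implementation: `apply_policy`, `CursorValueNone`, and `CursorPathMissing` are hypothetical stand-ins introduced here for illustration, and the sketch only models the handling of missing or `None` cursor values, not the regular incremental filtering on the cursor.

```python
from typing import Any, Dict, List


class CursorValueNone(Exception):
    """Local stand-in for dlt's IncrementalCursorPathHasValueNone."""


class CursorPathMissing(Exception):
    """Local stand-in for dlt's IncrementalCursorPathMissing."""


def apply_policy(
    rows: List[Dict[str, Any]],
    cursor_path: str,
    on_cursor_value_missing: str = "raise",
) -> List[Dict[str, Any]]:
    """Hypothetical helper mimicking the three options for a top-level key."""
    if on_cursor_value_missing not in ("raise", "include", "exclude"):
        raise ValueError(f"unknown option: {on_cursor_value_missing!r}")
    kept = []
    for row in rows:
        if cursor_path not in row:
            problem: Exception = CursorPathMissing(cursor_path)
        elif row[cursor_path] is None:
            problem = CursorValueNone(cursor_path)
        else:
            # Row has a usable cursor value, so it is always kept here.
            kept.append(row)
            continue
        if on_cursor_value_missing == "raise":
            raise problem
        if on_cursor_value_missing == "include":
            kept.append(row)
        # "exclude": the row is dropped silently.
    return kept


data_1 = [{"a": 1, "created_at": 1}, {"a": 2, "created_at": None}]
data_2 = [{"a": 1, "created_at": 1}, {"a": 2}]

included = apply_policy(data_1, "created_at", "include")  # both rows
excluded = apply_policy(data_2, "created_at", "exclude")  # first row only
```

With `"raise"`, the same helper raises `CursorValueNone` for `data_1` and `CursorPathMissing` for `data_2`, mirroring the two exceptions named above.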
This PR also adds documentation on how to incrementally load data that has `None` at the cursor path.
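One approach, in line with the `add_map()` item in the TODO list, is to fill in a default cursor value before the incremental step sees the row. The sketch below shows only the mapping function, in plain Python; `fill_missing_cursor` and the default of `0` are assumptions chosen for illustration, and attaching the function to a dlt resource via `add_map()` (in a position that runs before the incremental step) is not shown.

```python
from typing import Any, Dict


def fill_missing_cursor(
    row: Dict[str, Any],
    cursor_path: str = "created_at",
    default: Any = 0,
) -> Dict[str, Any]:
    """Return the row with a default cursor value if it is missing or None.

    Hypothetical helper: the name and the default value are illustrative.
    """
    if row.get(cursor_path) is None:
        # Copy instead of mutating, so the caller's input stays untouched.
        return {**row, cursor_path: default}
    return row


rows = [
    {"a": 1, "created_at": 1},
    {"a": 2, "created_at": None},
    {"a": 3},
]
patched = [fill_missing_cursor(row) for row in rows]
```

After this step every row carries a comparable cursor value, so neither exception is triggered even with `on_cursor_value_missing="raise"`. The choice of default matters: a value lower than the last stored cursor would typically cause the row to be filtered out on later runs.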
All outlined features are implemented and tested for all four data formats (object, pandas, arrow table, arrow batch). However, JSON path cursors are still supported only for JSON objects, not for arrow or pandas.
Done in collaboration with @francescomucio
TODO
- [x] docs: explain new parameter
- [x] docs: explain how we can leverage `add_map()` to add default values
Related Issues
- Resolves https://github.com/dlt-hub/dlt/issues/1571