dlt
dlt copied to clipboard
Fix/1571 Incremental: Optionally load or ignore records with cursor_path missing or None value
Description
This PR:
Allows users to specify what happens when the value at the incremental cursor path is None or the field is missing in a row. It also unifies the handling of null values in pandas/arrow with python objects.
Consider the following example data where created_at is the incremental cursor path:
```py
data_1 = [
{"a": 1, "created_at": 1},
{"a": 2, "created_at": None},
]
data_2 = [
{"a": 1, "created_at": 1},
{"a": 2},
]
The options are:
incremental(..., on_cursor_value_missing="raise"). This will raiseIncrementalCursorPathHasValueNonefordata_1andIncrementalCursorPathMissingfordata_2.incremental(..., on_cursor_value_missing="include"). This will load all rows for bothdata_1anddata_2respectively.incremental(..., on_cursor_value_missing="exclude"). This will load only the first row for bothdata_1anddata_2respectively.
This PR also adds documentation on how to load data with None at the cursor path incrementally
All outlined features are implemented and tested for all 4 data formats (object, pandas, arrow-table, arrow batch). However, JSON path cursors are still only supported for JSON objects but not for arrow and pandas.
Done in collaboration with @francescomucio
TODO
- [x] docs: explain new parameter
- [x] docs: explain how we can leverage
add_map()to add default values
Related Issues
- Resolves https://github.com/dlt-hub/dlt/issues/1571
Deploy Preview for dlt-hub-docs ready!
| Name | Link |
|---|---|
| Latest commit | 4ce384a610e6e748b72567f86343e34480ac4dee |
| Latest deploy log | https://app.netlify.com/sites/dlt-hub-docs/deploys/66d1b5e64a6fec0008236e69 |
| Deploy Preview | https://deploy-preview-1576--dlt-hub-docs.netlify.app |
| Preview on mobile | Toggle QR Code...Use your smartphone camera to open QR code link. |
To edit notification comments on pull requests, go to your Netlify site configuration.
@willi-mueller this is an important feature, thank you! I have a question: records with None value at the incremental cursor will be re-loaded each time, right? Could maybe some fall-back cursor be specified in this case?
@willi-mueller this is an important feature, thank you! I have a question: records with None value at the incremental cursor will be re-loaded each time, right? Could maybe some fall-back cursor be specified in this case?
Thank you @VioletM. You raise an important question! Yes, each time the None values appears in the source it will be processed. My proposal would be to use add_map() to specify the default fallback cursor.
What do you think about the proposed changes in docs/website/docs/general-usage/incremental-loading.md proposed here: https://github.com/dlt-hub/dlt/pull/1650?
How can we mention your point there too?
@rudolfix Currently, the following test is failing:
tests/pipeline/test_pipeline_extra.py::test_dump_trace_freeze_exception
The reason is that before this PR, the following exception was ignored: AttributeError: 'TestRow' object has no attribute 'get' when trying to access TestRow.get("id"). Instead, the IncrementalCursorPathMissing was raised.
However, in this PR, I wanted to:
- we do not ignore the exception when the record object does not implement
__get__(). - assume that every row implements
keys().
Do you think these two points are appropriate?
Just to let the test suite pass, I restored the old behavior in 06aff12969c5284218f5c73c67c424908f6050d1 passing any exception and using a run-time type check to decide whether to use keys() or hasattr().
Further, the test case is concerned with how dlt handles traces and uses the incremental only to produce a certain trace and it is not testing the incremental implementation.
What would you advise?