dlt icon indicating copy to clipboard operation
dlt copied to clipboard

Fix/1571 Incremental: Optionally load or ignore records with cursor_path missing or None value

Open willi-mueller opened this issue 1 year ago • 4 comments

Description

This PR: Allows users to specify what happens when the value at the incremental cursor path is None or the field is missing in a row. It also unifies the handling of null values in pandas/arrow with python objects.

Consider the following example data where created_at is the incremental cursor path:

```py
data_1 = [
  {"a": 1, "created_at": 1},
  {"a": 2, "created_at": None},
]

data_2 = [
  {"a": 1, "created_at": 1},
  {"a": 2},
]

The options are:

  1. incremental(..., on_cursor_value_missing="raise"). This will raise IncrementalCursorPathHasValueNone for data_1 and IncrementalCursorPathMissing for data_2.
  2. incremental(..., on_cursor_value_missing="include"). This will load all rows for both data_1 and data_2 respectively.
  3. incremental(..., on_cursor_value_missing="exclude"). This will load only the first row for both data_1 and data_2 respectively.

This PR also adds documentation on how to load data with None at the cursor path incrementally

All outlined features are implemented and tested for all 4 data formats (object, pandas, arrow-table, arrow batch). However, JSON path cursors are still only supported for JSON objects but not for arrow and pandas.

Done in collaboration with @francescomucio

TODO

  • [x] docs: explain new parameter
  • [x] docs: explain how we can leverage add_map() to add default values

Related Issues

  • Resolves https://github.com/dlt-hub/dlt/issues/1571

willi-mueller avatar Jul 10 '24 11:07 willi-mueller

Deploy Preview for dlt-hub-docs ready!

Name Link
Latest commit 4ce384a610e6e748b72567f86343e34480ac4dee
Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/66d1b5e64a6fec0008236e69
Deploy Preview https://deploy-preview-1576--dlt-hub-docs.netlify.app
Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site configuration.

netlify[bot] avatar Jul 10 '24 11:07 netlify[bot]

@willi-mueller this is an important feature, thank you! I have a question: records with None value at the incremental cursor will be re-loaded each time, right? Could maybe some fall-back cursor be specified in this case?

VioletM avatar Jul 29 '24 09:07 VioletM

@willi-mueller this is an important feature, thank you! I have a question: records with None value at the incremental cursor will be re-loaded each time, right? Could maybe some fall-back cursor be specified in this case?

Thank you @VioletM. You raise an important question! Yes, each time the None values appears in the source it will be processed. My proposal would be to use add_map() to specify the default fallback cursor.

What do you think about the proposed changes in docs/website/docs/general-usage/incremental-loading.md proposed here: https://github.com/dlt-hub/dlt/pull/1650?

How can we mention your point there too?

willi-mueller avatar Jul 30 '24 07:07 willi-mueller

@rudolfix Currently, the following test is failing: tests/pipeline/test_pipeline_extra.py::test_dump_trace_freeze_exception

The reason is that before this PR, the following exception was ignored: AttributeError: 'TestRow' object has no attribute 'get' when trying to access TestRow.get("id"). Instead, the IncrementalCursorPathMissing was raised.

However, in this PR, I wanted to:

  1. we do not ignore the exception when the record object does not implement __get__().
  2. assume that every row implements keys().

Do you think these two points are appropriate?

Just to let the test suite pass, I restored the old behavior in 06aff12969c5284218f5c73c67c424908f6050d1 passing any exception and using a run-time type check to decide whether to use keys() or hasattr().

Further, the test case is concerned with how dlt handles traces and uses the incremental only to produce a certain trace and it is not testing the incremental implementation.

What would you advise?

willi-mueller avatar Aug 08 '24 16:08 willi-mueller