airbyte icon indicating copy to clipboard operation
airbyte copied to clipboard

[Low-Code CDK] Rremove stream_cursor_field from stream and derive it from stream_slicer

Open brianjlai opened this issue 2 years ago • 1 comments

Closes https://github.com/airbytehq/airbyte/issues/21561

What

For connectors that implement incremental syncs, we currently require that the cursor be specified in two location. stream_cursor_field on DeclarativeStream and cursor_field on various Stream Slicers. This is verbose and often confusing as it is not made clear that a cursors are only used for incremental syncs which are implemented by way of a stream slicer.

How

This PR deprecates the stream_cursor_field field on a DeclarativeStream. Under the hood while constructing declarative components, a Stream will derive the cursor value by looking at -> Stream -> Retriever -> Slicer -> Cursor Field. The schema is modified and relevant constructors.

Where this starts to get a little bit tricky is in how some of the existing connectors were using the stream_cursor_value in some unique ways. This wasn't quite a one size fits all solution so I did not use a migration script. I will annotate some comments to explain my thinking on how I modified some of the connectors now that the field has been deprecated on the stream.

Recommended reading order

  1. declarative_component_schema.yaml
  2. model_to_component_factory.py

brianjlai avatar Feb 02 '23 03:02 brianjlai

@brianjlai We can have more complex flow path then this: Stream -> Retriever -> Slicer -> Cursor Field

like this:

Stream -> Retriever -> CartesianProductStreamSlicer

  • sub_slicer1 (Substream slicer)
  • sub_slicer2 (Datetime slicer) <- Cursor Field is here !!!

grubberr avatar Feb 02 '23 09:02 grubberr

@grubberr Yes that is correct. I tried to account for this using a temporary workaround here: https://github.com/airbytehq/airbyte/pull/22294/files#r1093992449

This was a case I noticed during the migration. In the case where the cartesian stream slicer is defined, I opt to get the first occurrence of a cursor_field when it exists. So far, I have only seen cartesian product stream slicers where only one slicer in the list has a cursor_field and in that scenario we return the first one we find. Does that sound okay to you?

brianjlai avatar Feb 02 '23 22:02 brianjlai

@brianjlai yes I have 2 cases: https://github.com/airbytehq/airbyte/pull/20960 https://github.com/airbytehq/airbyte/pull/21446 for both it has to be ok.

Thank you

grubberr avatar Feb 02 '23 22:02 grubberr

@grubberr Yes, I believe both should work with this approach. Both of your new connectors have custom slicer implementations that inherit CartesianProductStreamSlicer and both have a child DatetimeStreamSlicer which should allow the stream to derive the cursor_field which is passed from the $options. There is another existing connector source-posthog that uses a custom CartesianProductStreamSlicer that I will double check works w/ these changes that should behave similarly to yours.

brianjlai avatar Feb 02 '23 23:02 brianjlai

@grubberr I verified that for source-posthog that the stream's underlying field stream_cursor_field will be populated w/ the value of the date slicer's cursor_value even though it is underneath a cartesian.

brianjlai avatar Feb 03 '23 00:02 brianjlai