airbyte
airbyte copied to clipboard
[Low-Code CDK] Rremove stream_cursor_field from stream and derive it from stream_slicer
Closes https://github.com/airbytehq/airbyte/issues/21561
What
For connectors that implement incremental syncs, we currently require that the cursor be specified in two location. stream_cursor_field
on DeclarativeStream and cursor_field
on various Stream Slicers. This is verbose and often confusing as it is not made clear that a cursors are only used for incremental syncs which are implemented by way of a stream slicer.
How
This PR deprecates the stream_cursor_field
field on a DeclarativeStream. Under the hood while constructing declarative components, a Stream will derive the cursor value by looking at -> Stream -> Retriever -> Slicer -> Cursor Field. The schema is modified and relevant constructors.
Where this starts to get a little bit tricky is in how some of the existing connectors were using the stream_cursor_value
in some unique ways. This wasn't quite a one size fits all solution so I did not use a migration script. I will annotate some comments to explain my thinking on how I modified some of the connectors now that the field has been deprecated on the stream.
Recommended reading order
-
declarative_component_schema.yaml
-
model_to_component_factory.py
@brianjlai
We can have more complex flow path then this: Stream -> Retriever -> Slicer -> Cursor Field
like this:
Stream -> Retriever -> CartesianProductStreamSlicer
- sub_slicer1 (Substream slicer)
- sub_slicer2 (Datetime slicer) <- Cursor Field is here !!!
@grubberr Yes that is correct. I tried to account for this using a temporary workaround here: https://github.com/airbytehq/airbyte/pull/22294/files#r1093992449
This was a case I noticed during the migration. In the case where the cartesian stream slicer is defined, I opt to get the first occurrence of a cursor_field when it exists. So far, I have only seen cartesian product stream slicers where only one slicer in the list has a cursor_field
and in that scenario we return the first one we find. Does that sound okay to you?
@brianjlai yes I have 2 cases: https://github.com/airbytehq/airbyte/pull/20960 https://github.com/airbytehq/airbyte/pull/21446 for both it has to be ok.
Thank you
@grubberr
Yes, I believe both should work with this approach. Both of your new connectors have custom slicer implementations that inherit CartesianProductStreamSlicer
and both have a child DatetimeStreamSlicer which should allow the stream to derive the cursor_field
which is passed from the $options
. There is another existing connector source-posthog
that uses a custom CartesianProductStreamSlicer that I will double check works w/ these changes that should behave similarly to yours.
@grubberr
I verified that for source-posthog
that the stream's underlying field stream_cursor_field
will be populated w/ the value of the date slicer's cursor_value
even though it is underneath a cartesian.