beam icon indicating copy to clipboard operation
beam copied to clipboard

Use now + 2mins as the end timestamp for change stream read API if the connector endTimestamp is omitted

Open changliiu opened this issue 7 months ago • 8 comments

V1 change stream can use null end timestamp for the query, however V2 the end timestamp of the query should be NOT NULL, and should be at most 30 mins from the max(now, start_timestamp).

To allow users to still omit the connector endTimestamp field to run the connector forever, but to give a valid endTimestamp when try to query change stream, we set the change stream endTimestamp in this case as now + 2 mins.

This solution works as the Apache beam checkpoints the ReadChangeStreamPartition execution every 5s or 5MB of output data produced. Moreover the change stream query has a hard 1 min deadline.

changliiu avatar May 16 '25 00:05 changliiu

Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment assign set of reviewers

github-actions[bot] avatar May 17 '25 02:05 github-actions[bot]

Checked failed integration test, not relevant to this PR. No need to block the review.

changliiu avatar May 30 '25 17:05 changliiu

assign set of reviewers

changliiu avatar May 30 '25 17:05 changliiu

Assigning reviewers:

R: @m-trieu for label java. R: @nielm for label spanner.

Note: If you would like to opt out of this review, comment assign to next reviewer.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

The PR bot will only process comments in the main thread (not review comments).

github-actions[bot] avatar May 30 '25 17:05 github-actions[bot]

Reminder, please take a look at this pr: @m-trieu @nielm

github-actions[bot] avatar Jun 07 '25 12:06 github-actions[bot]

Assigning new set of reviewers because Pr has gone too long without review. If you would like to opt out of this review, comment assign to next reviewer:

R: @robertwb for label java. R: @nielm for label spanner.

Available commands:

  • stop reviewer notifications - opt out of the automated review tooling
  • remind me after tests pass - tag the comment author after tests pass
  • waiting on author - shift the attention set back to the author (any comment or push by the author will return the attention set to the reviewers)

github-actions[bot] avatar Jun 11 '25 12:06 github-actions[bot]

Did we run a pipeline to test this?

Yes I ran a pipeline containing this change in order to run a connector without specifying end timestamp.

However, I don't record any evidence for that. If you have opinion, I can run again and put evidence somewhere maybe here. Please let me know your thoughts, thanks!

changliiu avatar Jun 13 '25 17:06 changliiu

Just so I understand things correctly:

  1. The change stream v2 partition token does not have an explicit end timestamp
  2. Users however are required to provide an end timestamp to their change stream V2 queries within 30 minutes in the future
  3. We chose 2 minutes, since it is relatively small and > the Dataflow 5s checkpointing.

Was wondering if you could confirm?

nancyxu825 avatar Jun 13 '25 20:06 nancyxu825

Just so I understand things correctly:

  1. The change stream v2 partition token does not have an explicit end timestamp
  2. Users however are required to provide an end timestamp to their change stream V2 queries within 30 minutes in the future
  3. We chose 2 minutes, since it is relatively small and > the Dataflow 5s checkpointing.

Was wondering if you could confirm?

  1. V2 change stream required end_timestamp, and it's cannot be omitted.
  2. However, connector end_timestamp can be skipped. If so, the connector needs to run forever.
  3. Yes.

Please let me know if more details/explanation is needed. Thanks!

changliiu avatar Jun 16 '25 17:06 changliiu

Friendly ping :)

changliiu avatar Jun 16 '25 17:06 changliiu

please fix:

Execution failed for task ':sdks:java:io:google-cloud-platform:spotlessJavaCheck'.
> The following files had format violations:
      sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/spanner/changestreams/action/QueryChangeStreamAction.java
          @@ -299,7 +299,8 @@
           ??}
           
           ??//?Return?(now?+?2?mins)?as?the?end?timestamp?for?reading?change?streams.?This?is?only?used?if
          -??//?users?want?to?run?the?connector?forever.?This?approach?works?because?Google?Dataflow?checkpoints
          +??//?users?want?to?run?the?connector?forever.?This?approach?works?because?Google?Dataflow
          +??//?checkpoints
           ??//?every?5s?or?5MB?output?provided?and?the?change?stream?query?has?deadline?for?1?min.
           ??private?Timestamp?getNextReadChangeStreamEndTimestamp()?{
           ????final?Timestamp?current?=?Timestamp.now();
  Run './gradlew :sdks:java:io:google-cloud-platform:spotlessApply' to fix these violations.

Abacn avatar Jun 18 '25 21:06 Abacn