pinot icon indicating copy to clipboard operation
pinot copied to clipboard

[Real-time Table Replication X clusters][1/n] Creating new tables with designated consuming segments

Open wirybeaver opened this issue 3 months ago • 3 comments

This the first part of a series of changes aimed at enabling real-time table replication between two Pinot clusters.

The core purpose of this specific PR is to introduce the necessary functionality for creating a new real-time table with consuming segment watermarks inducted from source table's ZK metadata and a designated broker / server tenant when performing a table copy operation.

The key changes introduced by the 3 commits are:

  • Copy table with designated tenant and watermarks: Modifies the table copy mechanism to include the specification of a designated tenant and watermarks.
  • Create first consuming segments per partition: Introduces logic to create the initial consuming segments for each partition based on the provided watermarks.
  • Induct the watermark of consuming status: Updates the consuming status and watermarks during the table copy process: Reuse the exsiting pinotHelixResourceManager api which will fetch the currently consuming status per partition. Basically, that api can dump the table's IdealState and output the last segment sequence number per partition by parsing the segment name.
    • If the segment status is IN_PROGRESS, we consider this segment is the CONSUMING segment, thus we copy the startOffset and sequence number as the start consuming position of that partition for new table.
    • If the segment status is DONE, the consuming segment is the one next to it, thus we will use the endOffset and 1 + sequence_number to initialize the consuming segment for new table.

Doesn't support following feature at the moment:

  • upsert / offline table replication.
  • shallow copy of deep segment.
  • auto pause the table if the consuming segment's start offset expired in kafka broker.
  • auto re-upload segments if no deep url in the segment metadata.
  • pauseless table.

Test We have a manual Test in Uber's pinot cluster. Confirm that new consuming segments of the target cluster copy the sequence number (rather than 0) and startOffset per partition when the latest segments in the source table are consuming segments. And the consuming status is Active.

Design Draft

wirybeaver avatar Nov 19 '25 07:11 wirybeaver

Codecov Report

:x: Patch coverage is 37.28814% with 111 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 63.23%. Comparing base (82b7d2a) to head (cb65334). :warning: Report is 22 commits behind head on master.

Files with missing lines Patch % Lines
...oller/api/resources/PinotTableRestletResource.java 19.51% 63 Missing and 3 partials :warning:
...ot/controller/api/resources/CopyTableResponse.java 0.00% 22 Missing :warning:
...ntroller/helix/core/PinotHelixResourceManager.java 40.00% 13 Missing and 2 partials :warning:
...ller/api/resources/PinotRealtimeTableResource.java 0.00% 6 Missing :warning:
...not/controller/api/resources/CopyTablePayload.java 83.33% 2 Missing :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17235      +/-   ##
============================================
- Coverage     63.25%   63.23%   -0.03%     
- Complexity     1475     1477       +2     
============================================
  Files          3162     3167       +5     
  Lines        188668   189074     +406     
  Branches      28869    28918      +49     
============================================
+ Hits         119348   119566     +218     
- Misses        60074    60235     +161     
- Partials       9246     9273      +27     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.20% <37.28%> (-0.02%) :arrow_down:
java-21 63.21% <37.28%> (-0.02%) :arrow_down:
temurin 63.23% <37.28%> (-0.03%) :arrow_down:
unittests 63.23% <37.28%> (-0.03%) :arrow_down:
unittests1 55.57% <100.00%> (-0.03%) :arrow_down:
unittests2 34.03% <36.72%> (+0.03%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov-commenter avatar Nov 19 '25 08:11 codecov-commenter

@Jackie-Jiang @chenboat Could you take a review? The manual integration test gets done already. The purpose of this PR is to copy the schema and table config from the source cluster to the target cluster by replacing the server / broke tenant. And then copy the consuming segment to kick off the message consumption on the new table. The backfill of segments prior to consuming segments will be implemented in a follow up PR

wirybeaver avatar Dec 13 '25 02:12 wirybeaver

The unit test Set-1 fail in java 11 but succeed in java 21. The error is about "selectionCombineOperator Not found issue", which is nothing related to table copy. I just do a rebase without any conflict resolving. Previous tests were all successful.

wirybeaver avatar Dec 19 '25 00:12 wirybeaver

Walk through the design doc with @abhishekbafna on 12/23 and aligned on the direction of the solution.

wirybeaver avatar Jan 01 '26 20:01 wirybeaver

@abhishekbafna I add the dryRun as a boolean flag in the payload. The reply contains schema, modified table config and watermarkResult as well. The write operation (schema and table config addition) is moved to the end.

wirybeaver avatar Jan 07 '26 02:01 wirybeaver

MSE against 1.3 failed but succeed against 1.4 and master

wirybeaver avatar Jan 07 '26 03:01 wirybeaver

thanks for contributing. LGTM, i'd suggest waiting another stamp from @chenboat before ship it!

deemoliu avatar Jan 14 '26 17:01 deemoliu