[Feature] Implement tablet reshard job in FE for tablet splitting

Open • xiangguangyxg opened this issue 3 weeks ago • 13 comments

Why I'm doing:

This PR implements the tablet reshard job functionality in StarRocks FE (Frontend) for tablet splitting and merging operations in shared-data mode.

What I'm doing:

Overview

This commit introduces the SplitTabletJob class and refactors the tablet resharding infrastructure to support tablet splitting and merging in StarRocks' shared-data (lake) mode. The implementation follows a state machine pattern with clear state transitions: PENDING → PREPARING → RUNNING → CLEANING → FINISHED.

Key Changes

New Classes:

  • SplitTabletJob: Core job class implementing the tablet split workflow with 7 states (PENDING, PREPARING, RUNNING, CLEANING, FINISHED, ABORTING, ABORTED)
  • ReshardingPhysicalPartition: Tracks resharding context for a physical partition
  • ReshardingMaterializedIndex: Tracks resharding context for a materialized index
  • TabletRange: Represents the key range of a tablet after splitting
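
To make the idea of a per-tablet key range concrete, here is a minimal Java sketch. It is not the PR's TabletRange: the class name KeyRange, the half-open [lower, upper) convention, and the use of null for an unbounded side are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Objects;

// Hypothetical, simplified key range (not the PR's TabletRange).
// Assumption: the range is half-open, [lower, upper), and null means unbounded.
final class KeyRange<K extends Comparable<K>> {
    final K lower; // inclusive bound, null = no lower bound
    final K upper; // exclusive bound, null = no upper bound

    KeyRange(K lower, K upper) {
        if (lower != null && upper != null && lower.compareTo(upper) >= 0) {
            throw new IllegalArgumentException("lower bound must be below upper bound");
        }
        this.lower = lower;
        this.upper = upper;
    }

    boolean contains(K key) {
        Objects.requireNonNull(key, "key");
        boolean aboveLower = lower == null || lower.compareTo(key) <= 0;
        boolean belowUpper = upper == null || key.compareTo(upper) < 0;
        return aboveLower && belowUpper;
    }

    // Splits this range at `point` into two adjacent ranges that together
    // cover exactly the same keys as the original range.
    List<KeyRange<K>> splitAt(K point) {
        boolean atLowerBound = lower != null && lower.compareTo(point) == 0;
        if (!contains(point) || atLowerBound) {
            throw new IllegalArgumentException("split point must lie strictly inside the range");
        }
        return List.of(new KeyRange<>(lower, point), new KeyRange<>(point, upper));
    }
}
```

Under these assumptions, splitting an unbounded range at key 100 yields one range holding every key below 100 and one holding 100 and above, so each key still belongs to exactly one new tablet.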

Refactored Classes:

  • TabletReshardJob: Converted to abstract base class defining the job lifecycle
  • SplitTabletJobFactory: Updated to create ReshardingPhysicalPartition and ReshardingMaterializedIndex structures
  • PublishTabletsInfo: Simplified to use List<ReshardingTabletInfoPB> instead of separate lists
  • ReshardingTablet interface: Added getOldTabletIds(), getNewTabletIds(), and toProto() methods
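
The interface shape described above can be sketched roughly as follows. Only the three method names (getOldTabletIds(), getNewTabletIds(), toProto()) come from this PR. Every type name ending in Sketch or StandIn is hypothetical, the stand-in replaces the generated ReshardingTabletInfoPB class, and the one-to-one meaning of an "identical" tablet is an assumption.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical stand-in for the union-style ReshardingTabletInfoPB:
// exactly one kind is set per message. Field names are assumptions.
final class ReshardingInfoStandIn {
    enum Kind { SPLITTING, MERGING, IDENTICAL }

    final Kind kind;
    final List<Long> oldTabletIds;
    final List<Long> newTabletIds;

    ReshardingInfoStandIn(Kind kind, List<Long> oldTabletIds, List<Long> newTabletIds) {
        this.kind = kind;
        this.oldTabletIds = List.copyOf(oldTabletIds);
        this.newTabletIds = List.copyOf(newTabletIds);
    }
}

// Mirrors the three methods the PR adds to the ReshardingTablet interface.
interface ReshardingTabletSketch {
    List<Long> getOldTabletIds();
    List<Long> getNewTabletIds();
    ReshardingInfoStandIn toProto();
}

// One old tablet becomes several new tablets.
final class SplittingTabletSketch implements ReshardingTabletSketch {
    private final long oldTabletId;
    private final List<Long> newTabletIds;

    SplittingTabletSketch(long oldTabletId, List<Long> newTabletIds) {
        this.oldTabletId = oldTabletId;
        this.newTabletIds = List.copyOf(newTabletIds);
    }

    public List<Long> getOldTabletIds() { return List.of(oldTabletId); }
    public List<Long> getNewTabletIds() { return newTabletIds; }
    public ReshardingInfoStandIn toProto() {
        return new ReshardingInfoStandIn(
                ReshardingInfoStandIn.Kind.SPLITTING, getOldTabletIds(), getNewTabletIds());
    }
}

// Several old tablets become one new tablet.
final class MergingTabletSketch implements ReshardingTabletSketch {
    private final List<Long> oldTabletIds;
    private final long newTabletId;

    MergingTabletSketch(List<Long> oldTabletIds, long newTabletId) {
        this.oldTabletIds = List.copyOf(oldTabletIds);
        this.newTabletId = newTabletId;
    }

    public List<Long> getOldTabletIds() { return oldTabletIds; }
    public List<Long> getNewTabletIds() { return List.of(newTabletId); }
    public ReshardingInfoStandIn toProto() {
        return new ReshardingInfoStandIn(
                ReshardingInfoStandIn.Kind.MERGING, getOldTabletIds(), getNewTabletIds());
    }
}

// Assumption: an "identical" tablet is carried over one-to-one under a new ID.
final class IdenticalTabletSketch implements ReshardingTabletSketch {
    private final long oldTabletId;
    private final long newTabletId;

    IdenticalTabletSketch(long oldTabletId, long newTabletId) {
        this.oldTabletId = oldTabletId;
        this.newTabletId = newTabletId;
    }

    public List<Long> getOldTabletIds() { return List.of(oldTabletId); }
    public List<Long> getNewTabletIds() { return List.of(newTabletId); }
    public ReshardingInfoStandIn toProto() {
        return new ReshardingInfoStandIn(
                ReshardingInfoStandIn.Kind.IDENTICAL, getOldTabletIds(), getNewTabletIds());
    }
}

// Flattens any mix of resharding tablets into one list of payloads, similar
// in spirit to PublishTabletsInfo holding a single List<ReshardingTabletInfoPB>.
final class PublishInfoSketch {
    static List<ReshardingInfoStandIn> toProtos(List<ReshardingTabletSketch> tablets) {
        return tablets.stream().map(ReshardingTabletSketch::toProto).collect(Collectors.toList());
    }
}
```

With a single payload type, the publish path can treat splitting, merging, and identical tablets uniformly instead of maintaining one list per kind.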

Proto Changes:

  • Replaced ReshardingTabletsInfoPB with ReshardingTabletInfoPB (union-style message)
  • Added tablet_ranges field in PublishVersionResponse for returning new tablet ranges
  • Removed find_split_point RPC (functionality moved elsewhere)

Deleted Classes:

  • PhysicalPartitionContext: Replaced by ReshardingPhysicalPartition
  • ReshardingTablets: Logic distributed to new classes
  • ReshardingTabletContext: Renamed to ReshardingTabletInfo

Workflow

  1. PENDING → PREPARING: Set table state to TABLET_RESHARD, allocate transaction ID, update partition versions, add new tablets to inverted index, register resharding tablets
  2. PREPARING → RUNNING: Wait for previous versions to be published
  3. RUNNING → CLEANING: Publish split transaction to CN, update tablet ranges, add new materialized indexes to catalog
  4. CLEANING → FINISHED: Wait for in-flight transactions to complete, remove old materialized indexes, restore table state to NORMAL
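
The four transitions above can be sketched as a small state machine. This is an illustrative sketch, not the PR's SplitTabletJob: the class SplitJobSketch, the two BooleanSupplier gates, the one-transition-per-call runOnce() method, and the rule that abort is only accepted before publish are all assumptions.

```java
import java.util.function.BooleanSupplier;

// States named in the PR description.
enum JobState { PENDING, PREPARING, RUNNING, CLEANING, FINISHED, ABORTING, ABORTED }

// Hypothetical sketch of a job that advances at most one state per runOnce() call.
final class SplitJobSketch {
    private final BooleanSupplier previousVersionsPublished; // gate for PREPARING -> RUNNING
    private final BooleanSupplier inFlightTxnsFinished;      // gate for CLEANING -> FINISHED
    private JobState state = JobState.PENDING;

    SplitJobSketch(BooleanSupplier previousVersionsPublished, BooleanSupplier inFlightTxnsFinished) {
        this.previousVersionsPublished = previousVersionsPublished;
        this.inFlightTxnsFinished = inFlightTxnsFinished;
    }

    JobState getState() {
        return state;
    }

    void runOnce() {
        switch (state) {
            case PENDING:
                // Real job: set table state to TABLET_RESHARD, allocate a transaction ID,
                // update partition versions, register new and resharding tablets.
                state = JobState.PREPARING;
                break;
            case PREPARING:
                // Stay here until all previous versions are published.
                if (previousVersionsPublished.getAsBoolean()) {
                    state = JobState.RUNNING;
                }
                break;
            case RUNNING:
                // Real job: publish the split transaction, update tablet ranges,
                // add new materialized indexes. Simplified here to always succeed.
                state = JobState.CLEANING;
                break;
            case CLEANING:
                // Stay here until in-flight transactions complete, then remove old
                // materialized indexes and restore the table state to NORMAL.
                if (inFlightTxnsFinished.getAsBoolean()) {
                    state = JobState.FINISHED;
                }
                break;
            case ABORTING:
                // Undo whatever the earlier steps registered, then stop.
                state = JobState.ABORTED;
                break;
            default:
                break; // FINISHED and ABORTED are terminal
        }
    }

    // Assumption of this sketch: abort is only accepted before the split is published.
    boolean abort() {
        if (state == JobState.PENDING || state == JobState.PREPARING) {
            state = JobState.ABORTING;
            return true;
        }
        return false;
    }
}
```

Because each call performs at most one transition and the waiting states simply return without changing anything, a scheduler can call runOnce() repeatedly, and a replay after a crash can resume from whichever state was last persisted.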

Testing

Added SplitTabletJobTest covering:

  • Normal job execution flow
  • Job replay for crash recovery
  • Job abort scenarios

Fixes #64986

What type of PR is this:

  • [ ] BugFix
  • [x] Feature
  • [ ] Enhancement
  • [ ] Refactor
  • [ ] UT
  • [ ] Doc
  • [ ] Tool

Does this PR entail a change in behavior?

  • [ ] Yes, this PR will result in a change in behavior.
  • [x] No, this PR will not result in a change in behavior.

If yes, please specify the type of change:

  • [ ] Interface/UI changes: syntax, type conversion, expression evaluation, display information
  • [ ] Parameter changes: default values, similar parameters but with different default values
  • [ ] Policy changes: use new policy to replace old one, functionality automatically enabled
  • [ ] Feature removed
  • [ ] Miscellaneous: upgrade & downgrade compatibility, etc.

Checklist:

  • [x] I have added test cases for my bug fix or my new feature
  • [ ] This pr needs user documentation (for new or modified features or behaviors)
    • [ ] I have added documentation for my new feature or new function
  • [ ] This is a backport pr

Bugfix cherry-pick branch check:

  • [x] I have checked the version labels which the pr will be auto-backported to the target branch
    • [ ] 4.0
    • [ ] 3.5
    • [ ] 3.4
    • [ ] 3.3

[!NOTE] Implements the FE tablet-splitting reshard job with a refactored resharding model, updated lake-service protos, new config/property semantics, and publish-version support for returning new tablet ranges.

  • FE/Resharding (Shared-data):
    • Introduces SplitTabletJob with full lifecycle (PENDING→FINISHED), registering/unregistering resharding tablets and updating ranges after publish.
    • Adds ReshardingPhysicalPartition and ReshardingMaterializedIndex; refactors TabletReshardJob to abstract base; renames context to ReshardingTabletInfo.
    • Simplifies PublishTabletsInfo to use List<ReshardingTabletInfoPB>; updates ReshardingTablet to expose getOldTabletIds(), getNewTabletIds(), and toProto().
    • Updates SplittingTablet/MergingTablet/IdenticalTablet to emit new proto payloads.
    • Adds TabletRange plus Tuple/Variant converters.
    • TabletReshardJobMgr now tracks ReshardingTabletInfo.
  • Proto/BE integration:
    • Replaces ReshardingTabletsInfoPB with union-style ReshardingTabletInfoPB.
    • PublishVersionResponse includes tablet_ranges; requests carry resharding_tablet_infos.
    • Removes find_split_point RPC; plumbs new fields through Utils aggregate/single publish flows.
  • SQL/Config:
    • Replaces tablet_reshard_split_size with tablet_reshard_target_size; raises max split count; parser/analyzer and SplitTabletClause adjusted.
    • Adds lock helper ctor AutoCloseableLock(dbId, tableId, ...).
  • Tests:
    • Adds/updates UTs for publish info, split job lifecycle/replay/abort, job mgr, parser for new property.
  • Removals:
    • Deletes PhysicalPartitionContext and ReshardingTablets; renames ReshardingTabletContext to ReshardingTabletInfo.

Written by Cursor Bugbot for commit 8ecf3301d09d9e50f2735d88c188ce008ce2b46d. This will update automatically on new commits.

xiangguangyxg • Dec 04 '25 16:12

🧪 CI Insights

Here's what we observed from your CI run for 8ecf3301.

🟢 All jobs passed!

But CI Insights is watching 👀

mergify[bot] • Dec 04 '25 16:12

@cursor review

alvin-celerdata • Dec 04 '25 21:12

@cursor review

alvin-celerdata • Dec 05 '25 04:12

@cursor review

alvin-celerdata • Dec 09 '25 17:12

@cursor review

alvin-celerdata • Dec 10 '25 04:12

@cursor review

alvin-celerdata • Dec 10 '25 06:12

@cursor review

alvin-celerdata • Dec 10 '25 17:12

@cursor review

alvin-celerdata • Dec 11 '25 15:12

@cursor review

alvin-celerdata • Dec 12 '25 04:12

[Java-Extensions Incremental Coverage Report]

✅ pass : 0 / 0 (0%)

github-actions[bot] • Dec 12 '25 08:12

[FE Incremental Coverage Report]

✅ pass : 432 / 540 (80.00%)

file detail

| path | covered_line | new_line | coverage | not_covered_line_detail |
|---|---|---|---|---|
| 🔵 com/starrocks/catalog/Tuple.java | 0 | 2 | 00.00% | [60, 64] |
| 🔵 com/starrocks/alter/reshard/TabletReshardException.java | 0 | 4 | 00.00% | [23, 24, 27, 28] |
| 🔵 com/starrocks/alter/reshard/TabletReshardUtils.java | 0 | 6 | 00.00% | [22, 23, 27, 28, 29, 31] |
| 🔵 com/starrocks/sql/analyzer/AlterTableClauseAnalyzer.java | 0 | 2 | 00.00% | [1271, 1272] |
| 🔵 com/starrocks/catalog/Variant.java | 0 | 2 | 00.00% | [81, 85] |
| 🔵 com/starrocks/common/util/PropertyAnalyzer.java | 0 | 5 | 00.00% | [1615, 1617, 1618, 1621, 1628] |
| 🔵 com/starrocks/alter/reshard/MergingTablet.java | 5 | 8 | 62.50% | [61, 65, 69] |
| 🔵 com/starrocks/catalog/TabletRange.java | 9 | 14 | 64.29% | [27, 28, 29, 40, 41] |
| 🔵 com/starrocks/sql/ast/SplitTabletClause.java | 2 | 3 | 66.67% | [35] |
| 🔵 com/starrocks/alter/reshard/SplitTabletJobFactory.java | 46 | 64 | 71.88% | [124, 125, 131, 176, 177, 179, 180, 181, 183, 186, 192, 199, 200, 201, 202, 206, 210, 211] |
| 🔵 com/starrocks/lake/Utils.java | 9 | 12 | 75.00% | [181, 182, 183] |
| 🔵 com/starrocks/alter/reshard/IdenticalTablet.java | 7 | 9 | 77.78% | [60, 69] |
| 🔵 com/starrocks/alter/reshard/ReshardingPhysicalPartition.java | 28 | 36 | 77.78% | [86, 92, 93, 94, 95, 96, 97, 98] |
| 🔵 com/starrocks/alter/reshard/TabletReshardJob.java | 10 | 12 | 83.33% | [121, 199] |
| 🔵 com/starrocks/alter/reshard/SplitTabletJob.java | 263 | 307 | 85.67% | [84, 88, 92, 96, 153, 160, 161, 163, 166, 192, 216, 237, 238, 271, 273, 274, 308, 309, 381, 382, 385, 386, 398, 403, 406, 409, 423, 424, 432, 433, 445, 459, 460, 473, 502, 503, 504, 515, 532, 547, 554, 588, 589, 590] |
| 🔵 com/starrocks/alter/reshard/SplittingTablet.java | 7 | 8 | 87.50% | [55] |
| 🔵 com/starrocks/alter/reshard/ReshardingTabletInfo.java | 6 | 6 | 100.00% | [] |
| 🔵 com/starrocks/common/Config.java | 2 | 2 | 100.00% | [] |
| 🔵 com/starrocks/alter/reshard/PublishTabletsInfo.java | 9 | 9 | 100.00% | [] |
| 🔵 com/starrocks/alter/reshard/ReshardingMaterializedIndex.java | 13 | 13 | 100.00% | [] |
| 🔵 com/starrocks/common/util/concurrent/lock/AutoCloseableLock.java | 2 | 2 | 100.00% | [] |
| 🔵 com/starrocks/alter/reshard/TabletReshardJobMgr.java | 8 | 8 | 100.00% | [] |
| 🔵 com/starrocks/persist/gson/GsonUtils.java | 6 | 6 | 100.00% | [] |

github-actions[bot] • Dec 12 '25 08:12

[BE Incremental Coverage Report]

✅ pass : 0 / 0 (0%)

github-actions[bot] • Dec 12 '25 08:12