pinot icon indicating copy to clipboard operation
pinot copied to clipboard

Ideal state instance partitions for better replica group based routing

Open yashmayya opened this issue 3 months ago • 1 comments

  • Fix for https://github.com/apache/pinot/issues/17179:
  • The proposal to fix this issue is an overhaul of the MultiStageReplicaGroupSelector logic. We can track the instance partitions in the ideal state (through new list fields); these lists will contain both the old set of instances and the new set of instances during a rebalance with instance reassignment. After the rebalance is successfully completed, the lists will be updated to contain only the new set of instances. The MultiStageReplicaGroupSelector can use this new metadata from the ideal state to choose the instances for a request, instead of relying on the ZK instance partitions. This also paves the way for other replica group based routing strategies.
  • The difference between the ideal state instance partitions and the instance partitions stored separately in the property store in ZK is that the ideal state version will be used for query routing (and can contain intermediate states) whereas the dedicated instance partitions ZNode will always contain the target instance partitions which is used for making assignment decisions for any new segments (except for upsert tables).
  • The existing algorithm in MultiStageReplicaGroupSelector is retained as a fallback because existing tables won't have the necessary metadata to support the new algorithm until and unless a rebalance is performed with some actual instance reassignment. (Edit: also because the new ideal state instance partitions metadata needs to be explicitly enabled using a new controller config controller.enable.ideal.state.instance.partitions).
  • The remaining edge case is in rare scenarios where a server is moved from one replica group / instance partition to another in a rebalance. In most regular scenarios like changing server tags and adding / removing instances with minimize data movement enabled, this shouldn't occur, but there's nothing in the instance assignment algorithm that precludes this possibility. As a future enhancement for accurate replica group based routing strategies, the instance assignment algorithm should be updated to make sure that instances don't move from one replica group / partition to another (at least in a single direct step).

yashmayya avatar Nov 14 '25 20:11 yashmayya

Codecov Report

:x: Patch coverage is 81.20000% with 47 lines in your changes missing coverage. Please review. :white_check_mark: Project coverage is 63.27%. Comparing base (5b2b90d) to head (24dac57). :warning: Report is 45 commits behind head on master.

Files with missing lines Patch % Lines
...ntroller/helix/core/rebalance/TableRebalancer.java 58.82% 13 Missing and 8 partials :warning:
...ntroller/helix/core/PinotHelixResourceManager.java 60.71% 10 Missing and 1 partial :warning:
...ources/PinotInstanceAssignmentRestletResource.java 89.33% 6 Missing and 2 partials :warning:
...routing/instanceselector/BaseInstanceSelector.java 81.81% 2 Missing and 2 partials :warning:
...stanceselector/MultiStageReplicaGroupSelector.java 94.11% 1 Missing and 1 partial :warning:
...ker/routing/instanceselector/InstanceSelector.java 0.00% 1 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##             master   #17215    +/-   ##
==========================================
  Coverage     63.27%   63.27%            
- Complexity     1474     1476     +2     
==========================================
  Files          3161     3167     +6     
  Lines        188427   189255   +828     
  Branches      28831    28962   +131     
==========================================
+ Hits         119218   119758   +540     
- Misses        59983    60202   +219     
- Partials       9226     9295    +69     
Flag Coverage Δ
custom-integration1 100.00% <ø> (ø)
integration 100.00% <ø> (ø)
integration1 100.00% <ø> (ø)
integration2 0.00% <ø> (ø)
java-11 63.23% <81.20%> (-0.01%) :arrow_down:
java-21 63.25% <81.20%> (+0.03%) :arrow_up:
temurin 63.27% <81.20%> (+<0.01%) :arrow_up:
unittests 63.27% <81.20%> (+<0.01%) :arrow_up:
unittests1 55.57% <91.66%> (-0.09%) :arrow_down:
unittests2 34.08% <81.20%> (+0.11%) :arrow_up:

Flags with carried forward coverage won't be shown. Click here to find out more.

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

:rocket: New features to boost your workflow:
  • :package: JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

codecov-commenter avatar Nov 14 '25 21:11 codecov-commenter