pinot icon indicating copy to clipboard operation
pinot copied to clipboard

[colocated-join] Add Support for Deterministic Segment Assignment Strategy

Open ankitsultana opened this issue 2 years ago • 6 comments

Will use this PR to discuss some questions offline and update details here afterwards.

ankitsultana avatar Aug 11 '22 18:08 ankitsultana

Codecov Report

Merging #9199 (7f79c90) into master (802c596) will decrease coverage by 0.02%. The diff coverage is 54.54%.

:exclamation: Current head 7f79c90 differs from pull request most recent head 8ada81e. Consider uploading reports for the commit 8ada81e to get more accurate results

@@             Coverage Diff              @@
##             master    #9199      +/-   ##
============================================
- Coverage     68.74%   68.71%   -0.03%     
+ Complexity     4997     4755     -242     
============================================
  Files          1852     1852              
  Lines         98848    98877      +29     
  Branches      15039    15048       +9     
============================================
- Hits          67951    67944       -7     
- Misses        26044    26087      +43     
+ Partials       4853     4846       -7     
Flag Coverage Δ
integration1 26.27% <6.06%> (-0.07%) :arrow_down:
unittests1 67.13% <ø> (+0.04%) :arrow_up:
unittests2 15.27% <54.54%> (-0.03%) :arrow_down:

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
.../assignment/segment/RealtimeSegmentAssignment.java 89.28% <25.00%> (-5.02%) :arrow_down:
...e/assignment/segment/OfflineSegmentAssignment.java 87.30% <33.33%> (-4.30%) :arrow_down:
...ore/assignment/segment/SegmentAssignmentUtils.java 99.42% <100.00%> (+0.04%) :arrow_up:
...data/manager/realtime/DefaultSegmentCommitter.java 0.00% <0.00%> (-80.00%) :arrow_down:
...a/manager/realtime/RealtimeSegmentDataManager.java 75.00% <0.00%> (-25.00%) :arrow_down:
...er/api/resources/LLCSegmentCompletionHandlers.java 43.56% <0.00%> (-18.82%) :arrow_down:
...data/manager/realtime/SegmentCommitterFactory.java 70.58% <0.00%> (-11.77%) :arrow_down:
.../filter/predicate/InPredicateEvaluatorFactory.java 72.80% <0.00%> (-9.61%) :arrow_down:
...inot/core/util/SegmentCompletionProtocolUtils.java 57.69% <0.00%> (-7.70%) :arrow_down:
...n/java/org/apache/pinot/common/utils/URIUtils.java 66.66% <0.00%> (-7.41%) :arrow_down:
... and 33 more

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

codecov-commenter avatar Aug 11 '22 19:08 codecov-commenter

This has the same behavior as setting numInstancesPerPartition to 1, and numPartitions to the partitions for the table. Currently we don't support explicitly setting partitions for RealtimeSegmentAssignment, so we should fix that instead of introducing this new strategy

I feel this is not exactly the same as numInstancesPerPartition=1, instead numInstancesPerPartition shall be the same as the number of replicas. In fact, the strategy is somewhat deterministic as the added stragy name suggests. And the main purpose is to make the assignment declarative and reduce the operational cost, e.g. we dont need to update numInstancesPerPartition when we increase numPartitions to scale out the table etc

yupeng9 avatar Aug 15 '22 19:08 yupeng9

This has the same behavior as setting numInstancesPerPartition to 1, and numPartitions to the partitions for the table. Currently we don't support explicitly setting partitions for RealtimeSegmentAssignment, so we should fix that instead of introducing this new strategy

I feel this is not exactly the same as numInstancesPerPartition=1, instead numInstancesPerPartition shall be the same as the number of replicas. In fact, the strategy is somewhat deterministic as the added stragy name suggests. And the main purpose is to make the assignment declarative and reduce the operational cost, e.g. we dont need to update numInstancesPerPartition when we increase numPartitions to scale out the table etc

This new strategy is trying to put all segments from the same segment partition to the same server, which is exactly the same as what numInstancesPerPartition = 1 is doing (and it is also deterministic). In order to achieve this, you should use the partition based replica-group assignment, and always put numInstancesPerPartition = 1. You may increase numPartitions if needed (should be very rare because it should be the same as table partitions, and changing it requires re-partitioning all the segments). Noted that numPartitions can be larger than the total instances within a replica-group. E.g. with 9 servers, you can have 3 replica-groups (3 servers per replica), then 18 partitions, and each server will host 6 partitions. You may scale it up to 18 servers, and run a rebalance without changing numInstancesPerPartition or numPartitions. After rebalancing, each server will host 3 partitions.

Jackie-Jiang avatar Aug 15 '22 23:08 Jackie-Jiang

I see whaat you mean. Then that's a reasonable suggestion. But I hope we can find a way to enforce this as it's not so easy for users to understand the details to setup colocation correctly. I feel either some alias, or some validation in the table config is needed.

yupeng9 avatar Aug 16 '22 20:08 yupeng9

@yupeng9 True. I discussed this with @ankitsultana and suggested that we may introduce a flag to mark colocated tables, then we can check the table config to enforce numInstancesPerPartition=1

Jackie-Jiang avatar Aug 16 '22 20:08 Jackie-Jiang

Currently we don't support explicitly setting partitions for RealtimeSegmentAssignment, so we should fix that instead of introducing this new strategy

Yeah I'll start working on it. I am not so sure about using numInstancesPerPartition=1, but at the same time I think I need to take a holistic look at the colocation setup once we have done some PoCs internally (ETA: 2 weeks from now).

Edit: Looks like you already added it: https://github.com/apache/pinot/commit/6c5d85ef7eeffcd4cddd639865b81764fb04dbf1. That's great

ankitsultana avatar Aug 26 '22 04:08 ankitsultana