MultiStageReplicaGroupSelector drops segments during a rebalance with instance reassignment
- https://github.com/apache/pinot/pull/15843#discussion_r2105706875
- Essentially, during the period when the instance partitions in ZK and the segment assignment in the ideal state are out of sync for a table (i.e., during a rebalance with instance reassignment), the `MultiStageReplicaGroupSelector` will drop segments.
- Consider this scenario (an extreme example where all servers are replaced, for illustrative purposes): a table initially has two replica groups, RG0: {instance-0, instance-1} and RG1: {instance-2, instance-3}. The table config is updated (say, a server tenant / tag change) and a rebalance with instance reassignment is triggered. The instance partitions in ZK are updated to RG0: {instance-100, instance-101} and RG1: {instance-102, instance-103}. After this change is made, and before the ideal state is fully updated by the table rebalancer to move all the segments to the new servers, the `MultiStageReplicaGroupSelector` will compute a `null` partitionId here, which leads to the returned segment-to-selected-instance map being empty.
- The proposal to fix this issue is an overhaul of the `MultiStageReplicaGroupSelector` logic. We can track the instance partitions in the ideal state (through new list fields); during a rebalance with instance reassignment, these lists will contain both the old and the new set of instances. After the rebalance completes successfully, the lists will be updated to contain only the new set of instances. The `MultiStageReplicaGroupSelector` can then use this new metadata from the ideal state to choose the instances for a request, instead of relying on the ZK instance partitions. This also paves the way for other replica-group-based routing strategies.
- The difference between the ideal state instance partitions and the instance partitions stored separately in the property store in ZK: the ideal state version will be used for query routing (and can contain intermediate states), whereas the dedicated instance partitions ZNode will always contain the target instance partitions, which are used for making assignment decisions for any new segments (except for upsert tables).
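The scenario above can be sketched in a few lines. This is a hypothetical, simplified model (the method and class names are illustrative, not Pinot's actual API): lookups built only from the *new* instance partitions fail to resolve segments still hosted on the old servers, which are then silently skipped; building the lookup from the union of old and new instance lists (as the proposal's ideal state metadata would allow) resolves every segment.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StaleInstancePartitionsSketch {

  // Build an instance -> replica-group lookup from a set of instance partitions.
  static Map<String, Integer> buildInstanceToReplicaGroup(List<List<String>> replicaGroups) {
    Map<String, Integer> lookup = new HashMap<>();
    for (int rg = 0; rg < replicaGroups.size(); rg++) {
      for (String instance : replicaGroups.get(rg)) {
        lookup.put(instance, rg);
      }
    }
    return lookup;
  }

  // Mimics the buggy behavior: a segment whose current server is unknown to the
  // (already updated) instance partitions resolves to a null replica group and
  // is silently skipped -- no error, no partial-result flag.
  static Map<String, String> select(Map<String, String> segmentToServer,
                                    Map<String, Integer> instanceToReplicaGroup) {
    Map<String, String> result = new HashMap<>();
    for (Map.Entry<String, String> e : segmentToServer.entrySet()) {
      Integer replicaGroup = instanceToReplicaGroup.get(e.getValue());
      if (replicaGroup == null) {
        continue; // segment dropped
      }
      result.put(e.getKey(), e.getValue());
    }
    return result;
  }

  public static void main(String[] args) {
    // ZK instance partitions already point at the new servers...
    List<List<String>> newPartitions = List.of(
        List.of("instance-100", "instance-101"),   // RG0 (new)
        List.of("instance-102", "instance-103"));  // RG1 (new)
    // ...but the ideal state still maps segments to the old servers.
    Map<String, String> segmentToServer = Map.of(
        "segment-0", "instance-0",
        "segment-1", "instance-2");

    // Buggy path: every segment is dropped.
    Map<String, Integer> newOnly = buildInstanceToReplicaGroup(newPartitions);
    System.out.println(select(segmentToServer, newOnly).size()); // 0

    // Proposed direction: track old + new lists during the rebalance, so the
    // lookup covers both and no segment is dropped.
    List<List<String>> oldAndNew = List.of(
        List.of("instance-0", "instance-1", "instance-100", "instance-101"),
        List.of("instance-2", "instance-3", "instance-102", "instance-103"));
    Map<String, Integer> union = buildInstanceToReplicaGroup(oldAndNew);
    System.out.println(select(segmentToServer, union).size()); // 2
  }
}
```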
@shauryachats : can you review the PR whenever it is raised?
Do remember that the key requirement this selector was enabling is ensuring that the same set of instances is picked for all tables involved in the query, to enable Colocated Joins (assuming all of them re-use the same `instancePartitionsMap`).
@ankitsultana that makes sense - however, in these scenarios where a table's instance assignment is updated and it is rebalanced, there will most likely be a brief period where the two tables that are supposed to be colocated will no longer be colocated (until the second table is also rebalanced with instance reassignment enabled). During this period, the existing behavior of `MultiStageReplicaGroupSelector` is to silently drop segments in the `SelectionResult`, as demonstrated in https://github.com/apache/pinot/pull/17180. This then leads to incorrect query results that aren't even flagged as such. Is that the desired behavior?
In the proposed fix in https://github.com/apache/pinot/pull/17215, the new behavior in these rebalance scenarios would be to continue fully serving queries, but potentially without supporting colocation (since the instance partitions of the two tables could be out of sync, and so could the segment assignments). We're planning to introduce the new ideal state instance partitions metadata anyway, because it is useful to us for other reasons, but in case you want to retain the older behavior for `MultiStageReplicaGroupSelector`, I'm happy to revert those changes. The new ideal state instance partitions metadata will be gated behind a new controller config defaulting to false, so `MultiStageReplicaGroupSelector` will retain its default behavior unless that config is explicitly enabled and all the tables are rebalanced so that their ideal states contain the new routing metadata.
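The gating described above amounts to a fallback ordering, which can be sketched as follows. All names here are placeholders (the real controller config key and ideal state fields in Pinot differ): route from the ideal-state lists only when the feature is enabled and the table's ideal state already carries them; otherwise keep today's behavior of routing from the ZK instance partitions.

```java
import java.util.List;

public class RoutingMetadataGate {

  // Hypothetical helper: choose which instance list drives routing.
  static List<String> pickRoutingInstances(boolean idealStateRoutingEnabled,
                                           List<String> idealStateInstances,
                                           List<String> zkInstancePartitions) {
    if (idealStateRoutingEnabled && idealStateInstances != null
        && !idealStateInstances.isEmpty()) {
      return idealStateInstances; // new path: ideal state metadata present
    }
    return zkInstancePartitions;  // default path: ZK instance partitions
  }

  public static void main(String[] args) {
    List<String> zk = List.of("instance-100", "instance-101");
    List<String> is = List.of("instance-0", "instance-1", "instance-100", "instance-101");

    // Config off (the default): ZK instance partitions keep being used.
    System.out.println(pickRoutingInstances(false, is, zk));
    // Config on and metadata present: route from the ideal-state lists.
    System.out.println(pickRoutingInstances(true, is, zk));
    // Config on but table not yet rebalanced (no metadata): fall back to ZK.
    System.out.println(pickRoutingInstances(true, null, zk));
  }
}
```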
Based on https://github.com/apache/pinot/pull/15843 though, I'm guessing that the existing behavior is not intentional or desirable (since the existing logic would also break colocation support in case of partial segment unavailability where we select instances across replica groups).
cc - @Jackie-Jiang