kafka icon indicating copy to clipboard operation
kafka copied to clipboard

KAFKA-15170: CooperativeStickyAssignor may fail adjust assignment

Open flashmouse opened this issue 2 years ago • 4 comments

ConstrainedAssignmentBuilder in AbstractStickyAssignor cannot build partitionsTransferringOwnership correctly, that would result in execute adjustAssignment fail in CooperativeStickyAssignor

flashmouse avatar Jul 06 '23 11:07 flashmouse

please note another bugs in this implementation.

the function assignRackAwareRoundRobin of ConstrainedAssignmentBuilder have 2 logical error:

  1. (about line 726)when assignmentCount >= minQuota, the consumer should be added to

unfilledMembersWithExactlyMinQuotaPartitions only if

currentNumMembersWithOverMinQuotaPartitions < expectedNumMembersWithOverMinQuotaPartitions

  1. (about line 739)unfilledMembersWithExactlyMinQuotaPartitions should execute clear when

assignment.get(consumer).size() + 1 == maxQuota.

this error will lead to fail execute verifyUnfilledMembers, throw exception, raise new rebalance and return the same assignment result so group may stuck forever.

flashmouse avatar Jul 13 '23 07:07 flashmouse

please note another bugs in this implementation.

the function assignRackAwareRoundRobin of ConstrainedAssignmentBuilder have 2 logical error:

  1. (about line 726)when assignmentCount >= minQuota, the consumer should be added to

unfilledMembersWithExactlyMinQuotaPartitions only if

currentNumMembersWithOverMinQuotaPartitions < expectedNumMembersWithOverMinQuotaPartitions

  1. (about line 739)unfilledMembersWithExactlyMinQuotaPartitions should execute clear when

assignment.get(consumer).size() + 1 == maxQuota.

this error will lead to fail execute verifyUnfilledMembers, throw exception, raise new rebalance and return the same assignment result so group may stuck forever.

I suggest verifyUnfilledMembers shouldn't throw Exception because the error it checks only caused to assignment not full balance, but is still a valid result, we cannot allow rebalance stuck forever in such situation.

flashmouse avatar Jul 13 '23 08:07 flashmouse

@rreddy-22 I see you are familiar with this implementation, plz help me check this, thx!

flashmouse avatar Jul 13 '23 08:07 flashmouse

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please ask a committer for review. If the PR has merge conflicts, please update it with the latest from trunk (or appropriate release branch)

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

github-actions[bot] avatar Nov 06 '23 03:11 github-actions[bot]

@flashmouse can you rebase this PR please?

ableegoldman avatar May 14 '24 03:05 ableegoldman

updated to latest commit now @ableegoldman

flashmouse avatar May 14 '24 06:05 flashmouse

Test failures are unrelated, merging to trunk

ableegoldman avatar May 14 '24 23:05 ableegoldman

Merged to trunk and cherrypicked back to 3.7

ableegoldman avatar May 15 '24 00:05 ableegoldman