KAFKA-17306; Soften the validation when replaying tombstones
This patch fixes a few bugs in the replay logic of the consumer group records:
- The first issue is that the logic assumed that the group or the member exists when tombstones are replayed. Obviously, this is incorrect after a restart. The group or the member may not be there anymore if the __consumer_offsets partition only contains tombstones for the group or the member. The patch fixes this by considering tombstones as no-ops if the entity does not exist.
- The second issue is that the logic assumed that consumer group records are always written to the log in a specific order, so it only accepted creating a consumer group when a ConsumerGroupMemberMetadata record was replayed. This is obviously incorrect too. During the lifetime of a consumer group, the records may appear in a different order. The patch fixes this by allowing any consumer group record to create the group.
- The third issue is that it is possible to replay offset commit records for a consumer group before the consumer group itself is created while replaying its records. By default, the OffsetMetadataManager creates a simple classic group to hold those offset commits. When the consumer group records are finally replayed, the logic fails because a classic group already exists. The patch fixes this by converting the simple classic group when records for a consumer group are replayed.
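To illustrate the first two fixes, here is a minimal, self-contained sketch (not the actual coordinator code; class and method names are hypothetical) of replay logic where a tombstone for a missing group or member is a no-op, and where any record may create the group on first sight:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the replay fix. A null value models a
// tombstone. Before the fix, replaying a tombstone for an unknown group or
// member would fail; after it, the tombstone is simply a no-op.
public class ReplayDemo {
    // groupId -> (memberId -> member metadata)
    private final Map<String, Map<String, String>> groups = new HashMap<>();

    public void replayMemberRecord(String groupId, String memberId, String value) {
        if (value == null) {
            Map<String, String> group = groups.get(groupId);
            if (group == null) {
                return; // Tombstone for a missing group: no-op after the fix.
            }
            group.remove(memberId); // Missing member: also a no-op.
            if (group.isEmpty()) {
                groups.remove(groupId);
            }
            return;
        }
        // Any record may create the group (previously, only a
        // ConsumerGroupMemberMetadata record was allowed to).
        groups.computeIfAbsent(groupId, id -> new HashMap<>()).put(memberId, value);
    }

    public boolean groupExists(String groupId) {
        return groups.containsKey(groupId);
    }

    public static void main(String[] args) {
        ReplayDemo demo = new ReplayDemo();
        // A compacted partition may only contain tombstones for a group;
        // replaying them after a restart must not throw.
        demo.replayMemberRecord("grp", "member-1", null);
        System.out.println(demo.groupExists("grp")); // false

        demo.replayMemberRecord("grp", "member-1", "epoch=5");
        demo.replayMemberRecord("grp", "member-1", null);
        System.out.println(demo.groupExists("grp")); // false: empty group removed
    }
}
```

This mirrors the interleavings the fix has to tolerate: after compaction, the log may start with tombstones, or with records in an order that differs from the one in which they were originally written.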
All those combinations are hard to cover with unit tests. This patch adds an integration test which reproduces some of those interleavings of records. I used it to reproduce the issues described above.
We really need to work on adding simulation tests for the new group coordinator. I would not have thought about covering compaction in them, but we should: we should compact the log and also trigger a reload of the coordinator.
Committer Checklist (excluded from commit message)
- [ ] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
@jeffkbkim @frankvicky Thanks for your comments. I have addressed them.
thanks, left a minor comment. It seems like we have a few test failures related to this PR.
@jeffkbkim Which tests are you referring to? They actually look pretty good to me.
i see kafka.api.SaslSslConsumerTest.testCoordinatorFailover(String, String).quorum=kraft+kip848.groupProtocol=consumer failing. is this unrelated?
I checked but I could not link it to my changes. It seems to be a flaky test.
@dajac Could you please fix conflicts?
@chia7712 I will. I’d like to merge https://github.com/apache/kafka/pull/17057 before this one.
> I’d like to merge https://github.com/apache/kafka/pull/17057 before this one.
done :smile:
will give the +1 when the conflicts are fixed. Thanks!
@chia7712 @jolshan The PR is ready for another round.
Looks like there is another thread leak in the tests. I don't think it came from this change. I will restart the tests.
@chia7712 @jolshan I addressed the last comments.