kafka icon indicating copy to clipboard operation
kafka copied to clipboard

KAFKA-17306; Soften the validation when replaying tombstones

Open dajac opened this issue 1 year ago • 4 comments

This patch fixes a few buts in the replay logic of the consumer group records:

  • The first issue is that the logic assumed that the group or the member exists when tombstones are replayed. Obviously, this is incorrect after a restart. The group or the member may not me there anymore if the __consumer_offsets partitions only contains tombstones for the group or the member. The patch fixes this by considering tombstones as no-ops if the entity does not exist.
  • The second issue is that the logic assumed that consumer group records are always in a specific order in the log so the logic was only accepting to create a consumer group when ConsumerGroupMemberMetadata record is replayed. This is obviously incorrect too. During the life time of a consumer group, the records may be in different order. The patch fixes this by allowing the creating of a consumer group by any record.
  • The third issue is that it is possible to replay offset commit records for a specific consumer group before the consumer group is actually created while replying its records. By default the OffsetMetadataManager creates a simple classic group to hold those offset commits. When the consumer offset records are finally replayed, the logic will fail because a classic group already exists. The patch fixes this by converting a simple classic group when records for a consumer group are replayed.

All those combinations are hard to test with unit tests. This patch adds an integration tests which reproduces some of those interleaving of records. I used them to reproduce the issues describe above.

We really need to work on adding simulation tests for the new group coordinator. I would not have thought about covering the compaction in those but we should do it. We should compact the log and also trigger reload of the coordinator.

Committer Checklist (excluded from commit message)

  • [ ] Verify design and implementation
  • [ ] Verify test coverage and CI build status
  • [ ] Verify documentation (including upgrade notes)

dajac avatar Aug 16 '24 12:08 dajac

@jeffkbkim @frankvicky Thanks for your comments. I have addressed them.

dajac avatar Aug 19 '24 12:08 dajac

thanks, left a minor comment. Seems like we have a bit of test failures related to this PR

@jeffkbkim Which tests are you referring to? They actually look pretty good to me.

dajac avatar Aug 21 '24 09:08 dajac

i see kafka.api.SaslSslConsumerTest.testCoordinatorFailover(String, String).quorum=kraft+kip848.groupProtocol=consumer failing. is this unrelated?

jeffkbkim avatar Aug 22 '24 16:08 jeffkbkim

i see kafka.api.SaslSslConsumerTest.testCoordinatorFailover(String, String).quorum=kraft+kip848.groupProtocol=consumer failing. is this unrelated?

I checked but I could not link it to my changes. It seems to be a flaky test.

dajac avatar Aug 23 '24 11:08 dajac

@dajac Could you please fix conflicts?

chia7712 avatar Sep 04 '24 08:09 chia7712

@chia7712 I will. I’d like to merge https://github.com/apache/kafka/pull/17057 before this one.

dajac avatar Sep 05 '24 05:09 dajac

I’d like to merge https://github.com/apache/kafka/pull/17057 before this one.

done :smile:

chia7712 avatar Sep 05 '24 05:09 chia7712

will give the +1 when the conflicts are fixed. Thanks!

jolshan avatar Sep 06 '24 16:09 jolshan

@chia7712 @jolshan The PR is ready for another round.

dajac avatar Sep 09 '24 12:09 dajac

Looks like there is another thread leak in the tests. I don't think it came from this change. I will restart the tests.

jolshan avatar Sep 09 '24 18:09 jolshan

@chia7712 @jolshan I addressed the last comments.

dajac avatar Sep 10 '24 08:09 dajac