kafka icon indicating copy to clipboard operation
kafka copied to clipboard

KAFKA-18024 ConcurrentModificationException in Kafka OffsetFetcher - Proposal for Thread-Safety Fix - Update OffsetFetcher.java

Open kamil-adam-nowak opened this issue 1 year ago • 10 comments

KAFKA-18024 ConcurrentModificationException in Kafka OffsetFetcher - Proposal for Thread-Safety Fix replaced HashMap with ConcurrentHashMap to ensure thread safety during concurrent modifications

More detailed description available in KAFKA-18024

All tests before and after the change pass in the same way. Unfortunately, I did not add additional tests specifically for this change, as it is difficult to test this type of modification.

Committer Checklist (excluded from commit message)

  • [x] Verify design and implementation
  • [ ] Verify test coverage and CI build status
  • [ ] Verify documentation (including upgrade notes)

@kamalcph

kamil-adam-nowak avatar Nov 15 '24 15:11 kamil-adam-nowak

Thanks for the patch, LGTM. The traversal and removal of entries from Map can happen concurrently due to retries.

@chia7712 Could you please take a second look? Thanks!

kamalcph avatar Nov 16 '24 16:11 kamalcph

@mimaison @frankvicky @FrankYang0529 @gongxuanzhang @chia7712 Can you please take a second look?

kamil-adam-nowak avatar Nov 29 '24 08:11 kamil-adam-nowak

@lianetm—IIRC, you looked at the offset fetch logic a fair amount in the past. Pinging you so you can weigh in (if you want).

kirktrue avatar Dec 14 '24 01:12 kirktrue

@kamalcph @kirktrue What should we do now? Are we waiting for additional verification, or could you merge this? We are still waiting after months, even though the fix is simple.

kamil-adam-nowak avatar Jan 30 '25 10:01 kamil-adam-nowak

That's interesting. remainingToSearch is a local variable, and it's modified within a synchronized block (synchronized (future)). How can it be modified concurrently? Have you tested the 3.9.0 version?

chia7712 avatar Jan 30 '25 21:01 chia7712

I tested version 3.5.1:
https://github.com/apache/kafka/blob/3.5.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetFetcher.java#L228

In this version, there is a synchronized (future) block too, but I still got a ConcurrentModificationException.

Only the future object is synchronized; remainingToSearch is not.
It looks like another thread modified remainingToSearch while our thread was inside the synchronized (future) block.

kamil-adam-nowak avatar Feb 03 '25 11:02 kamil-adam-nowak

@kamalcph @kirktrue @chia7712

I'm not entirely sure if my patch is the perfect solution, but it works for me and helps resolve the issue. As seen in the stacktrace, the error occurs in the version without this patch, which indicates that something is definitely wrong. It might be worth merging this patch or, at the very least, exploring alternative fixes to address the problem.

Thanks!

kamil-adam-nowak avatar Mar 07 '25 11:03 kamil-adam-nowak

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please leave a comment asking for a review. If the PR has merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

github-actions[bot] avatar Jun 06 '25 03:06 github-actions[bot]

@lianetm @mimaison @frankvicky @FrankYang0529 @gongxuanzhang @chia7712 Can you please take a second look? @kamalcph wrote it looked good to him. The problem still exists. The bug is in the code - users shouldn’t be getting a ConcurrentModificationException, which definitely means something is wrong. My fix works for me, so it’s worth merging it or finding another solution to this problem.

kamil-adam-nowak avatar Jun 09 '25 14:06 kamil-adam-nowak

@kamil-adam-nowak—The fix seems fine, but the lack of understanding of the exact cause as well as the (understandable) lack of tests lowers the confidence of the reviewers. I don't think the fix would make things worse, though.

kirktrue avatar Jun 11 '25 22:06 kirktrue

This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please leave a comment asking for a review. If the PR has merge conflicts, update it with the latest from the base branch.

If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).

If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.

github-actions[bot] avatar Sep 10 '25 03:09 github-actions[bot]

This PR has been closed since it has not had any activity in 120 days. If you feel like this was a mistake, or you would like to continue working on it, please feel free to re-open the PR and ask for a review.

github-actions[bot] avatar Oct 10 '25 03:10 github-actions[bot]