KAFKA-18024 ConcurrentModificationException in Kafka OffsetFetcher - Proposal for Thread-Safety Fix - Update OffsetFetcher.java
KAFKA-18024 ConcurrentModificationException in Kafka OffsetFetcher - Proposal for Thread-Safety Fix
replaced HashMap with ConcurrentHashMap to ensure thread safety during concurrent modifications
More detailed description available in KAFKA-18024
All tests before and after the change pass in the same way. Unfortunately, I did not add additional tests specifically for this change, as it is difficult to test this type of modification.
Committer Checklist (excluded from commit message)
- [x] Verify design and implementation
- [ ] Verify test coverage and CI build status
- [ ] Verify documentation (including upgrade notes)
@kamalcph
Thanks for the patch, LGTM. The traversal and removal of entries from Map can happen concurrently due to retries.
@chia7712 Could you please take a second look? Thanks!
@mimaison @frankvicky @FrankYang0529 @gongxuanzhang @chia7712 Can you please take a second look?
@lianetm—IIRC, you looked at the offset fetch logic a fair amount in the past. Pinging you so you can weigh in (if you want).
@kamalcph @kirktrue What should we do now? Are we waiting for additional verification, or could you merge this? We are still waiting after months, even though the fix is simple.
That's interesting. remainingToSearch is a local variable, and it's modified within a synchronized block (synchronized (future)). How can it be modified concurrently? Have you tested the 3.9.0 version?
I tested version 3.5.1:
https://github.com/apache/kafka/blob/3.5.1/clients/src/main/java/org/apache/kafka/clients/consumer/internals/OffsetFetcher.java#L228
In this version, there is a synchronized (future) block too, but I still got a ConcurrentModificationException.
Only the future object is synchronized; remainingToSearch is not.
It looks like another thread modified remainingToSearch while our thread was inside the synchronized (future) block.
@kamalcph @kirktrue @chia7712
I'm not entirely sure if my patch is the perfect solution, but it works for me and helps resolve the issue. As seen in the stacktrace, the error occurs in the version without this patch, which indicates that something is definitely wrong. It might be worth merging this patch or, at the very least, exploring alternative fixes to address the problem.
Thanks!
This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please leave a comment asking for a review. If the PR has merge conflicts, update it with the latest from the base branch.
If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).
If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.
@lianetm @mimaison @frankvicky @FrankYang0529 @gongxuanzhang @chia7712 Can you please take a second look? @kamalcph wrote it looked good to him. The problem still exists. The bug is in the code - users shouldn’t be getting a ConcurrentModificationException, which definitely means something is wrong. My fix works for me, so it’s worth merging it or finding another solution to this problem.
@kamil-adam-nowak—The fix seems fine, but the lack of understanding of the exact cause as well as the (understandable) lack of tests lowers the confidence of the reviewers. I don't think the fix would make things worse, though.
This PR is being marked as stale since it has not had any activity in 90 days. If you would like to keep this PR alive, please leave a comment asking for a review. If the PR has merge conflicts, update it with the latest from the base branch.
If you are having difficulty finding a reviewer, please reach out on the [mailing list](https://kafka.apache.org/contact).
If this PR is no longer valid or desired, please feel free to close it. If no activity occurs in the next 30 days, it will be automatically closed.
This PR has been closed since it has not had any activity in 120 days. If you feel like this was a mistake, or you would like to continue working on it, please feel free to re-open the PR and ask for a review.