[Bug] ConcurrentModificationException is thrown when calling Roaring64NavigableMap.getLongCardinality
Code of Conduct
- [X] I agree to follow this project's Code of Conduct
Search before asking
- [X] I have searched in the issues and found no similar issues.
Describe the bug
https://github.com/apache/incubator-uniffle/actions/runs/9381222136/job/25830065788?pr=1763
Affects Version(s)
master
Uniffle Server Log Output
[INFO] Running org.apache.uniffle.server.buffer.ShuffleBufferManagerTest
Error: Tests run: 13, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.347 s <<< FAILURE! - in org.apache.uniffle.server.buffer.ShuffleBufferManagerTest
Error: bufferSizeTest Time elapsed: 0.646 s <<< ERROR!
java.util.ConcurrentModificationException
at java.util.TreeMap$NavigableSubMap$SubMapIterator.nextEntry(TreeMap.java:1703)
at java.util.TreeMap$NavigableSubMap$SubMapEntryIterator.next(TreeMap.java:1751)
at java.util.TreeMap$NavigableSubMap$SubMapEntryIterator.next(TreeMap.java:1745)
at org.roaringbitmap.longlong.Roaring64NavigableMap.ensureCumulatives(Roaring64NavigableMap.java:553)
at org.roaringbitmap.longlong.Roaring64NavigableMap.getLongCardinality(Roaring64NavigableMap.java:278)
at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.waitForFlush(ShuffleBufferManagerTest.java:637)
at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.bufferSizeTest(ShuffleBufferManagerTest.java:450)
Uniffle Engine Log Output
No response
Uniffle Server Configurations
No response
Uniffle Engine Configurations
No response
Additional context
No response
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
@rickyma After thinking about this issue and study Roaring64NavigableMap, I found that Roaring64NavigableMap is not thread-safe.
I decreased the sleep time and increased the retry max count to increase the race condition change like following changes.
private void waitForFlush(
ShuffleFlushManager shuffleFlushManager,
String appId,
int shuffleId,
int expectedBlockNum,
long expectedUsedMemory)
throws Exception {
int retry = 0;
long committedCount = 0;
do {
committedCount =
shuffleFlushManager.getCommittedBlockIds(appId, shuffleId).getLongCardinality();
if (committedCount < expectedBlockNum) {
Thread.sleep(5);
}
retry++;
if (retry > 1000) {
fail("Flush data time out");
}
} while (committedCount < expectedBlockNum);
// Need to wait for `event.doCleanup` to be executed
// to ensure the correctness of subsequent checks of
// `shuffleBufferManager.getUsedMemory()` and `shuffleBufferManager.getInFlushSize()`.
Awaitility.await()
.atMost(Duration.ofSeconds(5))
.until(() -> shuffleBufferManager.getUsedMemory() == expectedUsedMemory);
}
I encountered ConcurrentModificationException, ArrayIndexOutOfBoundsException and AssertError after 20 or more time executed test.
java.lang.ArrayIndexOutOfBoundsException: 3
at org.roaringbitmap.longlong.Roaring64NavigableMap.getLongCardinality(Roaring64NavigableMap.java:286)
at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.waitForFlush(ShuffleBufferManagerTest.java:657)
at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.bufferSizeTest(ShuffleBufferManagerTest.java:451)
java.lang.AssertionError: Computed 1 differs from dummy binary-search index: -1
at org.roaringbitmap.longlong.Roaring64NavigableMap.ensureOne(Roaring64NavigableMap.java:642)
at org.roaringbitmap.longlong.Roaring64NavigableMap.ensureCumulatives(Roaring64NavigableMap.java:567)
at org.roaringbitmap.longlong.Roaring64NavigableMap.getLongCardinality(Roaring64NavigableMap.java:278)
at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.waitForFlush(ShuffleBufferManagerTest.java:664)
at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.waitForFlush(ShuffleBufferManagerTest.java:648)
at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.bufferSizeTest(ShuffleBufferManagerTest.java:443)
[2024-09-07 09:50:26.752] [main] [ERROR] ShuffleBufferManagerTest.waitForFlush - Ignored exception.
java.util.ConcurrentModificationException
at java.util.TreeMap$NavigableSubMap$SubMapIterator.nextEntry(TreeMap.java:1703)
at java.util.TreeMap$NavigableSubMap$SubMapEntryIterator.next(TreeMap.java:1751)
at java.util.TreeMap$NavigableSubMap$SubMapEntryIterator.next(TreeMap.java:1745)
at org.roaringbitmap.longlong.Roaring64NavigableMap.ensureCumulatives(Roaring64NavigableMap.java:553)
at org.roaringbitmap.longlong.Roaring64NavigableMap.getLongCardinality(Roaring64NavigableMap.java:278)
at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.waitForFlush(ShuffleBufferManagerTest.java:664)
at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.bufferSizeTest(ShuffleBufferManagerTest.java:456)
In conclusion, I suggest to use lock or clone to avoid race condition, and we can use try catch to ignore this kind exception for testing.
I suggest to use lock or clone to avoid race condition
Seems fine.
we can use try catch to ignore this kind exception for testing
I think this is more of a hack than a fix.
I think this is more of a hack than a fix.
I'm not sure how many days and how difficult to solve this issue. As it is for testing purpose, do you think a fail and retry could be a simple way to make the test passed?
As motioned, the Roaring64NavigableMap is not thread-safe, and it is not worth to add lock for production code only to make test code avoid race.
You can refactor the test cases, or maybe just disable the tests.