incubator-uniffle icon indicating copy to clipboard operation
incubator-uniffle copied to clipboard

[Bug] ConcurrentModificationException is thrown when calling Roaring64NavigableMap.getLongCardinality

Open rickyma opened this issue 1 year ago • 4 comments

Code of Conduct

Search before asking

  • [X] I have searched in the issues and found no similar issues.

Describe the bug

https://github.com/apache/incubator-uniffle/actions/runs/9381222136/job/25830065788?pr=1763

Affects Version(s)

master

Uniffle Server Log Output

[INFO] Running org.apache.uniffle.server.buffer.ShuffleBufferManagerTest
Error:  Tests run: 13, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 6.347 s <<< FAILURE! - in org.apache.uniffle.server.buffer.ShuffleBufferManagerTest
Error:  bufferSizeTest  Time elapsed: 0.646 s  <<< ERROR!
java.util.ConcurrentModificationException
	at java.util.TreeMap$NavigableSubMap$SubMapIterator.nextEntry(TreeMap.java:1703)
	at java.util.TreeMap$NavigableSubMap$SubMapEntryIterator.next(TreeMap.java:1751)
	at java.util.TreeMap$NavigableSubMap$SubMapEntryIterator.next(TreeMap.java:1745)
	at org.roaringbitmap.longlong.Roaring64NavigableMap.ensureCumulatives(Roaring64NavigableMap.java:553)
	at org.roaringbitmap.longlong.Roaring64NavigableMap.getLongCardinality(Roaring64NavigableMap.java:278)
	at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.waitForFlush(ShuffleBufferManagerTest.java:637)
	at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.bufferSizeTest(ShuffleBufferManagerTest.java:450)

Uniffle Engine Log Output

No response

Uniffle Server Configurations

No response

Uniffle Engine Configurations

No response

Additional context

No response

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

rickyma avatar Jun 13 '24 03:06 rickyma

@rickyma After thinking about this issue and study Roaring64NavigableMap, I found that Roaring64NavigableMap is not thread-safe.

I decreased the sleep time and increased the retry max count to increase the race condition change like following changes.

 private void waitForFlush(
      ShuffleFlushManager shuffleFlushManager,
      String appId,
      int shuffleId,
      int expectedBlockNum,
      long expectedUsedMemory)
      throws Exception {
    int retry = 0;
    long committedCount = 0;
    do {
      committedCount =
          shuffleFlushManager.getCommittedBlockIds(appId, shuffleId).getLongCardinality();
      if (committedCount < expectedBlockNum) {
        Thread.sleep(5);
      }
      retry++;
      if (retry > 1000) {
        fail("Flush data time out");
      }
    } while (committedCount < expectedBlockNum);

    // Need to wait for `event.doCleanup` to be executed
    // to ensure the correctness of subsequent checks of
    // `shuffleBufferManager.getUsedMemory()` and `shuffleBufferManager.getInFlushSize()`.
    Awaitility.await()
        .atMost(Duration.ofSeconds(5))
        .until(() -> shuffleBufferManager.getUsedMemory() == expectedUsedMemory);
  }

I encountered ConcurrentModificationException, ArrayIndexOutOfBoundsException and AssertError after 20 or more time executed test.

java.lang.ArrayIndexOutOfBoundsException: 3

	at org.roaringbitmap.longlong.Roaring64NavigableMap.getLongCardinality(Roaring64NavigableMap.java:286)
	at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.waitForFlush(ShuffleBufferManagerTest.java:657)
	at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.bufferSizeTest(ShuffleBufferManagerTest.java:451)
image
java.lang.AssertionError: Computed 1 differs from dummy binary-search index: -1

	at org.roaringbitmap.longlong.Roaring64NavigableMap.ensureOne(Roaring64NavigableMap.java:642)
	at org.roaringbitmap.longlong.Roaring64NavigableMap.ensureCumulatives(Roaring64NavigableMap.java:567)
	at org.roaringbitmap.longlong.Roaring64NavigableMap.getLongCardinality(Roaring64NavigableMap.java:278)
	at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.waitForFlush(ShuffleBufferManagerTest.java:664)
	at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.waitForFlush(ShuffleBufferManagerTest.java:648)
	at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.bufferSizeTest(ShuffleBufferManagerTest.java:443)
[2024-09-07 09:50:26.752] [main] [ERROR] ShuffleBufferManagerTest.waitForFlush - Ignored exception.
java.util.ConcurrentModificationException
	at java.util.TreeMap$NavigableSubMap$SubMapIterator.nextEntry(TreeMap.java:1703)
	at java.util.TreeMap$NavigableSubMap$SubMapEntryIterator.next(TreeMap.java:1751)
	at java.util.TreeMap$NavigableSubMap$SubMapEntryIterator.next(TreeMap.java:1745)
	at org.roaringbitmap.longlong.Roaring64NavigableMap.ensureCumulatives(Roaring64NavigableMap.java:553)
	at org.roaringbitmap.longlong.Roaring64NavigableMap.getLongCardinality(Roaring64NavigableMap.java:278)
	at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.waitForFlush(ShuffleBufferManagerTest.java:664)
	at org.apache.uniffle.server.buffer.ShuffleBufferManagerTest.bufferSizeTest(ShuffleBufferManagerTest.java:456)

In conclusion, I suggest to use lock or clone to avoid race condition, and we can use try catch to ignore this kind exception for testing.

maobaolong avatar Sep 07 '24 01:09 maobaolong

I suggest to use lock or clone to avoid race condition

Seems fine.

we can use try catch to ignore this kind exception for testing

I think this is more of a hack than a fix.

rickyma avatar Sep 08 '24 07:09 rickyma

I think this is more of a hack than a fix.

I'm not sure how many days and how difficult to solve this issue. As it is for testing purpose, do you think a fail and retry could be a simple way to make the test passed?

As motioned, the Roaring64NavigableMap is not thread-safe, and it is not worth to add lock for production code only to make test code avoid race.

maobaolong avatar Sep 09 '24 02:09 maobaolong

You can refactor the test cases, or maybe just disable the tests.

rickyma avatar Sep 09 '24 03:09 rickyma