flink icon indicating copy to clipboard operation
flink copied to clipboard

[FLINK-36293][state] make sure register is closed before verification in testAsyncCancellation

Open showuon opened this issue 1 year ago • 2 comments

What is the purpose of the change

Saw this flaky testAsyncCancellation test failed with

org.apache.flink.contrib.streaming.state.RocksDBWriteBatchWrapperTest.testAsyncCancellation -- Time elapsed: 0.121 s <<< ERROR!
Sep 16 02:20:08 java.lang.Exception: Unexpected exception, expected<org.apache.flink.runtime.execution.CancelTaskException> but was<java.lang.AssertionError>
Sep 16 02:20:08 Caused by: java.lang.AssertionError: 
Sep 16 02:20:08 Expecting actual:
Sep 16 02:20:08   2
Sep 16 02:20:08 to be less than:
Sep 16 02:20:08   2 
Sep 16 02:20:08 	at org.apache.flink.contrib.streaming.state.RocksDBWriteBatchWrapperTest.testAsyncCancellation(RocksDBWriteBatchWrapperTest.java:98)

After investigation, I found this is because the slow async register.close() causes the main verification failure. Because the register.close is slow, the infinite loop will go beyond cancellationCheckInterval * 2. To make this test reliable, we can make sure the register is completely closed after 1st run of writeBatchWrapper.put, so that we can still verify we can tolerate some delay of cancellation propagation.

Brief change log

Make testAsyncCancellation test reliable by making sure register is closed before verification

Verifying this change

This change is already covered by existing tests, such as testAsyncCancellation.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no
  • If yes, how is the feature documented? (not applicable

showuon avatar Sep 22 '24 06:09 showuon

@StefanRRichter @rkhachatryan , please take a look when available. Thanks.

showuon avatar Sep 22 '24 06:09 showuon

CI report:

  • 47608dea1f0a43adb4190ef1e6444631352707a9 Azure: SUCCESS
Bot commands The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

flinkbot avatar Sep 22 '24 06:09 flinkbot

@showuon Sorry for the late, could you rebase this on the latest master branch?

reswqa avatar Jan 15 '25 02:01 reswqa

@reswqa , PR updated. Please take a look. Thanks.

showuon avatar Jan 15 '25 04:01 showuon

@flinkbot run azure

Zakelly avatar Jan 16 '25 07:01 Zakelly

@Zakelly , it's weird that the CI result is not displayed in github. I can see the result is passed here using the commit: 47608de. Thanks.

showuon avatar Jan 17 '25 01:01 showuon

Looks like it should be ok now

image

reswqa avatar Jan 17 '25 02:01 reswqa