eclipse.platform.ui icon indicating copy to clipboard operation
eclipse.platform.ui copied to clipboard

Fix race condition in AbstractReconciler causing test timeout #2708

Open vogella opened this issue 1 month ago • 7 comments

Summary

  • Fixed race condition in AbstractReconciler.BackgroundWorker.reset() causing intermittent test timeouts
  • Reordered operations to ensure proper synchronization between flags and thread notifications
  • Resolves issue #2708

Problem

The test FastAbstractReconcilerTest#testReplacingDocumentWhenClean() was randomly failing on macOS (and other platforms) with a timeout error. The test expects the reconciler to complete within 5 seconds, but the reconciler thread never executed the process() method that would trigger the test's barrier synchronization.

Root Cause

The race condition occurred due to improper ordering of operations in the BackgroundWorker.reset() method. The code uses two separate locks:

  1. The BackgroundWorker instance lock (for fIsDirty and fReset)
  2. The fDirtyRegionQueue lock (for notifications and waitFinish)

Old sequence:

  1. Set fIsDirty=true and fReset=true (under BackgroundWorker lock)
  2. Notify fDirtyRegionQueueThread could wake up here!
  3. Call informNotFinished()aboutToWork()
  4. aboutToWork() calls signalWaitForFinish() which sets waitFinish=true

The reconciler thread could wake up from the notification in step 2 before waitFinish was set to true in step 4, causing it to enter a full delay period (fDelay milliseconds) instead of processing immediately.

Solution

Move the informNotFinished() call to after setting the flags but before the final queue notification:

New sequence:

  1. Set fIsDirty=true and fReset=true (under BackgroundWorker lock)
  2. Call informNotFinished()aboutToWork()signalWaitForFinish()
  3. signalWaitForFinish() sets waitFinish=true and notifies queue
  4. Final notifyAll() on queue (redundant but harmless)

This ensures that when the reconciler thread wakes up, both fIsDirty and waitFinish are already properly set, eliminating the race condition.

Testing

  • ✅ Compiled successfully with mvn clean compile -Pbuild-individual-bundles
  • ✅ Code preserves existing behavior while ensuring proper synchronization
  • ✅ Should resolve the intermittent timeout in FastAbstractReconcilerTest.testReplacingDocumentWhenClean

Related Issues

Fixes #2708

🤖 Generated with Claude Code

vogella avatar Oct 25 '25 11:10 vogella

Test Results

 3 018 files  ±0   3 018 suites  ±0   2h 27m 28s ⏱️ + 9m 22s  8 234 tests ±0   7 985 ✅ ±0  249 💤 ±0  0 ❌ ±0  23 622 runs  ±0  22 828 ✅ ±0  794 💤 ±0  0 ❌ ±0 

Results for commit 75927f07. ± Comparison against base commit 9c171a04.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Oct 25 '25 12:10 github-actions[bot]

@HeikoKlare can you review? This is not only a test change and I have not worked in this area before so it would be good to get your opinion. All tests are passing so I think this is a good point in time to review the change

vogella avatar Oct 25 '25 12:10 vogella

I have to admit that I have no clue about this ode. The provided analysis for the root cause seems sound, but I am not enough into the code to judge whether this a proper fix.

I took a look into history and found that a change some years ago introduced an additional synchronization which is related to the race condition: https://github.com/eclipse-platform/eclipse.platform.ui/commit/5dafae846e9cba2e76b26ee46cd5431f59d7cac3

And I also see that @laeubi has made some changes in the recent past on this part of the code, so maybe he can make a better assessment?

HeikoKlare avatar Oct 28 '25 09:10 HeikoKlare

@mickaelistria did you see this code during your work on the generic editor? If yes, can you have a look?

vogella avatar Nov 04 '25 13:11 vogella

Planning to convert that in a regular PR early next release and potentially applying it if case we do not find issues with it.

vogella avatar Nov 10 '25 07:11 vogella

I am also not familiar enough with the AbstractReconciler code to make a relevant comment; but form what I read in the description here, this change seems good. I think now is a good time to try it to leave several weeks for testing and -if necesssary fixing/reverting.

mickaelistria avatar Nov 10 '25 08:11 mickaelistria

How is fDirtyRegionQueue related to the notifyNotfinshed? Has anyone proven the race condition e.g. in a debugger? Or is this just guessed by the AI?

In case of race condition I would expect to enhance the lock instead of reordering method calls...

laeubi avatar Nov 10 '25 08:11 laeubi