[CURATOR-318] Threads may return different boolean values when entering the same double barrier
To my understanding, when all threads try to enter a barrier, they should either all succeed or all fail, which means their return values should be the same.
However, they can get different return values in this situation (steps to reproduce):
0. Preparation: run a ZooKeeper server and set up basic Curator connection code;
1. Prepare 3 threads: thread1, thread2, and thread3;
2. Thread1 sleeps for 20 seconds and then enters the barrier; thread2 and thread3 try to enter the barrier immediately, with the timeout set to 5 seconds;
3. Result: thread2 and thread3 return false due to the timeout, as expected, but thread1 (the sleeping one) returns true, which I think should be false too.
Possible root cause, as I observed via zkCli:
After the enter methods of thread2 and thread3 returned, their path nodes remained, so when thread1 arrived, it concluded that the other threads were still waiting, created the ready node, and returned true.
If this is not by design, it is a design defect.
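The suspected root cause can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: `DoubleBarrierModel`, `arrive`, `timeOut`, and the `cleanUpOnTimeout` flag are hypothetical names invented for this illustration, not Curator's implementation or API, and the flag is not claimed to match any actual fix. The model replays the three-thread scenario from the steps above sequentially, so no ZooKeeper server is needed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Simplified single-process model of the "enter" half of a double barrier.
// Hypothetical illustration only: this is not Curator's implementation.
class DoubleBarrierModel {
    // Stands in for the participants' child nodes under the barrier path.
    private final Set<String> memberNodes = new HashSet<>();
    private final int memberQty;
    // Assumption for comparison: whether a participant removes its node on timeout.
    private final boolean cleanUpOnTimeout;
    private boolean readyNodeExists = false;

    DoubleBarrierModel(int memberQty, boolean cleanUpOnTimeout) {
        this.memberQty = memberQty;
        this.cleanUpOnTimeout = cleanUpOnTimeout;
    }

    // A participant registers its node; the barrier becomes ready once enough nodes exist.
    boolean arrive(String id) {
        memberNodes.add(id);
        if (memberNodes.size() >= memberQty) {
            readyNodeExists = true;
        }
        return readyNodeExists;
    }

    // The participant's wait expires; the return value is what its enter call would report.
    boolean timeOut(String id) {
        if (readyNodeExists) {
            return true;
        }
        if (cleanUpOnTimeout) {
            memberNodes.remove(id);
        }
        return false;
    }

    // Replays the reported scenario; returns the results for thread1, thread2, thread3.
    static boolean[] runScenario(boolean cleanUpOnTimeout) {
        DoubleBarrierModel barrier = new DoubleBarrierModel(3, cleanUpOnTimeout);
        barrier.arrive("thread2");
        barrier.arrive("thread3");
        boolean r2 = barrier.timeOut("thread2"); // 5-second wait expires
        boolean r3 = barrier.timeOut("thread3");
        // thread1 arrives after its 20-second sleep, long after the others gave up.
        boolean r1 = barrier.arrive("thread1") || barrier.timeOut("thread1");
        return new boolean[] {r1, r2, r3};
    }

    public static void main(String[] args) {
        System.out.println("stale nodes kept: " + Arrays.toString(runScenario(false)));
        System.out.println("nodes removed on timeout: " + Arrays.toString(runScenario(true)));
    }
}
```

In this model, when the timed-out participants leave their nodes behind, thread1 counts three members and reports true while thread2 and thread3 reported false. When each participant removes its node on timeout, all three report false.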
Originally reported by shiliang, imported from: Threads may return different boolean values when entering same double barrier
- status: Open
- priority: Major
- resolution: Unresolved
- imported: 2025-01-21
I ran the test but I have no idea what I'm looking at. Please rewrite the test as a TestNG unit test with asserts, etc., that show the problem. You can use the many existing Curator tests as examples.
htuy:
I added a test for the problem. I've also opened a PR with a simple fix, which mostly resolves the problem. I believe there are still potential race conditions, but they are dramatically reduced (before, failure was essentially guaranteed, i.e. once a double barrier entrance timed out for any client, the barrier was essentially broken).