GEODE-10403: Fix distributed deadlock with stop gw sender
A distributed deadlock can occur when stopping the gateway sender. It arises from a race in which the stop gateway sender command, which holds the lifecycle lock, blocks indefinitely while fetching the queue size from remote peers (the ParallelGatewaySenderQueue.size() call), while at the same time a call that stores an event in the queue waits for that same lifecycle lock.
Under heavy load these two calls can deadlock and leave the system unresponsive to all traffic (get, put, ...).
To avoid this, the code that stores the event in the gateway sender queue (the AbstractGatewaySender.distribute() call) no longer waits for the lifecycle lock without a timeout; it attempts the acquire with a timeout instead. If the attempt returns false, it checks whether the gateway sender is running. If the sender is not running, the event is dropped and the lock is not needed. Otherwise, the acquire is retried until it succeeds or the gateway sender is stopped.
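The retry logic described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `SenderSketch`, `LOCK_TIMEOUT_MS`, the `running` flag, and the use of a `ReentrantReadWriteLock` (read lock for `distribute()`, write lock for `stop()`) are all assumptions made for the example, and the timeout value is arbitrary.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for a gateway sender; not Geode's AbstractGatewaySender.
class SenderSketch {
  // Assumed retry interval; the description does not state the real value.
  private static final long LOCK_TIMEOUT_MS = 100;

  private final ReentrantReadWriteLock lifecycleLock = new ReentrantReadWriteLock();
  private final ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
  private volatile boolean running = true;

  /** Returns true if the event was queued, false if it was dropped. */
  boolean distribute(String event) {
    try {
      // Timed acquire: retry until it succeeds or the sender is no longer running.
      while (!lifecycleLock.readLock().tryLock(LOCK_TIMEOUT_MS, TimeUnit.MILLISECONDS)) {
        if (!running) {
          return false; // sender stopped: drop the event, the lock is not needed
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false; // treat interruption as a dropped event in this sketch
    }
    try {
      if (!running) {
        return false; // sender stopped while we were waiting
      }
      queue.add(event);
      return true;
    } finally {
      lifecycleLock.readLock().unlock();
    }
  }

  /** Simulates the stop command: marks the sender stopped, then does work while holding the lock. */
  void stop(Runnable whileHoldingLock) {
    lifecycleLock.writeLock().lock();
    try {
      running = false;
      whileHoldingLock.run(); // stands in for the blocking remote size() call
    } finally {
      lifecycleLock.writeLock().unlock();
    }
  }

  int queuedEvents() {
    return queue.size();
  }
}
```

With an untimed `lock()` in `distribute()`, a caller would wait for as long as `stop()` holds the lifecycle lock. With the timed attempt, the caller wakes up periodically, notices the sender is no longer running, and returns without ever taking the lock.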
For all changes:
- [ ] Is there a JIRA ticket associated with this PR? Is it referenced in the commit message?
- [ ] Has your PR been rebased against the latest commit within the target branch (typically `develop`)?
- [ ] Is your initial contribution a single, squashed commit?
- [ ] Does `gradlew build` run cleanly?
- [ ] Have you written or updated unit tests to verify your changes?
- [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?