conductor Lock system causes slowness issues

Hello all, we are currently using Conductor as mono-instance. Because of some issues with tasks executed twice, we have decided to use the Lock mechanism as suggest in the user guide/documentation to see if it was effectively solving it. We decided to use the LocalOnlyLock.

However, once introduced our workflow execution rate drop drastically. This happens whenever we have a lot of sub-workflows containing synchronous system tasks (for example JQ_TASK). We have our own implementation of sub-workflow kept as a synchronous system task.

In this scenario, trying to debug our application, it seems that once the last task of the sub-workflow is completed the decider tries to "decide" on the parent workflow. However, the parent workflow is still locked so we have a sort of deadlock. The second lock at some point will timeout (thanks to the timeToTry), and it will give back the hand to the parent workflow that will release the lock. At this point, the parent workflow will only go further whenever the Sweeper will redecide it.

We also realize that we have another issue when we have a task failing in the workflow. Similarly, whenever the decider chose to fail a task as FAILED_WITH_TERMINAL_ERROR, it will raise a TerminateWorkflowException. Therefore, a second lock will be requested on the same workflow, and this causes to slow down the failure of the workflow until the first lock expires.

I was wondering if this is something that you have experienced as well. If yes, do you have any idea how could we fix this?

Thank you

Nov 09 '20 10:11 andrea11

@andrea11 Apparently, the Semaphore implementation in LocalOnlyLock is not re-entrant, which explains the delays in obtaining the lock and thus evaluating the workflows. Replacing Semaphore with Java's import java.util.concurrent.locks.ReentrantLock might resolve this issue. Marking this as a bug, but if you're blocked by this and could get to this before us, please evaluate and send us a PR. Thanks for reporting.

Nov 10 '20 01:11 kishorebanala

Thank you @kishorebanala for your quick reply. I tried to simply replace the semaphore with the ReentrantLock, as you suggested. However, I think that this is not working fine, because of the mechanism to release lock in the future: today, whenever you request a lock with a release timeout, there is a thread that is scheduled to release it in the future. However, with reentrantLock, you cannot release the lock from another Thread. Maybe a solution could be to keep track of the latest Thread for each semaphore, and every time a lock is requested, check if the requester Thread has already obtained the lock. In this case, we could skip the request of the lock. What do you think? Do you have any other ideas?

Nov 12 '20 08:11 andrea11

However, with reentrantLock, you cannot release the lock from another Thread.

@andrea11 Good point!

Maybe a solution could be to keep track of the latest Thread for each semaphore, and every time a lock is requested, check if the requester Thread has already obtained the lock.

How do you suggest doing this?

Nov 17 '20 00:11 kishorebanala

conductor conductor copied to clipboard

Lock system causes slowness issues

conductor
conductor copied to clipboard