improve deadlock detection around slow/unresponsive replies
The problem: consider the case where multi/exec or lua transactions fetch large bulks of data and their commands get stuck while sending the replies (blocked on socket send). If these transactions are still in the tx queue, then the whole queue cannot progress. In the most acute scenarios this can lead to client-initiated deadlocks. See https://github.com/dragonflydb/dragonfly/issues/4182 for example.
It is easy to simulate with pipeline_queue_limit=10 by running multiple GETs on several very large keys in pipeline mode, together with another connection running multi-exec on the same keys. Once these keys are locked and the GETs are placed into the tx queue, we may create a deadlock because the pipelined connection won't be able to progress, and it will stall Dragonfly globally.
We have tx_queue_warning_len, which helps identify these scenarios, but it's too noisy because the tx queue length can grow for valid reasons.
Solution: maintain a timer for a multi-hop transaction per shard queue. We will identify a problematic scenario based on two signals: how long the transaction has been at the head of the tx queue, and the queue length.
- The first milestone would be just to track the problematic state and reduce the noisiness of this warning.
- I am sure it is possible to recognise the multi-exec transaction state where it has finished its current command but still resides in the queue because of the next commands. This would provide even more precise identification that can be added to the warning.
- With https://github.com/dragonflydb/dragonfly/pull/4330 we also track the send delay, which could potentially lead to a self-healing mechanism that force-closes connections that are stuck. As a matter of fact, this can be useful for other scenarios like pubsub. See #4182 for example.
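To make the two-signal idea above concrete, here is a minimal sketch of what a per-shard check could look like. It is not taken from the Dragonfly codebase: `QueueHeadMonitor`, `Observe`, `IsSuspicious` and both thresholds are hypothetical names chosen for illustration, and the sketch assumes each queued transaction can be identified by a numeric id.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <optional>

// Hypothetical per-shard helper: remembers which transaction is at the head
// of the tx queue and since when. Not an actual Dragonfly class.
class QueueHeadMonitor {
 public:
  using Clock = std::chrono::steady_clock;

  QueueHeadMonitor(Clock::duration max_head_age, std::size_t max_queue_len)
      : max_head_age_(max_head_age), max_queue_len_(max_queue_len) {}

  // Call whenever the shard inspects its queue.
  // head_txid is std::nullopt when the queue is empty.
  void Observe(std::optional<uint64_t> head_txid, Clock::time_point now) {
    if (head_txid != head_txid_) {
      // The head changed (or the queue drained), so restart the timer.
      head_txid_ = head_txid;
      head_since_ = now;
    }
  }

  // Warn only when BOTH signals fire: the queue is long AND the same
  // transaction has been sitting at its head for too long.
  bool IsSuspicious(std::size_t queue_len, Clock::time_point now) const {
    if (!head_txid_) return false;
    return queue_len >= max_queue_len_ && (now - head_since_) >= max_head_age_;
  }

 private:
  Clock::duration max_head_age_;
  std::size_t max_queue_len_;
  std::optional<uint64_t> head_txid_;
  Clock::time_point head_since_{};
};
```

Requiring both conditions is what should cut the noise compared to a length-only warning: a long queue whose head keeps changing is making progress, while a long queue with a stale head is the pattern described in this issue.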
@romange I want to take up this enhancement. I have good experience with C++ and have used Redis at work.
It's not a good task for an external contributor. Having said that, we have a bunch of tasks that are more suitable for someone who is not familiar with the codebase: https://github.com/dragonflydb/dragonfly/issues?q=is%3Aissue%20state%3Aopen%20label%3A%22help%20wanted%22
@romange can I close this task? I believe this is handled by send_timeout