go-redis icon indicating copy to clipboard operation
go-redis copied to clipboard

Eliminate redundant dial mutex causing unbounded connection queue contention

Open LINKIWI opened this issue 6 months ago • 2 comments

The immediate symptom this PR attempts to address: during periods of transient server connectivity errors, go-redis commands time out after upwards of 60s (or more), even though the socket read/write timeouts are 3s and the context timeout on the commands is 1s. We have root caused this bug to lock contention in the connection pool's lazy dialer.

There is a mutual exclusion lock in DialHook, which allows only one server dial to occur at the same time. In the event of server connectivity errors, this causes unbounded connection queueing under highly concurrent workloads.

Consider, for example, a concurrent workload with the default dial timeout of 5s, and an unresponsive server endpoint. During this period, all dials are timing out.

  1. The connection pool is empty, or all connections are currently occupied by in-flight I/O.
  2. N commands are executed concurrently, all of which need to acquire new connections.
  3. All N commands attempt to add a connection to the pool, which lazily calls DialHook.
  4. The first connection attempt acquires the mutex, and times out after 5s.
  5. The second connection attempt is blocked on mutex acquisition, and after acquiring the lock, itself also times out after 5s. The total wall clock time that the second command was blocked is now 10s.
  6. ...
  7. This results in a cascading failure mode where, under this scenario, individual commands can occupy multiple minutes of wall clock time due to lock contention.

DialHook itself does not mutate any state in the hooksMixin. I believe the mutex is redundant, and can be eliminated. The original commit that introduced the lock attempts to fix a race condition encountered in AddHook. The mutex that guards chain should be sufficient for this purpose.

We have validated that this change fixes the unbounded queueing, and prevents the system from entering a prolonged state of not serving any useful throughput, during these periods.

I have also added a unit test to capture the regression. Without this patch, the unit test correctly fails; Ping takes successively longer on each invocation (1s, 2s, 3s, etc.).

LINKIWI avatar Aug 10 '24 14:08 LINKIWI