Topology refresh on consistent timeout

Open GilboaAWS opened this issue 2 years ago • 2 comments

Bug Report

Current Behavior

While working with Lettuce against a Redis cluster, when one of the nodes gets stuck but doesn't crash (e.g. the process is paused by attaching gdb), the node stops replying, which leads to operation timeouts. In this case the other nodes mark the node as PFAIL/FAIL, but Lettuce has no idea about it. None of the topology refresh options, neither the periodic nor the adaptive one, is triggered by timeouts. The closest adaptive trigger is PERSISTENT_RECONNECTS, but in this case the connection watchdog sees everything as OK, because the TCP connection is still alive in the kernel, which keeps buffering data for the stuck Redis node.

I know timeouts can occur for many reasons, e.g. a low command timeout combined with a huge key-value, or simply an unreasonable command timeout, but I think a timeout-based refresh is something that should be configurable.

Expected behavior/code

A topology refresh upon timeouts

Environment

  • Lettuce version(s): 6.0.5.RELEASE
  • Redis version: 6.2.5

Possible Solution

An option to trigger a topology refresh upon timeouts: add a mechanism that counts the number of timeouts within a configurable period of time and triggers an adaptive topology refresh if the count exceeds a configurable threshold.
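
To make the proposal concrete, here is a minimal sketch of such a counter. It is not part of Lettuce: the class name `TimeoutRefreshTrigger`, its constructor parameters, and the idea of passing the refresh as a `Runnable` are all assumptions made for illustration. It keeps the timestamps of recent timeouts in a sliding window and runs the supplied action once the count in the window exceeds the threshold.

```java
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

/** Hypothetical helper (not a Lettuce class): counts timeouts in a sliding window. */
final class TimeoutRefreshTrigger {

    private final int threshold;            // refresh when more than this many timeouts are in the window
    private final long windowNanos;         // length of the sliding window
    private final Runnable refreshAction;   // assumed to request a topology refresh
    private final LongSupplier nanoClock;   // injectable clock, so the logic can be tested
    private final Deque<Long> timeouts = new ArrayDeque<>();

    TimeoutRefreshTrigger(int threshold, Duration window, Runnable refreshAction) {
        this(threshold, window, refreshAction, System::nanoTime);
    }

    TimeoutRefreshTrigger(int threshold, Duration window, Runnable refreshAction,
                          LongSupplier nanoClock) {
        if (threshold < 0) {
            throw new IllegalArgumentException("threshold must not be negative");
        }
        this.threshold = threshold;
        this.windowNanos = window.toNanos();
        this.refreshAction = refreshAction;
        this.nanoClock = nanoClock;
    }

    /**
     * Record one command timeout.
     *
     * @return true if this call pushed the count over the threshold and ran the refresh action
     */
    boolean recordTimeout() {
        boolean fire;
        synchronized (timeouts) {
            long now = nanoClock.getAsLong();
            // Forget timeouts that have fallen out of the window.
            while (!timeouts.isEmpty() && now - timeouts.peekFirst() > windowNanos) {
                timeouts.pollFirst();
            }
            timeouts.addLast(now);
            fire = timeouts.size() > threshold;
            if (fire) {
                // Start counting from scratch so one burst causes a single refresh.
                timeouts.clear();
            }
        }
        if (fire) {
            // Run outside the lock so a slow refresh does not block other callers.
            refreshAction.run();
        }
        return fire;
    }
}
```

As an application-side workaround, `recordTimeout()` could be called wherever a `RedisCommandTimeoutException` is caught, with `refreshAction` invoking `RedisClusterClient.refreshPartitions()`. That wiring is an assumption about how a caller might use the sketch, not existing Lettuce behavior; a built-in adaptive trigger would presumably hook into the client internally instead.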

GilboaAWS, Jan 25 '22 15:01