StackExchange.Redis

`ConnectionMultiplexer.IsConnected` not detecting lost connections

Open allyn-psyonix opened this issue 3 years ago • 5 comments

I have a situation where `IsConnected` continues to return true for ~20 minutes after the connection is gone. The web service keeps issuing commands during that time, and they time out. I would expect the connection state to be updated to reflect the lost connection once timeouts start appearing.

Unfortunately this only happens in deployed environments. Locally, StackExchange.Redis detects the lost connection immediately, but when deployed to various environments it takes ~20 minutes as mentioned above.

allyn-psyonix avatar Feb 24 '22 20:02 allyn-psyonix

Is this on Linux? I ask because we've seen this before with huge socket timeouts configured at the OS level.

NickCraver avatar Feb 24 '22 22:02 NickCraver

Yes, the deployed environment is a Linux-based Docker image running in Kubernetes.

allyn-psyonix avatar Feb 24 '22 22:02 allyn-psyonix

I ran a local test using the same Dockerfile that generates instances of the service for remote deployments, running it in my local Docker. It detected the connection interruption immediately (in less than a second, I'd say).

allyn-psyonix avatar Feb 24 '22 22:02 allyn-psyonix

Going to work with our devops team to look at the rest of the deployed network configuration. Could be a timeout somewhere along the path we have configured that is super high.

allyn-psyonix avatar Feb 24 '22 22:02 allyn-psyonix

Details on the ~15min connection stalls on Linux: https://github.com/StackExchange/StackExchange.Redis/issues/1848#issuecomment-913064646

Kubernetes environments can also see connection problems for various reasons, including noisy neighbor pods, node maintenance, or Envoy sidecars intercepting network traffic. If all else fails, a packet capture might provide some insights.
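For context on where a figure like ~15 minutes can come from: a rough sketch, assuming the stall is governed by Linux's TCP retransmission limit (`net.ipv4.tcp_retries2`, default 15) with a 200 ms minimum and 120 s maximum retransmission timeout. None of these values are stated in this thread, and the function name is purely illustrative. The retransmission timeout doubles after each unacknowledged send until it hits the cap, so the total wait before the kernel gives up on the socket can be summed like this:

```python
# Assumed Linux defaults (not stated in this thread): tcp_retries2 = 15,
# minimum RTO 200 ms, maximum RTO 120 s. Integer milliseconds avoid float drift.
TCP_RETRIES2 = 15
RTO_MIN_MS = 200
RTO_MAX_MS = 120_000


def retransmission_timeout_ms(retries=TCP_RETRIES2,
                              rto_min_ms=RTO_MIN_MS,
                              rto_max_ms=RTO_MAX_MS):
    """Sum the waits before the kernel declares the connection dead.

    There are retries + 1 waits: one after the original send and one
    after each retransmission. The wait doubles each time, capped at
    rto_max_ms.
    """
    total = 0
    rto = rto_min_ms
    for _ in range(retries + 1):
        total += rto
        rto = min(rto * 2, rto_max_ms)
    return total


total_ms = retransmission_timeout_ms()
print(f"{total_ms / 1000:.1f} s (~{total_ms / 60000:.1f} min)")
# prints "924.6 s (~15.4 min)"
```

This is a lower bound under those assumptions; the real retransmission timeout is derived from the measured round-trip time, so the effective stall can be longer. If this is the cause, it would also be consistent with an abruptly dropped path (no FIN or RST reaching the client) behaving differently from a local test where the peer closes the socket cleanly.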

philon-msft avatar Feb 24 '22 22:02 philon-msft

Best info we have is above - closing out here to clean up.

NickCraver avatar Aug 21 '22 14:08 NickCraver

Update: a new version 2.7.10 has been released, including #2610 to detect and recover stalled sockets. This should help prevent the situation where connections can stall for ~15 minutes on Linux clients.

philon-msft avatar Dec 12 '23 17:12 philon-msft