
Peer list update bug in K8s cluster

Open MAXEE998 opened this issue 2 years ago • 3 comments

We ran a three-replica gubernator setup in our K8s cluster. When K8s gracefully shut down one pod, one of the remaining pods (only one, not both) kept logging this error:

level=error msg="Error in client.GetPeerRateLimits" batchTimeout=500ms category=gubernator error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" queueLen=2

Apparently, it didn't update its peer list after the shutdown. What could be causing this?

MAXEE998 avatar Sep 12 '23 22:09 MAXEE998

According to the log, the problematic pod keeps trying to get rate limits from the shut-down peer:

rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 100.103.255.29:81: i/o timeout"

MAXEE998 avatar Sep 13 '23 20:09 MAXEE998

I don't run a k8s cluster, so I really don't have a way to test this. I rely on the community to provide support for k8s.

thrawn01 avatar Sep 25 '23 16:09 thrawn01

FYI, this isn't limited to k8s. We run on ECS and see something similar. These logs seem to coincide with our deployments.

time="2023-10-25T23:47:56Z" level=error msg="error sending global hits to '10.0.37.143:9990'" category=gubernator error="Error in client.GetPeerRateLimits: rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 10.0.37.143:9990: connect: connection refused\""

I need to do some research on my end to see whether it's a bug in our service or in this library.

miparnisari avatar Oct 26 '23 19:10 miparnisari