[redis] allow the client to reconnect on redis exceptions
Problem:
When workers have issues connecting to redis, exceptions are thrown from the worker's backplane. Since the backplane is used in many places throughout the worker, this can lead to the worker either crashing or, worse, getting stuck in a broken state where it stops processing actions. We want to make the server/worker more resilient to redis issues.
Goal:
A worker should handle all redis exceptions in a single place and have the ability to reconnect the client. The cluster should not become broken when redis goes down. Servers / workers should wait for redis to become available again, reconnect, and continue operating as normal. A build client should not fail because redis was unavailable. Luckily, all interactions with redis occur in RedisClient.java, so the majority of these goals should be achievable by handling issues within the client.
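As a rough illustration of that client-level handling, here is a minimal sketch (class and method names are illustrative, not the actual RedisClient.java API): every redis operation goes through one wrapper that catches connection exceptions, rebuilds the client, and retries instead of letting the exception escape into the server/worker pipeline.

```java
import java.util.function.Function;
import java.util.function.Supplier;
import redis.clients.jedis.JedisCluster;
import redis.clients.jedis.exceptions.JedisConnectionException;

// Hypothetical sketch of funneling all redis operations through one call site.
class ReconnectingRedisClient {
  private final Supplier<JedisCluster> factory;
  private JedisCluster jedis;

  ReconnectingRedisClient(Supplier<JedisCluster> factory) {
    this.factory = factory;
    this.jedis = factory.get();
  }

  // Run a redis operation; on a connection failure, wait, rebuild the client,
  // and retry rather than surfacing the exception to the rest of the worker.
  synchronized <T> T call(Function<JedisCluster, T> op) throws InterruptedException {
    while (true) {
      try {
        return op.apply(jedis);
      } catch (JedisConnectionException e) {
        Thread.sleep(1000); // redis unreachable; retry after a pause
        jedis = factory.get();
      }
    }
  }
}
```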
Changes:
- allow the client to reconnect on redis exceptions
- prevent exceptions from surfacing to other code and getting servers / workers stuck.
- print metrics on redis failures so we can track their occurrence
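For the metrics item, a minimal sketch assuming the standard Prometheus Java client (the metric name is illustrative, not necessarily what buildfarm exports):

```java
import io.prometheus.client.Counter;

class RedisMetrics {
  // Counts redis operation failures so their occurrence can be graphed and alerted on.
  static final Counter redisErrors =
      Counter.build()
          .name("redis_client_errors_total")
          .help("redis operations that failed and were retried")
          .register();
}

// inside the exception handler of the retry loop:
// RedisMetrics.redisErrors.inc();
```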
Testing:
Here's what I saw anecdotally from this change. If redis goes down in the middle of a build, the following occurs:
- server stays up and keeps trying to reconnect to redis
- worker stays up and keeps trying to reconnect to redis
- build client does not fail, but its actions make no progress.
When redis comes back the following occurs:
- server reconnects and continues normal operations
- worker reconnects and continues normal operations
- build client continues making progress and completes a successful build.
@luxe nice! I was wondering, were you able to continue a worker through a transient redis failure end to end with this PR?
Added a testing section to the PR. I've observed that the servers/workers can keep running while redis is down. However, new workers will fail to start up when redis is down. We should address that use-case as well.
Fixed. If you now start a worker while redis is unavailable, it will keep trying to establish the client on startup. When redis becomes available, the worker will complete its startup.
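A startup loop along these lines (illustrative, not the exact buildfarm code) is enough to keep a new worker waiting for redis instead of failing:

```java
import java.util.HashSet;
import java.util.Set;
import redis.clients.jedis.HostAndPort;
import redis.clients.jedis.JedisCluster;
import redis.clients.jedis.exceptions.JedisConnectionException;

class RedisStartup {
  // Keep trying to build the cluster client until redis is reachable, so a
  // worker started during an outage finishes startup once redis returns.
  static JedisCluster connectBlocking(String host, int port) throws InterruptedException {
    Set<HostAndPort> nodes = new HashSet<>();
    nodes.add(new HostAndPort(host, port));
    while (true) {
      try {
        JedisCluster cluster = new JedisCluster(nodes);
        cluster.get("startup-probe"); // hypothetical probe key; forces a real round trip
        return cluster;
      } catch (JedisConnectionException e) {
        Thread.sleep(1000); // redis still down; try again
      }
    }
  }
}
```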
Yeah I think this will definitely make it easier to deal with redis timeouts. The MatchStage hang I got due to this elasticache event might be recoverable here too, if a critical redis call raises an exception inside the MatchStage? FWIW it took about 1 hour for that update to go through, so it might not have been a standard outcome. I'm going to try this PR a bit today too:
Apr 05, 2023 1:15:22 PM build.buildfarm.worker.PipelineStage run
SEVERE: MatchStage::run(): stage terminated due to exception
redis.clients.jedis.exceptions.JedisDataException: UNBLOCKED force unblock from blocking operation, instance state changed (master -> replica?)
at redis.clients.jedis.Protocol.processError(Protocol.java:132)
at redis.clients.jedis.Protocol.process(Protocol.java:166)
at redis.clients.jedis.Protocol.read(Protocol.java:220)
at redis.clients.jedis.Connection.readProtocolWithCheckingBroken(Connection.java:389)
at redis.clients.jedis.Connection.getBinaryBulkReply(Connection.java:299)
at redis.clients.jedis.Connection.getBulkReply(Connection.java:289)
at redis.clients.jedis.Connection$1.call(Connection.java:283)
at redis.clients.jedis.Connection$1.call(Connection.java:280)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:317)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1589)
This definitely helps with some of the redis problems I was hitting, and it's working well in general. The only possible thing I thought of was backoff, which could perhaps be proposed as a followup: if there was packet loss in a big spike, backoff could help recovery. But overall this is a large improvement 🥳
Backoff is a good idea. We should do it as a followup. Our redis cluster currently takes thousands of requests per second, so when the client enters this "retry" state it already generates significantly less traffic than normal.
There are also a lot of places in buildfarm where we want to retry things, so I like the idea of having the same retry framework everywhere with configurable backoffs; e.g. all the network calls between server/worker, the queues which have been nonblocking, etc.
Yeah this would help improve some of the failure modes I've run into recently 👍! Happy to help where I can on it.
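For the backoff followup discussed above, a minimal sketch of a configurable exponential backoff that the retry loop could use (the values and class name are illustrative):

```java
import java.util.concurrent.ThreadLocalRandom;

// Exponential backoff with jitter, e.g. 100ms, 200ms, 400ms ... capped at maxMillis.
class Backoff {
  private final long baseMillis;
  private final long maxMillis;
  private int attempt = 0;

  Backoff(long baseMillis, long maxMillis) {
    this.baseMillis = baseMillis;
    this.maxMillis = maxMillis;
  }

  // Sleep for the next delay in the sequence; jitter keeps many workers
  // retrying at once from stampeding redis the moment it comes back.
  void sleep() throws InterruptedException {
    long delay = Math.min(maxMillis, baseMillis << Math.min(attempt, 20));
    attempt++;
    Thread.sleep(ThreadLocalRandom.current().nextLong(delay / 2, delay + 1));
  }

  // Call after a successful operation so the next failure starts small again.
  void reset() {
    attempt = 0;
  }
}
```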